
Context Parallel SFT Support for dataset in THD format #10688

Merged (32 commits) on Nov 28, 2024

Conversation

tomlifu
Collaborator

@tomlifu tomlifu commented Sep 30, 2024

What does this PR do ?

This PR adds context parallelism (CP) support for SFT datasets in THD format and is compatible with `cu_seqlens_padded` in the latest cuDNN fused attention.
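For background (an illustrative sketch only, not code from this PR): in THD packing, multiple sequences are concatenated along the token dimension, `cu_seqlens` records the cumulative lengths of the real tokens, and `cu_seqlens_padded` records the cumulative lengths including per-sequence padding. The helper below is hypothetical; it assumes each sequence is padded up to a multiple of `pad_to` (for CP, the divisibility requirement is 2 × CP size):

```python
def pack_to_thd(seqs, pad_to=8, pad_id=0):
    """Illustrative THD packing: concatenate sequences and record both the
    raw (cu_seqlens) and padded (cu_seqlens_padded) cumulative lengths."""
    tokens, cu_seqlens, cu_seqlens_padded = [], [0], [0]
    for s in seqs:
        # Round each sequence length up to a multiple of pad_to.
        padded_len = -(-len(s) // pad_to) * pad_to
        tokens.extend(s + [pad_id] * (padded_len - len(s)))
        cu_seqlens.append(cu_seqlens[-1] + len(s))
        cu_seqlens_padded.append(cu_seqlens_padded[-1] + padded_len)
    return tokens, cu_seqlens, cu_seqlens_padded
```

Packing a 5-token and a 3-token sequence with `pad_to=4` yields `cu_seqlens == [0, 5, 8]` but `cu_seqlens_padded == [0, 8, 12]`; the fused attention kernel needs both to skip the pad tokens correctly.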

Steps to run SFT + CP + THD format:

  1. Prepare the packed dataset in THD format: run scripts/nlp_language_modeling/prepare_packed_ft_dataset.py to pack the dataset to the desired sequence length. For example:
python <NeMo_top_dir>/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
        model.data.train_ds.file_names=[<dataset_top_dir>/squad/1_squad_train.jsonl] \
        model.data.train_ds.max_seq_length=4096 \
        +model.context_parallel_size=2 \
        +tokenizer_path=<tokenizer_path> \
        +output_dir=<output_dir> +pack_sizes=[4096]
  2. Run SFT on the packed dataset in THD format with the same CP size specified in the previous step.
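For intuition about why step 1 must know the CP size, here is a simplified Python sketch (hypothetical, not the actual Transformer Engine `thd_get_partitioned_indices` implementation) of the load-balanced split CP applies to each packed sequence: every padded sequence must be divisible by 2 × CP size, and rank r takes the r-th and (2·CP − 1 − r)-th chunks so that causal-attention work is balanced across ranks:

```python
def thd_partition_indices(cu_seqlens_padded, cp_size, cp_rank):
    """Sketch of load-balanced CP partitioning for THD data: each padded
    sequence is cut into 2*cp_size equal chunks; rank r takes chunks
    r and (2*cp_size - 1 - r)."""
    idx = []
    for start, end in zip(cu_seqlens_padded[:-1], cu_seqlens_padded[1:]):
        seqlen = end - start
        assert seqlen % (2 * cp_size) == 0, "padded seqlen must divide by 2*cp_size"
        chunk = seqlen // (2 * cp_size)
        for c in (cp_rank, 2 * cp_size - 1 - cp_rank):
            idx.extend(range(start + c * chunk, start + (c + 1) * chunk))
    return idx
```

With one 8-token sequence and CP=2, rank 0 gets tokens [0, 1, 6, 7] and rank 1 gets [2, 3, 4, 5]: the rank holding the early (cheap) causal chunk also holds a late (expensive) one. This is why prepare_packed_ft_dataset.py takes `+model.context_parallel_size` at packing time.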

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

@github-actions github-actions bot added the NLP label Sep 30, 2024
@tomlifu tomlifu changed the title Draft: Context Parallel SFT Support for dataset in THD format Context Parallel SFT Support for dataset in THD format Oct 1, 2024
@xrennvidia xrennvidia self-requested a review October 2, 2024 17:44
tomlifu and others added 5 commits October 8, 2024 19:16
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
@xrennvidia
Collaborator

Please also fix the DCO check.

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
@github-actions github-actions bot added core Changes to NeMo Core TTS ASR labels Oct 25, 2024
tomlifu and others added 7 commits November 7, 2024 19:11
Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
@switiz

switiz commented Nov 20, 2024

Could you let me know when it will be completed? I've been really looking forward to this feature. It works in pretraining, but strangely it doesn't work in SFT.

@xrennvidia xrennvidia requested a review from cuichenx November 21, 2024 20:33
root and others added 2 commits November 22, 2024 09:27
Signed-off-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: tomlifu <tomzhanglf@gmail.com>
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
tomlifu and others added 3 commits November 27, 2024 11:28
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
Contributor

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:70:0: C0301: Line too long (353/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:72:0: C0301: Line too long (173/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:73:0: C0301: Line too long (156/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:79:0: C0301: Line too long (157/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:82:0: C0301: Line too long (147/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:83:0: C0301: Line too long (178/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:84:0: C0301: Line too long (138/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:85:0: C0301: Line too long (121/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:87:0: C0301: Line too long (144/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:90:0: C0301: Line too long (247/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:165:0: C0301: Line too long (125/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:174:0: C0301: Line too long (121/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:244:0: C0301: Line too long (137/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:247:0: C0301: Line too long (133/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:272:0: C0301: Line too long (146/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:277:0: C0301: Line too long (153/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:278:0: C0301: Line too long (155/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:300:0: C0301: Line too long (127/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:389:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:655:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:36:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:526:0: C0115: Missing class docstring (missing-class-docstring)
************* Module nemo.collections.nlp.models.language_modeling.megatron_gpt_model
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:327:0: C0301: Line too long (149/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:367:0: C0301: Line too long (136/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:413:0: C0301: Line too long (126/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:460:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:791:0: C0301: Line too long (131/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1106:0: C0301: Line too long (146/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1127:0: C0301: Line too long (168/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1337:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1378:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1435:0: C0301: Line too long (140/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1616:0: C0301: Line too long (132/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1617:0: C0301: Line too long (136/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1619:0: C0301: Line too long (159/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1790:0: C0301: Line too long (128/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1810:0: C0301: Line too long (140/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1818:0: C0301: Line too long (155/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1839:0: C0301: Line too long (141/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1909:0: C0301: Line too long (125/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1936:0: C0301: Line too long (134/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:140:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:153:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:179:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:198:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:245:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:283:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:287:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:299:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:307:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:469:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:472:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:701:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:705:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:787:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1104:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1177:12: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1226:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1255:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1448:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1583:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1591:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1600:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1884:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2037:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2044:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2050:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:20:0: W0611: Unused fields imported from dataclasses (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:28:0: W0611: Unused _DataFetcherWrapper imported from lightning.pytorch.loops.fetchers (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:30:0: W0611: Unused OmegaConf imported from omegaconf (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:72:0: W0611: Unused activation_to_func imported from nemo.collections.nlp.parts.utils_funcs (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:81:4: W0611: Unused megatron.core imported as core (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:82:4: W0611: Unused tensor_parallel imported from megatron.core (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:101:4: W0611: Unused init_method_normal imported from megatron.core.utils (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:101:4: W0611: Unused scaled_init_method_normal imported from megatron.core.utils (unused-import)
************* Module nemo.utils.sequence_packing_utils
nemo/utils/sequence_packing_utils.py:53:0: C0301: Line too long (125/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:112:0: C0301: Line too long (127/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:121:0: C0301: Line too long (122/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:122:0: C0301: Line too long (139/119) (line-too-long)
************* Module scripts.nlp_language_modeling.prepare_packed_ft_dataset
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:206:0: C0301: Line too long (157/119) (line-too-long)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:169:0: C0115: Missing class docstring (missing-class-docstring)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:175:4: C0116: Missing function or method docstring (missing-function-docstring)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:188:0: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.46/10

Thank you for improving NeMo's documentation!

Collaborator

@xrennvidia xrennvidia left a comment


LGTM. Thanks.

@pablo-garay pablo-garay merged commit 48349e4 into NVIDIA:main Nov 28, 2024
168 of 170 checks passed
XuesongYang pushed a commit to paarthneekhara/NeMo that referenced this pull request Jan 18, 2025
* Add context parallel support for packed dataset in THD format

* add changes with debug print

* remove debug print

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* Apply isort and black reformatting

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

* fix cu_seqlens and cu_seqlens_padded

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* cu_seqlens and cu_seqlens_padded fix

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* more fix on cu_seqlens and cu_seqlens_padded

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* Apply isort and black reformatting

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

* addressing Xiaowei's review

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* addressing more review comments

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* fix for the case where cp=1

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* Apply isort and black reformatting

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

* more fix to address Xiaowei's comment

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* fix the loss_mask for THD formatted data

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* Apply isort and black reformatting

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

* fixed eos_idx[0][0] out of bounds issue

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* fixed CP=1 case

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* fixed thd_get_partitioned_indices assertion issue when pp=1

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* Apply isort and black reformatting

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

* fixed data packing issue

Signed-off-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>

* fixed an issue where cp>1 has different loss curves

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* remove redundant check for cu_seqlens

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* fixed NeMo CI failure issue due to old TE version in CI

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

* Apply isort and black reformatting

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

---------

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
Signed-off-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>
Signed-off-by: tomlifu <tomzhanglf@gmail.com>
Co-authored-by: tomlifu <tomlifu@users.noreply.github.com>
Co-authored-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Feb 10, 2025