Context Parallel SFT Support for dataset in THD format #10688

tomlifu · 2024-09-30T23:39:01Z

What does this PR do?

This PR adds context parallel (CP) support for SFT on datasets packed in THD format and is compatible with cu_seqlens_padded in the latest cuDNN fused attention.
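To make the role of cu_seqlens_padded concrete, here is a minimal sketch of how the two offset arrays could be built for one packed sample. It assumes each sequence is padded up to a multiple of 2 * cp_size so that it can be split evenly across context-parallel ranks. That padding rule and the helper name build_cu_seqlens are illustrative assumptions, not code taken from this PR.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens, cp_size):
    """Return (cu_seqlens, cu_seqlens_padded) for one packed sample.

    Hypothetical helper for illustration only. Assumes every sequence is
    padded up to a multiple of 2 * cp_size.
    """
    multiple = 2 * cp_size
    # Round each length up to the next multiple (ceiling division).
    padded_lens = [-(-n // multiple) * multiple for n in seq_lens]
    # Cumulative true lengths of the sequences.
    cu_seqlens = [0, *accumulate(seq_lens)]
    # Start offset of each sequence inside the padded buffer.
    cu_seqlens_padded = [0, *accumulate(padded_lens)]
    return cu_seqlens, cu_seqlens_padded


cu, cu_padded = build_cu_seqlens([5, 9, 3], cp_size=2)
print(cu)         # [0, 5, 14, 17]
print(cu_padded)  # [0, 8, 20, 24]
```

With cp_size=1 and even sequence lengths the two arrays coincide, which is one way to see why the CP=1 case can be handled without extra padding.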

Steps to run SFT + CP + THD format:

Prepare a packed dataset in THD format: run scripts/nlp_language_modeling/prepare_packed_ft_dataset.py to pack the dataset into THD format at the desired sequence length. For example:

python <NeMo_top_dir>/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
        model.data.train_ds.file_names=[<dataset_top_dir>/squad/1_squad_train.jsonl] \
        model.data.train_ds.max_seq_length=4096 \
        +model.context_parallel_size=2 \
        +tokenizer_path=<tokenizer_path> \
        +output_dir=<output_dir> +pack_sizes=[4096]

Run SFT on the packed dataset in THD format, using the same CP size that was specified in the previous step.
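The CP size has to match between the two steps because the padding applied at packing time depends on it. As a rough illustration, the sketch below shows a load-balanced split that is commonly used for causal attention: each padded sequence is cut into 2 * cp_size chunks, and rank r takes chunk r plus the mirrored chunk 2 * cp_size - 1 - r. The function name cp_partition_indices is hypothetical, and this is a pure-Python sketch under that assumption, not the thd_get_partitioned_indices routine mentioned in the commit log.

```python
def cp_partition_indices(cu_seqlens_padded, cp_size, cp_rank):
    """Token indices of a packed THD buffer owned by one context-parallel rank.

    Hypothetical illustration only. Assumes every padded sequence length is
    divisible by 2 * cp_size.
    """
    indices = []
    for start, end in zip(cu_seqlens_padded[:-1], cu_seqlens_padded[1:]):
        chunk, remainder = divmod(end - start, 2 * cp_size)
        if remainder:
            raise ValueError("padded sequence length must be divisible by 2 * cp_size")
        # One chunk from the front half and its mirror from the back half,
        # so every rank gets a similar share of causal-attention work.
        front = start + cp_rank * chunk
        back = start + (2 * cp_size - 1 - cp_rank) * chunk
        indices.extend(range(front, front + chunk))
        indices.extend(range(back, back + chunk))
    return indices


cu_padded = [0, 8, 20]  # two sequences padded to lengths 8 and 12
for rank in range(2):
    print(rank, cp_partition_indices(cu_padded, cp_size=2, cp_rank=rank))
```

A dataset packed for one CP size can leave sequences that are not divisible by 2 * cp_size for another, in which case this sketch raises a ValueError.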

PR Type:

New Feature
Bugfix
Documentation

xrennvidia · 2024-10-25T07:50:22Z

Please fix DCO also.

switiz · 2024-11-20T04:22:11Z

Could you let me know when it will be completed? I’ve been really looking forward to this feature. It works in pretrain, but it’s really strange that it doesn’t work in SFT.

github-actions · 2024-11-27T19:47:42Z

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.

Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:70:0: C0301: Line too long (353/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:72:0: C0301: Line too long (173/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:73:0: C0301: Line too long (156/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:79:0: C0301: Line too long (157/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:82:0: C0301: Line too long (147/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:83:0: C0301: Line too long (178/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:84:0: C0301: Line too long (138/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:85:0: C0301: Line too long (121/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:87:0: C0301: Line too long (144/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:90:0: C0301: Line too long (247/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:165:0: C0301: Line too long (125/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:174:0: C0301: Line too long (121/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:244:0: C0301: Line too long (137/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:247:0: C0301: Line too long (133/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:272:0: C0301: Line too long (146/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:277:0: C0301: Line too long (153/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:278:0: C0301: Line too long (155/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:300:0: C0301: Line too long (127/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:389:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:655:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:36:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py:526:0: C0115: Missing class docstring (missing-class-docstring)
************* Module nemo.collections.nlp.models.language_modeling.megatron_gpt_model
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:327:0: C0301: Line too long (149/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:367:0: C0301: Line too long (136/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:413:0: C0301: Line too long (126/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:460:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:791:0: C0301: Line too long (131/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1106:0: C0301: Line too long (146/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1127:0: C0301: Line too long (168/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1337:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1378:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1435:0: C0301: Line too long (140/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1616:0: C0301: Line too long (132/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1617:0: C0301: Line too long (136/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1619:0: C0301: Line too long (159/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1790:0: C0301: Line too long (128/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1810:0: C0301: Line too long (140/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1818:0: C0301: Line too long (155/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1839:0: C0301: Line too long (141/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1909:0: C0301: Line too long (125/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1936:0: C0301: Line too long (134/119) (line-too-long)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:140:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:153:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:179:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:198:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:245:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:283:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:287:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:299:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:307:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:469:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:472:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:701:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:705:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:787:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1104:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1177:12: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1226:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1255:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1448:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1583:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1591:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1600:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:1884:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2037:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2044:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:2050:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:20:0: W0611: Unused fields imported from dataclasses (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:28:0: W0611: Unused _DataFetcherWrapper imported from lightning.pytorch.loops.fetchers (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:30:0: W0611: Unused OmegaConf imported from omegaconf (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:72:0: W0611: Unused activation_to_func imported from nemo.collections.nlp.parts.utils_funcs (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:81:4: W0611: Unused megatron.core imported as core (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:82:4: W0611: Unused tensor_parallel imported from megatron.core (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:101:4: W0611: Unused init_method_normal imported from megatron.core.utils (unused-import)
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py:101:4: W0611: Unused scaled_init_method_normal imported from megatron.core.utils (unused-import)
************* Module nemo.utils.sequence_packing_utils
nemo/utils/sequence_packing_utils.py:53:0: C0301: Line too long (125/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:112:0: C0301: Line too long (127/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:121:0: C0301: Line too long (122/119) (line-too-long)
nemo/utils/sequence_packing_utils.py:122:0: C0301: Line too long (139/119) (line-too-long)
************* Module scripts.nlp_language_modeling.prepare_packed_ft_dataset
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:206:0: C0301: Line too long (157/119) (line-too-long)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:169:0: C0115: Missing class docstring (missing-class-docstring)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:175:4: C0116: Missing function or method docstring (missing-function-docstring)
scripts/nlp_language_modeling/prepare_packed_ft_dataset.py:188:0: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.46/10

Thank you for improving NeMo's documentation!

xrennvidia

LGTM. Thanks.

* Add context parallel support for packed dataset in THD format
* add changes with debug print
* remove debug print
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
* fix cu_seqlens and cu_seqlens_padded
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* cu_seqlens and cu_seqlens_padded fix
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* more fix on cu_seqlens and cu_seqlens_padded
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
* addressing Xiaowei's review
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* addressing more review comments
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* fix for the case where cp=1
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
* more fix to address Xiaowei's comment
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* fix the loss_mask for THD formated data
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
* fixed eos_idx[0][0] out of bounds issue
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* fixed CP=1 case
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* fixed thd_get_partitioned_indices assertion issue when pp=1
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
* fixed data packing issue
  Signed-off-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>
* fixed an issue where cp>1 has different loss curves
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* remove redudant check for cu_seqlens
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* fixed NeMo CI failure issue due to old TE version in CI
  Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

---------

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
Signed-off-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>
Signed-off-by: tomlifu <tomzhanglf@gmail.com>
Co-authored-by: tomlifu <tomlifu@users.noreply.github.com>
Co-authored-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>

Same commit message as above, with one additional trailer:
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

tomlifu added 4 commits August 19, 2024 16:45

Add context parallel support for packed dataset in THD format

9a2de52

Merge remote-tracking branch 'origin/main' into thd_cp_support

b1ef8f0

add changes with debug print

a11c351

remove debug print

2fd0456

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

github-actions bot added the NLP label Sep 30, 2024

tomlifu and others added 3 commits September 30, 2024 16:39

Merge branch 'main' into thd_cp_support

cf1a88d

Apply isort and black reformatting

4ad3511

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

Merge branch 'main' into thd_cp_support

43d60a3

tomlifu changed the title from "Draft: Context Parallel SFT Support for dataset in THD format" to "Context Parallel SFT Support for dataset in THD format" Oct 1, 2024

tomlifu mentioned this pull request Oct 1, 2024

adding cu_seqlens_padded to packed_seq_params.py NVIDIA/Megatron-LM#1163

Closed

xrennvidia self-requested a review October 2, 2024 17:44

xrennvidia reviewed Oct 3, 2024

nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py

tomlifu and others added 5 commits October 8, 2024 19:16

fix cu_seqlens and cu_seqlens_padded

63447f7

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

Merge branch 'thd_cp_support' of https://github.com/tomlifu/NeMo into thd_cp_support

e287857

cu_seqlens and cu_seqlens_padded fix

850d9ae

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

more fix on cu_seqlens and cu_seqlens_padded

d50d88e

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

Apply isort and black reformatting

adea017

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

github-advanced-security bot found potential problems Oct 12, 2024

nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py (fixed)

xrennvidia reviewed Oct 18, 2024

tomlifu added 2 commits October 21, 2024 16:44

addressing Xiaowei's review

8c76e48

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

addressing more review comments

78b3b1c

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

fix for the case where cp=1

8852431

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

tomlifu force-pushed the thd_cp_support branch from 0339a8f to 90feb4a Compare October 25, 2024 22:33

tomlifu requested review from pablo-garay and ko3n1g as code owners October 25, 2024 22:33

github-actions bot added the core, TTS, and ASR labels Oct 25, 2024

tomlifu and others added 7 commits November 7, 2024 19:11

Apply isort and black reformatting

ab02643

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

fixed eos_idx[0][0] out of bounds issue

2a8b21f

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

Merge branch 'thd_cp_support' of https://github.com/tomlifu/NeMo into thd_cp_support

f930f77

Merge branch 'main' into thd_cp_support

d3e9354

fixed CP=1 case

02bccd7

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

fixed thd_get_partitioned_indices assertion issue when pp=1

11d68b4

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

Apply isort and black reformatting

463a478

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

tomlifu force-pushed the thd_cp_support branch from 6e8ac3b to 463a478 Compare November 14, 2024 04:56

xrennvidia requested a review from cuichenx November 21, 2024 20:33

cuichenx reviewed Nov 22, 2024

scripts/nlp_language_modeling/prepare_packed_ft_dataset.py (outdated)

nemo/utils/sequence_packing_utils.py

root and others added 2 commits November 22, 2024 09:27

fixed data packing issue

12de6bb

Signed-off-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>

fixed an issue where cp>1 has different loss curves

cc236ba

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

xrennvidia reviewed Nov 26, 2024

nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py

tomlifu added 2 commits November 26, 2024 14:54

Merge branch 'main' into thd_cp_support

a988522

Signed-off-by: tomlifu <tomzhanglf@gmail.com>

remove redudant check for cu_seqlens

29a8dea

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

xrennvidia added Run CICD and removed Run CICD labels Nov 27, 2024

tomlifu and others added 3 commits November 27, 2024 11:28

fixed NeMo CI failure issue due to old TE version in CI

cf0e5fc

Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>

Apply isort and black reformatting

3c25b6e

Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>

Merge branch 'main' into thd_cp_support

fc42145

xrennvidia added Run CICD and removed Run CICD labels Nov 27, 2024

xrennvidia approved these changes Nov 27, 2024

pablo-garay merged commit 48349e4 into NVIDIA:main Nov 28, 2024
168 of 170 checks passed

cuichenx mentioned this pull request Jan 16, 2025

Fix nemo 1 packed sequence TE version error #11874

Merged
