Context Parallel SFT Support for dataset in THD format #10688
Please fix DCO also.
Could you let me know when this will be completed? I've been really looking forward to this feature. It works in pretraining, so it is strange that it doesn't work in SFT.
beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base. Your code was analyzed with PyLint. The following annotations have been identified:
Thank you for improving NeMo's documentation!
LGTM. Thanks.
* Add context parallel support for packed dataset in THD format
* add changes with debug print
* remove debug print
* Apply isort and black reformatting
* fix cu_seqlens and cu_seqlens_padded
* cu_seqlens and cu_seqlens_padded fix
* more fix on cu_seqlens and cu_seqlens_padded
* Apply isort and black reformatting
* addressing Xiaowei's review
* addressing more review comments
* fix for the case where cp=1
* Apply isort and black reformatting
* more fix to address Xiaowei's comment
* fix the loss_mask for THD formatted data
* Apply isort and black reformatting
* fixed eos_idx[0][0] out of bounds issue
* fixed CP=1 case
* fixed thd_get_partitioned_indices assertion issue when pp=1
* Apply isort and black reformatting
* fixed data packing issue
* fixed an issue where cp>1 has different loss curves
* remove redundant check for cu_seqlens
* fixed NeMo CI failure issue due to old TE version in CI
* Apply isort and black reformatting
---------
Signed-off-by: Lifu Zhang <tomzhanglf@gmail.com>
Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
Signed-off-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>
Signed-off-by: tomlifu <tomzhanglf@gmail.com>
Co-authored-by: tomlifu <tomlifu@users.noreply.github.com>
Co-authored-by: root <root@cw-dfw-h100-001-074-012.cm.cluster>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
What does this PR do?
This PR adds context parallel (CP) support for SFT on packed datasets in THD format. It is compatible with cu_seqlens_padded in the latest cuDNN fused attention.
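To make the roles of cu_seqlens and cu_seqlens_padded concrete, here is a minimal pure-Python sketch. It is not the NeMo or Transformer Engine code from this PR. It assumes that, with CP enabled, each packed sequence is padded to a multiple of 2 * cp_size, and that each rank takes two mirrored chunks of every sequence for load balancing. The helper names `build_cu_seqlens` and `partitioned_indices` are hypothetical.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens, cp_size):
    """Return (cu_seqlens, cu_seqlens_padded) for one packed THD sample.

    Assumption: with cp_size > 1 every sequence is padded up to a
    multiple of 2 * cp_size; with cp_size == 1 no padding is added.
    """
    multiple = 2 * cp_size if cp_size > 1 else 1
    # Ceiling division rounds each length up to the next multiple.
    padded = [-(-n // multiple) * multiple for n in seq_lens]
    cu_seqlens = [0, *accumulate(seq_lens)]       # real token counts
    cu_seqlens_padded = [0, *accumulate(padded)]  # offsets in the padded layout
    return cu_seqlens, cu_seqlens_padded


def partitioned_indices(cu_seqlens_padded, cp_size, cp_rank):
    """Token indices (in the padded layout) owned by one CP rank.

    Assumption: each sequence is cut into 2 * cp_size equal chunks and
    rank r keeps chunk r and chunk (2 * cp_size - 1 - r).
    """
    if cp_size == 1:
        return list(range(cu_seqlens_padded[-1]))
    indices = []
    for start, end in zip(cu_seqlens_padded, cu_seqlens_padded[1:]):
        chunk = (end - start) // (2 * cp_size)
        for c in (cp_rank, 2 * cp_size - 1 - cp_rank):
            lo = start + c * chunk
            indices.extend(range(lo, lo + chunk))
    return indices


if __name__ == "__main__":
    cu, cu_padded = build_cu_seqlens([5, 3, 8], cp_size=2)
    print(cu, cu_padded)
    print(partitioned_indices(cu_padded, cp_size=2, cp_rank=0))
```

In this sketch cu_seqlens describes the real tokens, while cu_seqlens_padded describes where each sequence starts in the padded buffer. This may help explain why several commits in the PR adjust the two arrays separately.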
Steps to run SFT + CP + THD format:
Use scripts/nlp_language_modeling/prepare_packed_ft_dataset.py to pack the dataset into THD format at the desired sequence length.