Training Qwen2-VL on mixed data hangs on multi-node, multi-card NPU #5714

Closed
1 task done
lizhishan1997 opened this issue Oct 15, 2024 · 3 comments · Fixed by #6233
Labels
npu (This problem is related to NPU devices), solved (This problem has been already solved)

Comments

@lizhishan1997

lizhishan1997 commented Oct 15, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

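While fine-tuning Qwen2-VL, I ran into the following:
(1) Single-node multi-card training on image-text pairs or text-only data: succeeds with both LoRA and full fine-tuning.
(2) Multi-node multi-card training on image-text pairs or text-only data: succeeds with both LoRA and full fine-tuning.
(3) Single-node multi-card training on mixed data: LoRA on 7B succeeds.
(4) Single-node multi-card training on mixed data: full fine-tuning of 7B with ZeRO-3 + offload fails.
(5) Multi-node multi-card training on mixed data: LoRA fails.
(6) Multi-node multi-card training on mixed data: full fine-tuning with ZeRO-3 + offload fails.

In every failing case, training hangs right after it starts.

Also, each card has 32 GB of memory, so ZeRO-2 cannot run and ZeRO-3 is the only option.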

Reproduction

Uploading image.png…

...

Expected behavior

No response

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) and npu (This problem is related to NPU devices) labels on Oct 15, 2024
@hiyouga
Owner

hiyouga commented Oct 15, 2024

Mixed data is currently not supported with ZeRO-3.
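One plausible reason for the hang, which this thread does not confirm: ZeRO-3 shards each module's parameters across ranks and gathers them collectively when the module runs. If one rank's micro-batch has no images and therefore skips the vision encoder while another rank runs it, the ranks can end up waiting on each other. The sketch below only detects that situation in a data pipeline. The sample schema (a dict with an optional "images" key) and both function names are hypothetical, not LLaMA-Factory APIs.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical sample schema: a dict that may carry an "images" list.
Sample = Dict[str, Any]


def has_images(batch: Sequence[Sample]) -> bool:
    """True if at least one sample in the micro-batch carries an image."""
    return any(sample.get("images") for sample in batch)


def ranks_without_images(per_rank_batches: Sequence[Sequence[Sample]]) -> List[int]:
    """Return the ranks whose micro-batch is text-only while other ranks have images.

    An empty list means every rank would take the same path in this step:
    either all of them see images or none of them do.
    """
    flags = [has_images(batch) for batch in per_rank_batches]
    if all(flags) or not any(flags):
        return []
    return [rank for rank, flag in enumerate(flags) if not flag]


# One training step on two ranks: rank 0 gets an image-text pair, rank 1 gets text only.
step = [
    [{"text": "describe the picture", "images": ["a.png"]}],
    [{"text": "a text-only sample"}],
]
print(ranks_without_images(step))  # [1]
```

A non-empty result marks a step where ranks would diverge. Grouping samples so that every rank sees the same modality in a given step is one way to avoid that, at the cost of less random batching.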

hiyouga added the wontfix (This will not be worked on) label and removed the pending (This problem is yet to be addressed) label on Oct 15, 2024
hiyouga closed this as not planned on Oct 15, 2024
@lijiah33

lijiah33 commented Oct 16, 2024

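In that case, what should we do when a 72B model does not fit in a single card's memory? ZeRO-2 would run out of memory, since no single card can hold the full model. @hiyouga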
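The memory concern can be checked with a rough estimate. The sketch below uses the common accounting for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. It ignores activations, buffers, and CPU offload, so real usage is higher. The function name and the device counts (8 and 16) are assumptions for illustration; only the 32 GB card size and the 7B and 72B model sizes come from this thread.

```python
def zero_model_state_gb(num_params: float, world_size: int, stage: int) -> float:
    """Approximate per-device model-state memory in GB (1e9 bytes).

    Bytes per parameter before sharding: 2 (fp16 weights) + 2 (fp16 gradients)
    + 12 (fp32 master weights, momentum, variance for Adam).
    Stage 1 shards optimizer states, stage 2 also shards gradients,
    stage 3 also shards the weights themselves.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= world_size
    if stage >= 2:
        grads /= world_size
    if stage >= 3:
        weights /= world_size
    return num_params * (weights + grads + optim) / 1e9


# 7B parameters on an assumed 8 devices.
print(zero_model_state_gb(7e9, 8, stage=2))    # 26.25
print(zero_model_state_gb(7e9, 8, stage=3))    # 14.0

# 72B parameters on an assumed 16 devices.
print(zero_model_state_gb(72e9, 16, stage=2))  # 207.0
print(zero_model_state_gb(72e9, 16, stage=3))  # 72.0
```

Under these assumptions, ZeRO-2 leaves about 26 GB of model state per card for a 7B model, which leaves little room for activations on a 32 GB device. For 72B, ZeRO-2 keeps a full 144 GB copy of the fp16 weights on every card, so it cannot fit regardless of how many cards are used.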

hiyouga added the pending (This problem is yet to be addressed) label and removed the wontfix (This will not be worked on) label on Oct 16, 2024
hiyouga reopened this on Oct 16, 2024
hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label on Dec 4, 2024
@hiyouga
Owner

hiyouga commented Dec 4, 2024

fixed

3 participants