Reminder

System Info

Fine-tuning Qwen2-VL, I ran into the following behavior:
(1) Single-node multi-GPU training on image-text pairs or pure text, whether LoRA or full fine-tuning: works.
(2) Multi-node multi-GPU training on image-text pairs or pure text, whether LoRA or full fine-tuning: works.
(3) Single-node multi-GPU training on mixed data, LoRA, 7B: works.
(4) Single-node multi-GPU training on mixed data, full fine-tuning of the 7B model with ZeRO-3 + offload: fails.
(5) Multi-node multi-GPU training on mixed data, LoRA: fails.
(6) Multi-node multi-GPU training on mixed data, full fine-tuning with ZeRO-3 + offload: fails.

In the failing cases, training hangs as soon as it starts.

Also, since each GPU only has 32 GB of memory, ZeRO-2 cannot fit the model, so ZeRO-3 is the only option (a sketch of the kind of ZeRO-3 + offload config this involves is included after this report).

Reproduction

...

Expected behavior

No response

Others

No response
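For reference, the "full fine-tuning with ZeRO-3 + offload" runs above use a DeepSpeed stage-3 configuration with parameters and optimizer state offloaded to CPU. A minimal sketch of such a config, written here as a Python dict and dumped to JSON (illustrative only; the actual config file used for these runs may differ):

```python
import json

# Minimal DeepSpeed ZeRO-3 + CPU offload config (illustrative sketch;
# "auto" values are filled in by the HF Trainer integration at runtime).
ds_z3_offload = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

# Written out so it can be passed to the trainer via the `deepspeed` argument.
with open("ds_z3_offload_config.json", "w") as f:
    json.dump(ds_z3_offload, f, indent=2)
```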
ZeRO-3 is currently not supported with mixed data.
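A common cause of this kind of first-step hang, and likely why mixed batches are problematic under ZeRO-3: stage 3 shards every module's parameters and all-gathers them when that module's forward runs, so all ranks must execute the same modules in the same order. If one rank gets a text-only batch and skips the vision tower while another rank's batch contains images, the collectives no longer line up and every rank blocks. One common workaround pattern is to make sure every example carries at least a placeholder image so the vision encoder runs on every rank. A hypothetical sketch (the `images` field name and the placeholder size are assumptions for illustration, not LLaMA-Factory's actual fix):

```python
from PIL import Image

def ensure_placeholder_image(example: dict, size: int = 28) -> dict:
    """Attach a tiny black placeholder image to examples that have none,
    so every rank runs the vision tower under ZeRO-3 (hypothetical sketch)."""
    if not example.get("images"):
        example["images"] = [Image.new("RGB", (size, size))]
    return example
```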
Then a follow-up question: what should I do when a 72B model does not fit in a single GPU's memory? ZeRO-2 would OOM, since a single GPU cannot hold a complete copy of the model. @hiyouga
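Rough arithmetic behind the ZeRO-2 vs. ZeRO-3 trade-off raised here: ZeRO-2 shards gradients and optimizer states but keeps a full copy of the parameters on every GPU, while ZeRO-3 shards the parameters as well. For a 72B model in bf16, the full parameter copy alone is already far larger than a 32 GB card (the 16-GPU count below is purely illustrative, not from this issue):

```python
# Back-of-envelope memory estimate (parameters only; gradients, optimizer
# states and activations come on top of this).
params = 72e9               # 72B parameters
bf16_bytes = 2              # bytes per parameter in bf16

full_copy_gib = params * bf16_bytes / 1024**3
print(f"ZeRO-2: full parameter copy per GPU = {full_copy_gib:.0f} GiB")               # ~134 GiB, far above 32 GiB

world_size = 16             # illustrative GPU count
print(f"ZeRO-3: parameter shard per GPU = {full_copy_gib / world_size:.1f} GiB")      # ~8.4 GiB
```

This is why ZeRO-3 (with the mixed-data caveat discussed above) is the only practical option on 32 GB cards.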
fixed