Qwen2.5-0.5B-Instruct: garbled inference output with 8-bit quantization #3091
Comments
We'll take a look.
If you export to ONNX first and then convert to MNN, does 8-bit quantization work correctly?
Running on CUDA isn't supported yet, right? Are you running on CPU?
Model inference runs on CPU. The MNN_CUDA macro was enabled when building the main MNN project, but it was not enabled when building the MNN Android module.
Exporting to ONNX first and then converting to MNN works correctly. The commands are as follows. I created a config.json and removed the "llm_weight": "llm.mnn.weight" entry, and it runs normally.
python mnn/transformers/llm/export/llmexport.py --path pretrained_model/Qwen2.5-0.5B-Instruct --export mnn --dst_path mnn-output/qwen2.5_0.5b_instruct_onnx
mnn/build/MNNConvert --modelFile mnn-output/qwen2.5_0.5b_instruct_onnx/onnx/llm.onnx --framework ONNX --MNNModel mnn-output/qwen2.5_0.5b_instruct_onnx/llm.mnn --weightQuantBits 8 --transformerFuse=1 --allowCustomOp
./mnn/build/llm_demo mnn-output/qwen2.5_0.5b_instruct_onnx/config.json
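For reference, a minimal sketch of such a config.json (field names follow MNN's usual llm config layout; the concrete values here are assumptions, not copied from the issue):

{
  "llm_model": "llm.mnn",
  "backend_type": "cpu",
  "thread_num": 4
}

Note that the "llm_weight" entry is absent, so llm_demo presumably reads the weights embedded in the single llm.mnn file produced by MNNConvert rather than from a separate llm.mnn.weight file.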
Got it, we'll investigate.
This has been fixed.
Platform (include target platform as well if cross-compiling):
ubuntu 20.04 cuda
Exported the Qwen2.5-0.5B model with the latest MNN 3.0 release: 4-bit quantization works correctly, but 8-bit quantization produces garbled output, whether or not "precision": "fp16" is set.
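Setting "precision": "fp16" here means editing the generated config.json before running llm_demo. A minimal sketch of that edit (surrounding fields assumed, following MNN's usual llm config layout):

{
  "llm_model": "llm.mnn",
  "llm_weight": "llm.mnn.weight",
  "backend_type": "cpu",
  "precision": "fp16"
}

The 8-bit output below is garbled both with and without the "precision" line.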
########### 4bit ############
python mnn/transformers/llm/export/llmexport.py --path pretrained_model/Qwen2.5-0.5B-Instruct --export mnn --dst_path mnn-output/qwen2.5_0.5b_instruct_mnn --quant_bit 4 --mnnconvert mnn/build/MNNConvert
./mnn/build/llm_demo mnn-output/qwen2.5_0.5b_instruct_mnn/config.json
The device supports: i8sdot:0, fp16:0, i8mm: 0, sve2: 0
config path is mnn-output/qwen2.5_0.5b_instruct_mnn/config.json
Can't open file:.tempcache
Load Cache file error.
is_single_ = 1
load tokenizer
tokenizer_type = 3
load tokenizer Done
load mnn-output/qwen2.5_0.5b_instruct_mnn/llm.mnn ... Load Module Done!
Clone Decode Module Done!
main, 180, cost time: 2222.191162 ms
Prepare for resize opt Begin
Prepare for resize opt End
Fix: 1070 - Total: 1070, rate = 1.000000
main, 184, cost time: 249.036011 ms
Prepare for tuning opt Begin
Prepare for tuning opt End
main, 188, cost time: 0.010000 ms
Q: hi
A: Hello! How can I assist you today? Is there something specific you would like to know or discuss about anything in particular? I'm here to help answer questions and provide information on various topics. Please feel free to ask me any questions, and I'll do my best to help you.
############# 8bit ################
python mnn/transformers/llm/export/llmexport.py --path pretrained_model/Qwen2.5-0.5B-Instruct --export mnn --dst_path mnn-output/qwen2.5_0.5b_instruct_mnn --quant_bit 8 --mnnconvert mnn/build/MNNConvert
./mnn/build/llm_demo mnn-output/qwen2.5_0.5b_instruct_mnn/config.json (same result whether or not "precision": "fp16" is set)
The device supports: i8sdot:0, fp16:0, i8mm: 0, sve2: 0
config path is mnn-output/basemodel_0.5b_instruct_q88_300/config.json
Can't open file:.tempcache
Load Cache file error.
is_single_ = 1
load tokenizer
tokenizer_type = 3
load tokenizer Done
load mnn-output/basemodel_0.5b_instruct_q88_300/llm.mnn ... Load Module Done!
Clone Decode Module Done!
main, 180, cost time: 2159.822021 ms
Prepare for resize opt Begin
Prepare for resize opt End
Fix: 1070 - Total: 1070, rate = 1.000000
main, 184, cost time: 246.123016 ms
Prepare for tuning opt Begin
Prepare for tuning opt End
main, 188, cost time: 0.010000 ms
Q: hi
A: s
p
-ho P.
O