
After SFT or DPO, the GGUF exported from the merged LoRA model gives poor answers on Ollama #6020

Closed
1 task done
NeilL0412 opened this issue Nov 13, 2024 · 7 comments · Fixed by #6899
Labels
good first issue (Good for newcomers) · solved (This problem has been already solved)

Comments

@NeilL0412

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.35
  • Python version: 3.11.0
  • PyTorch version: 2.4.1+cu121 (GPU)
  • Transformers version: 4.45.2
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: GRID A100X-40C

Reproduction

The model used is Qwen/Qwen2.5-1.5B-Instruct.

  • SFT
    The model went through 6 rounds of SFT fine-tuning, 30 epochs each. The answers from the merged LoRA model are acceptable, but after converting it to GGUF the answer quality is much worse.
    Below is the output of the merged LoRA model:
    [screenshot: 微信图片_20241113232052]

Below is the exported GGUF running on Ollama:
[screenshot: 微信图片_20241113231955]

The fine-tuning parameters were roughly as follows:

llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template qwen \
    --flash_attn auto \
    --dataset_dir data \
    --dataset it_qa \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 30.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 20 \
    --optim adamw_torch \
    --packing False \
    --mask_history True \
    --report_to none \
    --output_dir saves/Qwen2.5-1.5B-Instruct/lora/train_2024-11-13-23-25-43 \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --adapter_name_or_path saves/Qwen2.5-1.5B-Instruct/lora/Qwen2.5-1.5b-FT5 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all

Parameters for merging the LoRA adapter:

### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
adapter_name_or_path: /app/neil/llm/LLaMA-Factory/saves/Qwen2.5-1.5B-Instruct/lora/Qwen2.5-1.5b-FT6
template: qwen
finetuning_type: lora

### export
export_dir: mymodel/qwen2.5/qwen2.5-1.5b/FT6/lora
export_size: 2
export_device: gpu
export_legacy_format: false
  • DPO
    After the first 6 rounds of SFT fine-tuning, the 7th round was DPO reinforcement learning.
    Below is the merged LoRA model's answer:
    [screenshot: 微信图片_20241113232052]

Below is the GGUF running on Ollama; it always answers in this multiple-choice Q&A format:
[screenshot: 微信图片_20241113235131]

Below are the DPO training parameters:

llamafactory-cli train \
    --stage dpo \
    --do_train True \
    --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template qwen \
    --flash_attn auto \
    --dataset_dir data \
    --dataset it_qa_dpo \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 30.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 20 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/Qwen2.5-1.5B-Instruct/lora/train_2024-11-13-23-25-43 \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --adapter_name_or_path saves/Qwen2.5-1.5B-Instruct/lora/Qwen2.5-1.5b-FT6 \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all \
    --pref_beta 0.1 \
    --pref_ftx 0 \
    --pref_loss sigmoid

The export parameters are about the same as for SFT, so I won't paste them here (a rough sketch of the export and GGUF conversion steps is below).
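
For reference, here is a minimal sketch of how the merge/export and GGUF conversion steps can be run. The YAML file name, output paths, and the llama.cpp checkout location are placeholders, and the convert script name can differ between llama.cpp versions, so treat this as an outline rather than the exact commands used here:

# Merge the LoRA adapter into the base model using the export config above
# (assumes the YAML above is saved as merge_lora.yaml)
llamafactory-cli export merge_lora.yaml

# Convert the merged Hugging Face checkpoint to GGUF with llama.cpp
python llama.cpp/convert_hf_to_gguf.py mymodel/qwen2.5/qwen2.5-1.5b/FT6/lora \
    --outfile qwen2.5-1.5b-ft6-f16.gguf \
    --outtype f16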

Could anyone take a look at what is going on here, and share the possible causes of this kind of problem and how to fix it? Thanks.

Expected behavior

No response

Others

No response

The github-actions bot added the "pending" (This problem is yet to be addressed) label on Nov 13, 2024
@qq1273834091

I have the same problem and I'm not sure whether it's something I did wrong. Inference with vLLM works fine, but after generating the GGUF file and importing it into Ollama, the answers are always off.

@xjtupy

xjtupy commented Jan 14, 2025

(quotes @qq1273834091's comment above)

I'm running into the same problem. Did you find a solution?

@NeilL0412
Author

(quotes the exchange above)

Found the problem: it's the template. Whatever template you used during fine-tuning is the template you have to write into the Ollama Modelfile. I fine-tuned Qwen, so you can use the following as a reference (I haven't looked closely at the stuff in the comments).

FROM ./Qwen2.5-3B-Instruct-F16.gguf

TEMPLATE """
{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}
"""

### Tuning ##

PARAMETER num_ctx 32768

### min_p sampling ##
# min_p works best with a bit of temperature
PARAMETER temperature 0.2
# 1.0 disables top_p, so we can use min_p
PARAMETER top_p 1.0
PARAMETER min_p 0.9
### min_p sampling ##

PARAMETER num_batch 1024
PARAMETER num_keep 256

#  64 fits RYS-XLarge-72b IQ4_XS at 21k
#  PARAMETER num_batch 64

## For codegen ##
#  PARAMETER num_keep 512
#  PARAMETER num_keep 1024
#  PARAMETER top_p 0.9 # default
#  PARAMETER top_k 20 # default
#  PARAMETER repetition_penalty 1.05 # default
#  PARAMETER presence_penalty 0.2
#  PARAMETER frequency_penalty 0.2
#  PARAMETER repeat_last_n 50

# VRAM increased by:
# num_batch
# num_ctx
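
For completeness, a quick way to register and smoke-test this Modelfile in Ollama (the model name below is arbitrary):

# Build an Ollama model from the Modelfile above
ollama create qwen2.5-1.5b-ft6 -f Modelfile

# Ask one question to confirm the chat template is being applied correctly
ollama run qwen2.5-1.5b-ft6 "你好,请简单介绍一下你自己"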

@xjtupy
Copy link

xjtupy commented Jan 16, 2025

(quotes the exchange above, including the Qwen Modelfile)

I fine-tuned Meta-Llama-3.1-8B-Instruct with the llama3 template, so for the Modelfile I can just copy the llama3 template directly, right?

@hiyouga added the "good first issue" and "solved" labels and removed the "pending" label on Jan 16, 2025
@NeilL0412
Author

(quotes the exchange above, including the Qwen Modelfile and the question about copying the llama3 template)

Either way works, as long as the template matches the one used in training. Try a few options: look for the corresponding template code in the template module of the source as well, and you can also refer to this person's llm-templates.

@xjtupy
Copy link

xjtupy commented Jan 16, 2025

(quotes the exchange above, including the Qwen Modelfile and the previous answer)

I've tried several templates and none of them match the prediction quality of the official LLaMA-Factory code. Do you know how the official llama3 template is defined? @NeilL0412 @hiyouga

@NeilL0412
Author

(quotes the exchange above, including the Qwen Modelfile and the question about the official llama3 template)

Look it up in the Ollama library, or find the code for whatever template you used during fine-tuning in template.py in the source and convert it into the Modelfile TEMPLATE format. If that still doesn't work, I'm out of ideas; go read the docs carefully…
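
One straightforward way to see how the official llama3 template is defined is to pull the library model and dump its template and Modelfile directly (flag names as in recent Ollama releases; adjust the model tag to the variant you actually use):

# Inspect the chat template shipped with the official library model
ollama pull llama3.1
ollama show llama3.1 --template

# The full Modelfile (parameters, stop tokens, etc.) can also be dumped
ollama show llama3.1 --modelfile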
