AO/GemLite tensors produce incorrect outputs in vLLM #2141

Open
mobicham opened this issue Apr 28, 2025 · 0 comments
Labels
integration Issues related to integrations with other libraries, like huggingface, vllm, sglang, gemlite etc. quantize triaged

Comments

@mobicham (Collaborator)

This is a follow-up to #2096

Exported AO/GemLite models work correctly with Transformers but produce incorrect tokens when used with vLLM. I suspect the QKV merging is not handled properly; it involves a call to the .narrow() method. However, we have already double-checked the slicing operation, and it appears to be correct.
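For context, the slicing we double-checked mirrors how a merged QKV weight is split back into its q/k/v parts along dim 0 via .narrow(). A minimal sketch with dummy shapes (hidden size and per-projection sizes below are made up for illustration, not the real model's):

```python
import torch

# Assumed toy sizes: hidden=16, with q/k/v output rows 8/4/4 in the merged tensor.
hidden = 16
q_size, k_size, v_size = 8, 4, 4

qkv = torch.arange((q_size + k_size + v_size) * hidden, dtype=torch.float32).reshape(-1, hidden)

# Split the merged tensor the same way the QKV merging logic would.
q = qkv.narrow(0, 0, q_size)
k = qkv.narrow(0, q_size, k_size)
v = qkv.narrow(0, q_size + k_size, v_size)

# The three slices cover the merged tensor exactly, with no overlap or gap.
assert torch.equal(torch.cat([q, k, v], dim=0), qkv)
```

For a plain dense tensor this slicing is trivially correct, which is why the bug is more likely in how .narrow() interacts with the packed AO/GemLite tensor subclass than in the offsets themselves.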

#pip install git+https://github.com/mobiusml/gemlite --upgrade;
#VLLM_USE_V1=0 TRITON_PRINT_AUTOTUNING=1 ipython3 ... #Make sure to disable V1!

import torch
from vllm import LLM
from vllm.sampling_params import SamplingParams

#model_id = "mobicham/llama3.1_8b_instruct_torchao_gemlite_4bitgs64"
model_id = "mobicham/Phi-4-mini-instruct_torchao_gemlite_4bitgs64"
#model_id = "mobicham/Llama-3.2-3B-Instruct_torchao_gemlite_4bitgs64"

llm = LLM(model=model_id, gpu_memory_utilization=0.9, dtype=torch.float16, max_model_len=2048)
sampling_params = SamplingParams(temperature=0., top_k=1, max_tokens=1024)

batch_size = 1
prompts = ["Describe the impact of artificial intelligence on society."]
outputs = llm.generate(prompts * batch_size, sampling_params)
print(outputs[0].outputs[0].text)
#The text printed above contains incorrect tokens (unlike the Transformers output)
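To localize where the vLLM output diverges from the Transformers baseline, comparing the two greedy-decoded token sequences and reporting the first differing position is useful. A sketch with dummy token lists (first_divergence is a hypothetical helper, not part of either library):

```python
def first_divergence(ref_tokens, test_tokens):
    """Return the index of the first differing token, or -1 if the
    compared prefixes are identical."""
    for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
        if a != b:
            return i
    return -1

# Dummy greedy decodes: baseline (Transformers) vs. suspect output (vLLM).
ref = [1, 42, 7, 99, 3]
test = [1, 42, 8, 10, 2]
print(first_divergence(ref, test))  # → 2
```

With temperature=0 and top_k=1 both runs are deterministic, so an early divergence index (e.g. at the very first generated token) would point at the projection weights themselves rather than at accumulated numerical drift.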
@jerryzh168 jerryzh168 added integration Issues related to integrations with other libraries, like huggingface, vllm, sglang, gemlite etc. quantize triaged labels Apr 29, 2025