AO/GemLite tensors produce incorrect outputs in vLLM #2141
Labels: integration, quantize, triaged
This is a follow-up to #2096
Exported AO/GemLite models work correctly with Transformers but produce incorrect tokens when used with vLLM. I suspect that the QKV merging, which involves a call to the `.narrow()` method, is not handled properly. However, we have already double-checked the slicing operation, and it appears to be correct.
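For reference, here is a minimal sketch of the kind of check described above, written with plain `torch` tensors rather than AO/GemLite tensor subclasses. It assumes the fused QKV parameter is filled shard by shard through `narrow()` views; the helper `load_qkv_shards` and all sizes are hypothetical and are not taken from vLLM or torchao.

```python
import torch


def load_qkv_shards(fused, shards, sizes, dim=0):
    """Hypothetical helper: copy each shard into its slice of the fused buffer."""
    offset = 0
    for shard, size in zip(shards, sizes):
        # narrow() returns a view on a plain tensor, so copy_() writes into `fused`.
        fused.narrow(dim, offset, size).copy_(shard)
        offset += size
    return fused


torch.manual_seed(0)
hidden, q_size, kv_size = 8, 8, 4  # illustrative sizes only

# Separate Q, K and V projection weights, as stored in the checkpoint.
q = torch.randn(q_size, hidden)
k = torch.randn(kv_size, hidden)
v = torch.randn(kv_size, hidden)

# Fused parameter along the output dimension.
fused = torch.empty(q_size + 2 * kv_size, hidden)
load_qkv_shards(fused, [q, k, v], [q_size, kv_size, kv_size])

# The fused projection should match the three separate projections.
x = torch.randn(2, hidden)
merged_out = x @ fused.t()
separate_out = torch.cat([x @ q.t(), x @ k.t(), x @ v.t()], dim=-1)
qkv_matches = torch.allclose(merged_out, separate_out, atol=1e-5)
print(qkv_matches)
```

With plain tensors this comparison passes. Running the same comparison with the quantized tensor subclass in place of `fused` may help tell apart a slicing problem (wrong offsets or sizes on packed data, scales, or zero points) from a case where `narrow()` returns a copy, so that the subsequent `copy_()` never reaches the fused parameter.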