
Adding Context Length Specialization (CCL) #401


Open · wants to merge 1 commit into main

Conversation

@quic-vjanfaza commented May 13, 2025

The Context-Length-Specialization technique optimizes the throughput of large language models (LLMs) on Qualcomm devices when handling very large context lengths. Current Ahead-Of-Time (AOT) compilation on Qualcomm devices cannot predict how many tokens a request will actually need, so attention is always computed over the full compiled context length, causing significant throughput drops during both the prefill and decode phases. To address this, we introduce Compute Context Length (CCL), an additional ONNX variable that enables dynamic context-length specialization. By generating tokens with smaller, more manageable compute context lengths, we reduce KV-cache memory reads and attention computation, thereby improving throughput.

Signed-off-by: vjanfaza <vjanfaza@qrc706r8-292-03.qualcomm.com>
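
For intuition, here is a minimal, self-contained sketch of the idea, not the code in this PR: the decoder picks the smallest compute context length that covers the tokens generated so far, and attention reads only that prefix of the KV cache. The bucket list and function names below are illustrative assumptions.

```python
import torch

# Illustrative CCL buckets (assumed, not taken from this PR): candidate
# compute context lengths, sorted ascending; the last entry equals the
# full compiled context length.
CCL_BUCKETS = [1024, 4096, 16384, 32768]

def select_ccl(seq_pos: int, buckets=CCL_BUCKETS) -> int:
    """Return the smallest compute context length covering `seq_pos` tokens."""
    for ccl in buckets:
        if seq_pos <= ccl:
            return ccl
    return buckets[-1]

def decode_attention_with_ccl(q, k_cache, v_cache, seq_pos):
    """Single-step decode attention that reads only the first `ccl` slots
    of the KV cache instead of the full context length.

    Shapes: q (B, H, 1, D); k_cache, v_cache (B, H, max_ctx, D).
    """
    ccl = select_ccl(seq_pos)
    k = k_cache[:, :, :ccl, :]   # smaller memory read than full max_ctx
    v = v_cache[:, :, :ccl, :]
    scores = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    # Mask cache slots that have not been written yet.
    invalid = torch.arange(ccl, device=q.device) >= seq_pos
    scores = scores.masked_fill(invalid.view(1, 1, 1, ccl), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

With buckets like these, a request at position 3,000 attends over 4,096 cache slots rather than 32,768, which is where the memory-read and compute savings come from.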
@quic-rishinr (Contributor) commented

Could you please run some models on the Qeff mainline and on this PR with CCL enabled and disabled? This will help us better understand the performance impact introduced by the CCL changes.
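
A rough A/B harness for such a comparison might look like the sketch below. The QEfficient calls mirror the documented high-level API, but the exact `compile()` kwargs, and especially the name of the CCL flag added by this PR (`comp_ctx_lengths` here), are assumptions rather than this PR's actual interface.

```python
import time
from transformers import AutoTokenizer
from QEfficient import QEFFAutoModelForCausalLM

MODEL = "meta-llama/Llama-3.1-8B"  # placeholder; any supported model
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def time_generation(enable_ccl: bool) -> float:
    model = QEFFAutoModelForCausalLM.from_pretrained(MODEL)
    kwargs = dict(prefill_seq_len=128, ctx_len=32768, num_cores=16)
    if enable_ccl:
        kwargs["comp_ctx_lengths"] = [4096, 16384, 32768]  # assumed flag name
    model.compile(**kwargs)
    start = time.perf_counter()
    model.generate(prompts=["<long-context prompt>"], tokenizer=tokenizer)
    return time.perf_counter() - start

for enabled in (False, True):
    label = "enabled" if enabled else "disabled"
    print(f"CCL {label}: {time_generation(enabled):.2f}s")
```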
