adding Context Length Specialization (CCL) #401

quic-vjanfaza · 2025-05-13T16:03:09Z

Context Length Specialization aims to improve the throughput of large language models (LLMs) on Qualcomm devices when handling very long context lengths. The current ahead-of-time (AOT) compilation on Qualcomm devices does not predict the number of tokens needed, which leads to significant throughput drops during both the prefill and decode phases. This happens because the system performs attention calculations over the large context length. To address this, we introduce Compute Context Length (CCL), an additional ONNX variable that allows dynamic context-length specialization. By generating tokens against smaller, more manageable context lengths (CCL values), we reduce memory reads and attention calculations, thereby improving throughput.
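To illustrate the idea only, here is a minimal sketch in plain Python. It is not the code in this PR: the helper names `pick_ccl` and `decode_kv_reads`, the bucket values, and the rule "use the smallest CCL that covers the tokens cached so far" are all assumptions made for the example. It counts how many KV-cache positions attention would touch during decoding with and without a set of smaller specialized lengths.

```python
import bisect


def pick_ccl(position: int, ccl_buckets: list[int]) -> int:
    """Return the smallest bucket that covers the tokens cached so far.

    Hypothetical selection rule: a token at `position` attends to
    positions 0..position, so `position + 1` cache entries are needed.
    `ccl_buckets` must be sorted in ascending order.
    """
    needed = position + 1
    idx = bisect.bisect_left(ccl_buckets, needed)
    if idx == len(ccl_buckets):
        raise ValueError(f"position {position} exceeds the largest bucket")
    return ccl_buckets[idx]


def decode_kv_reads(prompt_len: int, new_tokens: int, ccl_buckets: list[int]) -> int:
    """Sum the KV-cache positions read over all decode steps."""
    positions = range(prompt_len, prompt_len + new_tokens)
    return sum(pick_ccl(p, ccl_buckets) for p in positions)


# Illustrative numbers only, not measurements from this PR.
FULL_CTX_LEN = 4096
CCL_BUCKETS = [512, 1024, 2048, 4096]  # hypothetical specializations

with_ccl = decode_kv_reads(prompt_len=300, new_tokens=400, ccl_buckets=CCL_BUCKETS)
without_ccl = decode_kv_reads(prompt_len=300, new_tokens=400, ccl_buckets=[FULL_CTX_LEN])
print(f"KV positions read with CCL: {with_ccl}, without: {without_ccl}")
```

In this toy model, every decode step without CCL reads the full 4096-entry window, while with CCL the early steps read only 512 or 1024 entries. Real throughput gains depend on the hardware and compiler and have to be measured, as requested in the review comment below.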

Signed-off-by: vjanfaza <vjanfaza@qrc706r8-292-03.qualcomm.com>

quic-rishinr · 2025-05-15T08:35:53Z

Could you please run some models on the Qeff mainline and on this PR with CCL enabled and disabled? This will help us better understand the performance impact introduced by the CCL changes.

adding Context Length Specialization (CCL)

563cac3

Signed-off-by: vjanfaza <vjanfaza@qrc706r8-292-03.qualcomm.com>

quic-vjanfaza requested review from quic-rishinr, ochougul and quic-amitraj as code owners May 13, 2025 16:03

quic-rishinr mentioned this pull request May 15, 2025

adding Context Length Specialization (CCL) #388

Draft
