The official implementation of SpargeAttn, a universal sparse attention method that accelerates language, image, and video models.
- [2025-05-02]: 🎉SpargeAttn and SageAttention2 are accepted by ICML 2025!
- [2025-01-24]: 🎉SageAttention is accepted by ICLR 2025!
- `python>=3.9`, `torch>=2.3.0`
- `CUDA`:
  - `>=12.8` for Blackwell
  - `>=12.4` for fp8 support on Ada
  - `>=12.3` for fp8 support on Hopper
  - `>=12.0` for Ampere
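
Before building, it can help to confirm that the local toolkit meets the CUDA minimums listed above. The following is a minimal sketch, not part of the package: the helper names are hypothetical, the mapping from compute capability to architecture (8.x Ampere, 8.9 Ada, 9.0 Hopper, 10.x and above Blackwell) is an assumption, and the Ada and Hopper minimums are treated as hard requirements even though the list ties them to fp8 support.

```python
import sys

# Minimum CUDA toolkit per GPU architecture, copied from the list above.
MIN_CUDA = {
    "Blackwell": (12, 8),
    "Ada": (12, 4),     # listed for fp8 support
    "Hopper": (12, 3),  # listed for fp8 support
    "Ampere": (12, 0),
}


def arch_from_capability(major, minor):
    """Map a compute capability to an architecture name (assumed mapping)."""
    if major >= 10:
        return "Blackwell"
    if (major, minor) == (9, 0):
        return "Hopper"
    if (major, minor) == (8, 9):
        return "Ada"
    if major == 8:
        return "Ampere"
    return None  # older than Ampere: not covered by the list above


def parse_version(text):
    """Turn '12.4' or '2.3.0+cu121' into a tuple of ints."""
    core = text.split("+")[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def cuda_is_sufficient(cuda_version, capability):
    """Return True if cuda_version meets the minimum for the given capability."""
    arch = arch_from_capability(*capability)
    if arch is None:
        return False
    return parse_version(cuda_version) >= MIN_CUDA[arch]


if __name__ == "__main__":
    print("python ok:", sys.version_info >= (3, 9))
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        print("torch ok:", parse_version(torch.__version__) >= (2, 3, 0))
        if torch.cuda.is_available() and torch.version.cuda:
            cap = torch.cuda.get_device_capability()
            print("cuda ok:", cuda_is_sufficient(torch.version.cuda, cap))
```

The version check itself is plain Python, so it also runs on machines without a GPU; only the final block queries `torch`.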
pip install ninja # for parallel compilation
python setup.py install # or pip install -e .
- `spas_sage2_attn_meansim_cuda`: SpargeAttn based on SageAttention2.
- `spas_sage_attn_meansim_cuda`: SpargeAttn based on SageAttention.
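
Code that should work with either entry point can look the kernel up at runtime instead of hard-coding one name. The sketch below is an illustration only: `resolve_sparge_kernel` is a hypothetical helper, and the module name `spas_sage_attn` is an assumption about how the installed package is imported. It returns the first of the two functions listed above that is present, or `None` if the package is missing.

```python
import importlib

# Entry points named in the list above, SageAttention2-based variant first.
KERNEL_NAMES = ("spas_sage2_attn_meansim_cuda", "spas_sage_attn_meansim_cuda")


def resolve_sparge_kernel(module_name="spas_sage_attn", names=KERNEL_NAMES):
    """Return (name, callable) for the first kernel found, or None.

    module_name is an assumed import name; adjust it to match your install.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # package not installed (or failed to import)
    for name in names:
        fn = getattr(module, name, None)
        if callable(fn):
            return name, fn
    return None


if __name__ == "__main__":
    found = resolve_sparge_kernel()
    if found is None:
        print("SpargeAttn kernels not available; keep the default attention.")
    else:
        print("Using", found[0])
```

The call signature of the returned function is not covered here; consult the installed package before wiring it into a model.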
Tuning:
# sequential tuning
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune
# parallel tuning; this uses all GPUs available on the machine
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune --parallel_tune
Inference:
# `--compile` is optional and slows down the first inference run.
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --compile
Note: We provide pre-tuned hyper-parameters, `CogVideoX-2b_0.06_0.07.pt`, that allow running the inference script directly. However, for better speed and quality, we recommend re-tuning, because the provided hyper-parameters were tuned with SpargeAttn based on SageAttention, whereas the default API is now based on SageAttention2.
Note: `--compile` is optional. It further accelerates video generation but adds overhead to the first generation.
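
When tuning and inference are driven from another Python script, the commands above can be assembled programmatically. This is a small sketch using only the script path and flags shown above; the function name `build_cogvideo_command` is hypothetical, and pairing `--parallel_tune` with `--tune` is an assumption based on the commands above always using them together.

```python
import shlex


def build_cogvideo_command(model_out_path, tune=False, parallel_tune=False,
                           compile_model=False):
    """Build the argument list for evaluate/cogvideo_example.py."""
    cmd = [
        "python", "evaluate/cogvideo_example.py",
        "--use_spas_sage_attn",
        "--model_out_path", model_out_path,
    ]
    # The documented parallel run passes both flags, so keep them paired.
    if tune or parallel_tune:
        cmd.append("--tune")
    if parallel_tune:
        cmd.append("--parallel_tune")
    if compile_model:
        cmd.append("--compile")
    return cmd


if __name__ == "__main__":
    ckpt = "evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt"
    # Print the commands; pass a list to subprocess.run(...) to execute one.
    print(shlex.join(build_cogvideo_command(ckpt, tune=True)))
    print(shlex.join(build_cogvideo_command(ckpt, parallel_tune=True)))
    print(shlex.join(build_cogvideo_command(ckpt, compile_model=True)))
```

Returning a list rather than a string avoids shell quoting issues if the checkpoint path contains spaces.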
The tuning and inference usage is similar to CogVideoX.
Here is a list of the models tuned so far; see Hugging Face for all tuned checkpoints. Our approach is universal, and we warmly welcome contributions! Feel free to submit a pull request to support more models. 🚀
| Model name | Example script | Tuned ckpt |
|---|---|---|
| CogVideoX-2b | evaluate/cogvideo_example.py | link |
| want2v-1.3B | evaluate/wan_example.py | link |
| Flux | evaluate/flux_example.py | TBD |
Note: All experiments in the table above and in our paper used SpargeAttn based on SageAttention. An updated implementation based on SageAttention2 is now available and offers a further 30% speedup.
Figure: The quality of video generation on Mochi.

Figure: End-to-end performance of NIAH.
If you use this code or find our work valuable, please cite:
@inproceedings{zhang2025spargeattn,
title={SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference},
author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}
@inproceedings{zhang2025sageattention,
title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration},
author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}
@inproceedings{zhang2024sageattention2,
title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization},
author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
booktitle={International Conference on Machine Learning (ICML)},
year={2025}
}