[Kernel][Quantization] Custom Floating-Point Runtime Quantization #8751

Open
wants to merge 24 commits into base: main
Changes from 1 commit
Commits (24)
be2dc59
add kernels
AlpinDale Sep 23, 2024
a4cafa3
implement quant_llm within vLLM
AlpinDale Sep 23, 2024
4e57c0a
format
AlpinDale Sep 24, 2024
c180657
Update vllm/config.py
AlpinDale Sep 24, 2024
425c5ad
Update csrc/quantization/quant_llm/quant_llm_linear.cu
AlpinDale Sep 24, 2024
e4bc2d7
format
AlpinDale Sep 24, 2024
070990e
Update csrc/quantization/quant_llm/quant_llm_linear.cu
AlpinDale Sep 24, 2024
aa808f5
remove `quant_llm` as a quant method
AlpinDale Sep 24, 2024
b4ddabb
Update csrc/quantization/quant_llm/quant_llm_linear.cu
AlpinDale Sep 25, 2024
2cc8ba0
Update csrc/quantization/quant_llm/quant_llm_linear.cu
AlpinDale Sep 25, 2024
7539fc6
address comments: rename to fp_eXmY and ignore for clang-format
AlpinDale Sep 25, 2024
931da32
address comment: keep one source of truth for exp/weight bits
AlpinDale Sep 25, 2024
5cfbeee
use TORCH_CHECK instead of assert
AlpinDale Sep 25, 2024
298df53
remove unneeded comment about min capability
AlpinDale Sep 25, 2024
0656923
remove config filenames
AlpinDale Sep 25, 2024
3900ccf
warnings for fp16 downcast and cuda graphs
AlpinDale Sep 25, 2024
e60b702
Merge branch 'main' into quant_llm
AlpinDale Sep 28, 2024
d834201
Format
mgoin Sep 30, 2024
7b4dd80
Merge branch 'main' into quant_llm
AlpinDale Oct 27, 2024
dd85d7a
fully rename everything to fp_eXmY
AlpinDale Oct 27, 2024
3e42deb
pass `_in_feats.device()` to stream capture
AlpinDale Oct 27, 2024
81b2c0f
Revert "pass `_in_feats.device()` to stream capture"
AlpinDale Oct 27, 2024
9341a47
use the total number of tokens (flattened bsz) for split-K reduction
AlpinDale Oct 27, 2024
0bdfb5d
Revert "Revert "pass `_in_feats.device()` to stream capture""
AlpinDale Oct 27, 2024
Revert "Revert "pass _in_feats.device() to stream capture""
This reverts commit 81b2c0f.
AlpinDale committed Oct 27, 2024

commit 0bdfb5db554accc94f4bb790905d3354242c476e
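
The diff below resolves the CUDA stream for the device that actually holds the input tensor, rather than relying on the implicit current device. As a rough illustration of the pattern (a minimal sketch with an illustrative dummy_kernel and launch_on_current_stream helper, not the PR's actual kernel), launching on the ambient PyTorch stream is what allows CUDA graph capture under torch.compile to record the kernel, whereas work enqueued on the legacy default stream (0) is not captured:

#include <ATen/cuda/CUDAContext.h>
#include <torch/all.h>

__global__ void dummy_kernel(float* out) { out[0] = 1.0f; }

void launch_on_current_stream(torch::Tensor in_feats, torch::Tensor out) {
  // Pick the stream PyTorch considers "current" for the device that holds
  // the inputs, not the process-wide default stream (0); during CUDA graph
  // capture only work enqueued on the capturing stream is recorded.
  auto dev = in_feats.device().index();
  cudaStream_t stream = at::cuda::getCurrentCUDAStream(dev);
  dummy_kernel<<<1, 1, 0, stream>>>(out.data_ptr<float>());
}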
3 changes: 2 additions & 1 deletion csrc/quantization/fp_eXmY/fp_eXmY_linear.cu
@@ -230,7 +230,8 @@ torch::Tensor fp_eXmY_linear_forward_cuda(int64_t EXPONENT, int64_t MANTISSA,
   // NOTE(alpin): use at::cuda::getCurrentCUDAStream() instead of default
   // stream (0) this fixes problem with CUDA graphs when used with
   // torch.compile()
-  auto stream = at::cuda::getCurrentCUDAStream();
+  auto dev = _in_feats.device().index();
+  auto stream = at::cuda::getCurrentCUDAStream(dev);

   /*
   The heuristic is weight_bit - exponent_bit - 1 = mantissa_bit
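
The truncated comment at the end of the hunk describes the format's bit budget: one sign bit plus X exponent bits plus Y mantissa bits. A minimal sketch of that arithmetic follows (the helper name infer_mantissa_bits is illustrative, not code from the PR; the PR's commits only confirm that TORCH_CHECK is used for validation):

#include <torch/all.h>

// mantissa_bit = weight_bit - exponent_bit - 1 (the remaining bit is the sign).
static int64_t infer_mantissa_bits(int64_t weight_bits, int64_t exponent_bits) {
  int64_t mantissa_bits = weight_bits - exponent_bits - 1;
  TORCH_CHECK(mantissa_bits >= 0, "fp_eXmY: ", exponent_bits,
              " exponent bits do not fit in a ", weight_bits, "-bit format");
  return mantissa_bits;
}

// Example: a 6-bit weight with a 3-bit exponent (fp_e3m2) has 6 - 3 - 1 = 2
// mantissa bits.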