
[Kernel][Quantization] Custom Floating-Point Runtime Quantization #8751

Status: Open (wants to merge 24 commits into base: main)

Conversation

AlpinDale (Contributor) commented Sep 23, 2024

This PR adds support for loading FP16 models in custom floating-point formats at runtime. It currently supports FP4, FP5, FP6, and FP7.

Some useful features:

  • Loads weights directly in the specified data type, so you don't need enough memory to hold the FP16 weights (unlike the current FP8 quant in vLLM)
  • Preserves FP16 batched throughput
  • FP6 matches FP8/FP16 accuracy
  • Near-instant quantization: ~8 seconds for an 8B model

The method is based on FP6-LLM and was ported from aphrodite-engine/aphrodite-engine#755.

FIX #8716
FIX #4515
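
For intuition about what these ExMy formats do: each FP16 weight is rounded onto a small sign/exponent/mantissa grid with a per-output-channel scale, and the kernel dequantizes back to FP16 on the fly inside the GEMM while activations stay in FP16. The snippet below is only a rough quantize-dequantize (QDQ) simulation of that idea, not the PR's packing or kernel code; the helper names are hypothetical, and the exponent/mantissa splits shown are just one plausible choice (the PR configures these via exponent_bits/mantissa_bits in the quant config, and its log reports FP6_E2M3, for example).

import torch

def fp_exmy_grid(exp_bits: int, man_bits: int) -> torch.Tensor:
    """All non-negative values of a tiny ExMy float format
    (IEEE-style bias and subnormals, no inf/nan)."""
    bias = 2 ** (exp_bits - 1) - 1
    vals = {0.0}
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:   # subnormal numbers
                vals.add((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:        # normal numbers
                vals.add((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    return torch.tensor(sorted(vals))

def fake_quant(w: torch.Tensor, exp_bits: int, man_bits: int) -> torch.Tensor:
    """QDQ simulation: per-output-channel absmax scale, then round each weight
    to the nearest ExMy grid value and scale back."""
    grid = fp_exmy_grid(exp_bits, man_bits).to(w.dtype)
    scale = (w.abs().amax(dim=1, keepdim=True) / grid.max()).clamp(min=1e-6)
    x = w / scale
    idx = (x.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)  # nearest grid point
    return grid[idx] * x.sign() * scale

w = torch.randn(8, 16, dtype=torch.float16)
for e_bits, m_bits in [(2, 1), (2, 2), (2, 3), (3, 3)]:  # 4-, 5-, 6-, 7-bit incl. sign
    err = (fake_quant(w, e_bits, m_bits) - w).abs().max().item()
    print(f"E{e_bits}M{m_bits}: max abs round-off {err:.4f}")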

Usage

vllm serve NousResearch/Meta-Llama-3.1-8B-Instruct -q fp6_weights
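
The same can be done through the offline API; a minimal sketch, where the only PR-specific assumption is the "fp6_weights" quantization name (everything else is the standard vLLM Python API):

from vllm import LLM, SamplingParams

llm = LLM(
    model="NousResearch/Meta-Llama-3.1-8B-Instruct",
    quantization="fp6_weights",  # weights are quantized to FP6 at load time
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What does FP6 weight-only quantization trade off?"], params)
print(outputs[0].outputs[0].text)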

TODO

  • Investigate issues with Qwen models
  • Implement channel-wise quantization to improve FP4 performance
  • Extend support to FP8 (primarily for Ampere GPUs, as an alternative to Marlin)
  • Try to support Turing (low priority)

Benchmarks

Throughput

Tested on NVIDIA H100 SXM.

[throughput chart]

(H100; the GPU name in the image above is a typo)

[throughput chart]

Expand for detailed numbers
#!/bin/bash

quantization_modes=("fp4_weights" "fp5_weights" "fp6_weights" "fp7_weights" "fp8")
tp_sizes=(1 2 4)
models=("NousResearch/Meta-Llama-3.1-8B-Instruct" "mistralai/Mistral-Small-Instruct-2409")

for model in "${models[@]}"; do
  for quantization_mode in "${quantization_modes[@]}"; do
    for tp_size in "${tp_sizes[@]}"; do
      python benchmarks/benchmark_throughput.py \
          --model "$model" \
          --gpu-memory-utilization 0.98 \
          --input-len 512 \
          --output-len 52 \
          --quantization "$quantization_mode" \
          -tp "$tp_size" \
          --max-model-len 2048 \
          --enable-chunked-prefill
    done
  done
done

Meta-Llama-3.1-8B

| Data Type | TP Size | Request/s | Total Tokens/s | Weight Mem Usage (GiB) |
|---|---|---|---|---|
| FP4 | 1 | 24.56 | 13851.68 | 3.62 |
| FP4 | 2 | 32.00 | 18046.53 | |
| FP4 | 4 | 34.68 | 19560.06 | |
| FP5 | 1 | 23.13 | 13044.27 | 6.11 |
| FP5 | 2 | 30.69 | 17307.73 | |
| FP5 | 4 | 32.74 | 18462.69 | |
| FP6 | 1 | 23.24 | 13104.85 | 6.87 |
| FP6 | 2 | 30.91 | 17431.46 | |
| FP6 | 4 | 33.14 | 18690.69 | |
| FP7 | 1 | 22.48 | 12678.27 | 7.74 |
| FP7 | 2 | 30.24 | 17053.81 | |
| FP7 | 4 | 33.44 | 18857.93 | |
| FP8 (W8A8) | 1 | 34.27 | 19328.41 | 8.49 |
| FP8 (W8A8) | 2 | 29.61 | 16699.40 | |
| FP8 (W8A8) | 4 | 29.07 | 16392.71 | |
| BF16 | 1 | 36.29 | 20466.76 | 14.98 |
| BF16 | 2 | 30.70 | 17312.38 | |
| BF16 | 4 | 31.47 | 17751.32 | |

Mistral-Small-Instruct-2409 (FP8 OOM'd at TP1, had to quantize first):

| Data Type | TP Size | Request/s | Total Tokens/s | Weight Mem Usage (GiB) |
|---|---|---|---|---|
| FP4 | 1 | 10.39 | 5859.19 | 10.93 |
| FP4 | 2 | 16.34 | 9214.50 | |
| FP4 | 4 | 21.07 | 11885.75 | |
| FP5 | 1 | 9.51 | 5365.26 | 13.47 |
| FP5 | 2 | 15.20 | 8574.19 | |
| FP5 | 4 | 19.91 | 11227.66 | |
| FP6 | 1 | 9.62 | 5424.02 | 16.07 |
| FP6 | 2 | 15.35 | 8659.38 | |
| FP6 | 4 | 19.93 | 11237.89 | |
| FP7 | 1 | 9.17 | 5169.51 | 18.59 |
| FP7 | 2 | 14.88 | 8393.25 | |
| FP7 | 4 | 19.74 | 11133.56 | |
| FP8 (W8A8) | 1 | 23.23 | 19328.41 | 21.11 |
| FP8 (W8A8) | 2 | 19.61 | 11057.32 | |
| FP8 (W8A8) | 4 | 19.77 | 11153.04 | |
| BF16 | 1 | 18.25 | 10290.51 | 41.44 |
| BF16 | 2 | 19.31 | 10890.39 | |
| BF16 | 4 | 19.17 | 10811.15 | |

Accuracy

[GSM8K accuracy charts]

Expand for more details
lm_eval --model vllm --model_args pretrained=NousResearch/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=2,add_bos_token=True,quantization=fp4_weights --tasks gsm8k --num_fewshot 5 --batch_size auto

Meta-Llama-3.1-8B-Instruct (lm_eval gsm8k v3, 5-shot, exact_match)

| Data Type | GSM8K Eval Time | flexible-extract | strict-match |
|---|---|---|---|
| FP4 | 1:15 | 0.6619 ± 0.0130 | 0.6497 ± 0.0131 |
| FP5 | 1:15 | 0.7551 ± 0.0118 | 0.6566 ± 0.0131 |
| FP6 | 1:17 | 0.7930 ± 0.0112 | 0.7293 ± 0.0122 |
| FP7 | 1:17 | 0.7771 ± 0.0115 | 0.7187 ± 0.0124 |
| FP8 (W8A8) | 1:17 | 0.7733 ± 0.0115 | 0.6899 ± 0.0127 |
| BF16 | 1:14 | 0.7771 ± 0.0115 | 0.7096 ± 0.0125 |

Mistral-Small-Instruct-2409 (lm_eval gsm8k v3, 5-shot, exact_match)

| Data Type | GSM8K Eval Time | flexible-extract | strict-match |
|---|---|---|---|
| FP4 | 2:32 | 0.7642 ± 0.0117 | 0.7513 ± 0.0119 |
| FP5 | 2:46 | 0.8211 ± 0.0106 | 0.8097 ± 0.0108 |
| FP6 | 2:46 | 0.8203 ± 0.0106 | 0.8059 ± 0.0109 |
| FP7 | 2:53 | 0.8165 ± 0.0107 | 0.8021 ± 0.0110 |
| FP8 (W8A8) | 1:20 | 0.8249 ± 0.0105 | 0.8120 ± 0.0108 |
| BF16 | 1:33 | 0.8249 ± 0.0105 | 0.8097 ± 0.0108 |

cc @mgoin @robertgshaw2-neuralmagic


PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. This helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide status updates every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

AlpinDale and others added 2 commits September 23, 2024 15:56
Co-authored-by: intervitens <intervitens@tutanota.com>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@flozi00 flozi00 mentioned this pull request Sep 24, 2024

mgoin (Member) commented Sep 24, 2024

Comparing wheel sizes from the docker build job.

before:

[2024-09-23T23:43:23Z] #31 0.944 Wheel dist/vllm-0.6.1.post3.dev102+g17ed7583.d20240923.cu124-cp38-abi3-linux_x86_64.whl is within the allowed size (217.23 MB).

after:

[2024-09-24T14:14:40Z] #31 0.792 Wheel dist/vllm-0.6.1.post2+cu124-cp38-abi3-linux_x86_64.whl is within the allowed size (221.51 MB).

This seems suspiciously small, but I'll take that as a positive signal!

// stream (0) this fixes problem with CUDA graphs when used with
// torch.compile()
auto stream = at::cuda::getCurrentCUDAStream();

Collaborator

Similar to @mgoin's suggestion above, could you add the following to make sure the kernel runs on the right GPU?

at::cuda::OptionalCUDAGuard const device_guard(device_of(_in_feats));

And when you add unit tests, it would be nice to add a test for running on a GPU other than GPU 0. (We do this in tests/kernels/test_cutlass.py for example)

AlpinDale (Contributor Author)

Tried this with little luck, but maybe I did it wrong. Will try again.

ghost commented Sep 25, 2024

This seems to overlap/replace the FP6/8 "deepspeedfp" quantization added in #4652.

mgoin (Member) commented Sep 25, 2024

@charai-frontend yep I think we can remove deepspeedfp after this lands

@AlpinDale AlpinDale mentioned this pull request Sep 25, 2024
AlpinDale (Contributor Author)

@mgoin Since CUDA graph issues are rare, I believe we can keep eager mode disabled for this quant, but with a warning that the user should enable it if they run into any issues.

russellb added a commit to russellb/instructlab that referenced this pull request Sep 25, 2024
This change adds a new configuration option for vllm: a path to a
Python virtual environment to use when running vllm. This allows you
to run a custom version of vllm for development and testing purposes,
even if that version of vllm has incompatible python dependencies with
the current version of instructlab.

As an example of how this is useful, I wanted to try the following
vllm PR to run mixtral on a host with less GPU memory than is
typically required:

    vllm-project/vllm#8751

To install vllm into its own venv:

    python3.11 -m venv venv-vllm
    . venv-vllm/bin/activate
    git clone https://github.com/vllm-project/vllm
    cd vllm
    git fetch origin pull/8751/head
    git checkout FETCH_HEAD
    export CUDA_HOME=/usr/local/cuda
    export PATH=${CUDA_HOME}/bin:${PATH}
    pip install -e .

Back in my instructlab venv ...

    . venv-ilab/bin/activate
    ilab model download --repository mistralai/Mixtral-8x7B-Instruct-v0.1
    ilab config edit

Edit vllm config. Note the value of `--tensor-parallel-size` is for
the number of GPUs I have available.

    serve:
      backend: vllm
      model_path: /home/ec2-user/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1
      vllm:
        vllm_args: ['--tensor-parallel-size', '4', '-q', 'fp6_weights']
        vllm_venv: "/home/ec2-user/venv"

This now works as mixtral is quantized at load-time, cutting the gpu
memory requirement in half.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
AlpinDale (Contributor Author)

Needle test results for the Magnum-v2-72B model (a Qwen2-72B finetune) over 16,384 tokens of context. FP6 seems to slightly outperform BF16.

magnum-v2-72b-16K.png

magnum-v2-72b-fp6-16K.png

russellb added a commit to russellb/instructlab that referenced this pull request Sep 27, 2024
This change adds a new configuration option for vllm: a path to the
vllm executable. This allows you to point to an instance of `vllm`
installed in a different Python virtual environment than `ilab`.

This also changes the `vllm` backend to run `vllm serve` instead of an
older style of specifying the openai api server entrypoint manually.

As an example of how this is useful, I wanted to try the following
vllm PR to run mixtral on a host with less GPU memory than is
typically required. Running this version has Python dependencies not
compatible with current `ilab`.

    vllm-project/vllm#8751

To install vllm into its own venv:

    python3.11 -m venv venv-vllm
    . venv-vllm/bin/activate
    git clone https://github.com/vllm-project/vllm
    cd vllm
    git fetch origin pull/8751/head
    git checkout FETCH_HEAD
    export CUDA_HOME=/usr/local/cuda
    export PATH=${CUDA_HOME}/bin:${PATH}
    pip install -e .

Back in my instructlab venv ...

    . venv-ilab/bin/activate
    ilab model download --repository mistralai/Mixtral-8x7B-Instruct-v0.1
    ilab config edit

Edit vllm config. Note the value of `--tensor-parallel-size` is for
the number of GPUs I have available.

    serve:
      backend: vllm
      model_path: /home/ec2-user/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1
      vllm:
        vllm_args: ['--tensor-parallel-size', '4', '-q', 'fp6_weights']
        vllm_path: "/home/ec2-user/venv/bin/vllm"

This now works as mixtral is quantized at load-time, cutting the gpu
memory requirement in half.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Comment on lines 230 to 233
// NOTE(alpin): use at::cuda::getCurrentCUDAStream() instead of default
// stream (0) this fixes problem with CUDA graphs when used with
// torch.compile()
auto stream = at::cuda::getCurrentCUDAStream();
Member

Did you try passing in _in_feats.device() to get the CUDA stream for the current device? Usually we do this: at::cuda::getCurrentCUDAStream(dev)

AlpinDale (Contributor Author)

This change seems to break tensor parallelism with custom all-reduce:

(VllmWorkerProcess pid=2878816) INFO 10-27 21:26:21 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/austin/.cache/vllm/gpu_p2p_access_cache_for_8,9.json
Failed: Cuda error /home/austin/disk1/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /home/austin/disk1/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'

Reverting for now.

AlpinDale (Contributor Author)

Update: doesn't seem to be caused by this.

weights = weight.data
scales = weight.scales
out_dim, in_dim = weights.shape
bsize = x.shape[0]
Member

Should bsize actually be the number of tokens, i.e. the batch size and sequence length flattened together? For instance, for Marlin we flatten the input:

reshaped_x = input.reshape(-1, input.shape[-1])
out_shape = input.shape[:-1] + (output_size_per_partition, )

AlpinDale (Contributor Author)

Good point - I'll try this and get back to you (sorry for the late response, I've been busy with work)

AlpinDale (Contributor Author)

Flattened batch size:

Processed prompts: 100%|█| 1000/1000 [01:14<00:00, 13.50it/s, est. speed input: 6909.52 toks/s, output: 701.75 tok/s

First dim only:

Processed prompts: 100%|█| 1000/1000 [01:14<00:00, 13.48it/s, est. speed input: 6899.52 toks/s, output: 700.73 tok/s

Likely within the margin of error; I think we can go with this.

    DEFAULT_FP_EXMY_EXP_BITS, VALID_FP_EXMY_METHODS)
if self.quantization is not None and self.quantization in \
        VALID_FP_EXMY_METHODS:
    fp_bits = int(self.quantization[2])
Member

This is kind of sketchy
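
(Not from the PR; just a hypothetical, more explicit way to recover the bit width than slicing the method name, assuming the fp4_weights..fp7_weights names used elsewhere in this PR:)

import re

VALID_FP_EXMY_METHODS = ["fp4_weights", "fp5_weights", "fp6_weights", "fp7_weights"]

def parse_fp_exmy_bits(quantization: str) -> int:
    """Extract the total bit width from a method name like 'fp6_weights'."""
    match = re.fullmatch(r"fp([4-7])_weights", quantization)
    if match is None or quantization not in VALID_FP_EXMY_METHODS:
        raise ValueError(f"Unsupported fp_eXmY method: {quantization!r}")
    return int(match.group(1))

assert parse_fp_exmy_bits("fp6_weights") == 6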

mgoin (Member) left a comment

As mentioned offline, it would be nice to have some kernel unit tests to compare against baseline unquantized or qdq

iamthemulti commented:

I tested this PR with the recently added GLM-4V support (multimodal model) using the following command:

python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4v-9b --quantization fp6_weights --trust_remote_code --max-model-len 8192 --gpu-memory-utilization 0.85 --enforce-eager

And was met with the following errors:

/home/user/PycharmProjects/vllm/.venv/bin/python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4v-9b --quantization fp6_weights --trust_remote_code --max-model-len 8192 --gpu-memory-utilization 0.85 --enforce-eager 
INFO 10-14 12:32:15 api_server.py:528] vLLM API server version 0.1.dev2997+g50f3245
INFO 10-14 12:32:15 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='THUDM/glm-4v-9b', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.85, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='fp6_weights', rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-14 12:32:15 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/6127757f-aec4-4061-88e2-6fe8713c6577 for IPC Path.
INFO 10-14 12:32:15 api_server.py:179] Started engine process with PID 25188
INFO 10-14 12:32:27 config.py:304] torch.bfloat16 data type is not supported for fp6 quantization. Using float16 instead.
WARNING 10-14 12:32:27 config.py:310] CUDA Graph execution may not work with fp6 quantization. You can try disabling it with `enforce_eager=True` if you run into issues.
WARNING 10-14 12:32:27 config.py:405] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-14 12:32:27 llm_engine.py:237] Initializing an LLM engine (v0.1.dev2997+g50f3245) with config: model='THUDM/glm-4v-9b', speculative_config=None, tokenizer='THUDM/glm-4v-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp6_weights, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=THUDM/glm-4v-9b, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-14 12:32:28 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 10-14 12:32:29 config.py:304] torch.bfloat16 data type is not supported for fp6 quantization. Using float16 instead.
WARNING 10-14 12:32:29 config.py:310] CUDA Graph execution may not work with fp6 quantization. You can try disabling it with `enforce_eager=True` if you run into issues.
WARNING 10-14 12:32:29 config.py:405] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-14 12:32:29 model_runner.py:1045] Starting to load model THUDM/glm-4v-9b...
INFO 10-14 12:32:29 fp_eXmY.py:60] Loading model in FP6_E2M3 format.
INFO 10-14 12:32:29 weight_utils.py:243] Using model weights format ['*.safetensors']
WARNING 10-14 12:32:30 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:00<00:13,  1.06it/s]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:01<00:11,  1.09it/s]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:02<00:11,  1.09it/s]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:03<00:09,  1.14it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:04<00:08,  1.16it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:05<00:07,  1.17it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:06<00:06,  1.18it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:06<00:06,  1.17it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:07<00:05,  1.15it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:08<00:03,  1.45it/s]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [00:08<00:02,  1.40it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:09<00:02,  1.41it/s]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [00:10<00:01,  1.33it/s]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:11<00:00,  1.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:12<00:00,  1.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:12<00:00,  1.25it/s]

INFO 10-14 12:32:46 model_runner.py:1056] Loading model weights took 11.2750 GB
WARNING 10-14 12:32:54 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 10-14 12:32:54 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123254.pkl...
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/user/PycharmProjects/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/worker/model_runner.py", line 1650, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/chatglm.py", line 613, in forward
    hidden_states = self.transformer(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/chatglm.py", line 539, in forward
    image_embeds = self.vision(pixel_values)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/glm4_vision_encoder.py", line 284, in forward
    x = self.transformer(x)
        ^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/glm4_vision_encoder.py", line 174, in forward
    hidden_states = layer_module(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/glm4_vision_encoder.py", line 151, in forward
    self.attention(attention_input))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/glm4_vision_encoder.py", line 84, in forward
    qkv, _ = self.query_key_value(x)  # B, L, 3 * H * D
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/layers/linear.py", line 371, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/layers/quantization/fp_eXmY.py", line 165, in apply
    return ops.fp_eXmY_linear_forward_cuda(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/_custom_ops.py", line 45, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/_custom_ops.py", line 322, in fp_eXmY_linear_forward_cuda
    return torch.ops._C.fp_eXmY_linear_forward_cuda(EXPONENT, MANTISSA,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected in_features to be a multiple of 64, but received 6401

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/PycharmProjects/vllm/vllm/engine/multiprocessing/engine.py", line 392, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
    return cls(
           ^^^^
  File "/home/user/PycharmProjects/vllm/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/engine/llm_engine.py", line 349, in __init__
    self._initialize_kv_caches()
  File "/home/user/PycharmProjects/vllm/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/worker/model_runner.py", line 1294, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-123254.pkl): Expected in_features to be a multiple of 64, but received 6401
INFO 10-14 12:32:54 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123254.pkl.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/PycharmProjects/vllm/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start

Process finished with exit code 1

I attempted to step through the debugger to see if I could parse what the issue is, but it is above my pay grade.

Here is the vision config for GLM-4V:

  "vision_config": {
    "dropout_prob": 0.0,
    "hidden_act": "gelu",
    "in_channels": 3,
    "num_hidden_layers": 63,
    "hidden_size": 1792,
    "patch_size": 14,
    "num_heads": 16,
    "intermediate_size": 15360,
    "layer_norm_eps": 1e-06,
    "num_positions": 6401,
    "image_size": 1120,
    "scaling_factor": 8
  },

Does this appear to be an issue with the quant. implementation, or perhaps something else?

@AlpinDale AlpinDale requested a review from WoosukKwon as a code owner October 27, 2024 20:41
AlpinDale (Contributor Author)

@iamthemulti I have not tested this PR against VLMs so I'm unable to comment on that. I will take a look later. cc @Isotr0py

Isotr0py (Collaborator)

@iamthemulti @AlpinDale That's because pixel_values inputs for the vision encoder usually have an extra batch_size dim, with a shape like torch.Size([2, 6401, 1792]), which causes the output dim to be indexed incorrectly.

Comment on lines +165 to +173
        if bias is None:
            return ops.fp_eXmY_linear_forward_cuda(
                self.quant_config.exponent_bits,
                self.quant_config.mantissa_bits, x, weights, scales, splitK)
        else:
            return ops.fp_eXmY_linear_forward_cuda(
                self.quant_config.exponent_bits,
                self.quant_config.mantissa_bits, x, weights, scales,
                splitK) + bias
Collaborator

Suggested change (replace the block above with):

        reshape_after_matmul = False
        if x.ndim > 2:
            original_shape = x.shape
            x = x.reshape(-1, x.size(-1))
            reshape_after_matmul = True
        out = ops.fp_eXmY_linear_forward_cuda(
            self.quant_config.exponent_bits,
            self.quant_config.mantissa_bits, x, weights, scales, splitK)
        if reshape_after_matmul:
            out = out.view(*original_shape[:-1], out.size(-1))
        if bias is not None:
            out += bias
        return out

FYI, I can run glm-4v inference with this change. :)

@tjtanaa tjtanaa mentioned this pull request Dec 10, 2024
@mergify mergify bot added the ci/build label Dec 11, 2024

mergify bot commented Dec 11, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AlpinDale.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 11, 2024
dengyingxu commented:

Could you please let me know if this feature will continue to be developed? In what ways is it currently not meeting expectations?

Successfully merging this pull request may close these issues: [RFC]: quant llm from alpindale; [Feature]: FP6.