[Kernel][Quantization] Custom Floating-Point Runtime Quantization #8751
base: main
Conversation
Co-authored-by: intervitens <intervitens@tutanota.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Comparing wheel sizes from the docker build job. before:
after:
This seems suspiciously small, but I'll take that as a positive signal!
// stream (0) this fixes problem with CUDA graphs when used with
// torch.compile()
auto stream = at::cuda::getCurrentCUDAStream();
Similar to @mgoin's suggestion above, could you add the following to make sure the kernel runs on the right GPU?
at::cuda::OptionalCUDAGuard const device_guard(device_of(_in_feats));
And when you add unit tests, it would be nice to add a test for running on a GPU other than GPU 0. (We do this in `tests/kernels/test_cutlass.py`, for example.)
Tried this with little luck, but maybe I did it wrong. Will try again.
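For the testing side of that suggestion, here is a hedged sketch of what a device-parameterized kernel test could look like, in the spirit of `tests/kernels/test_cutlass.py`. `quantize_fp_eXmY` is a hypothetical stand-in for however this PR packs FP16 weights into the eXmY format, and the op call mirrors the `fp_eXmY_linear_forward_cuda` usage quoted later in this thread:

```python
# Hedged sketch, not the PR's actual test. `quantize_fp_eXmY` is a hypothetical
# helper standing in for this PR's weight-packing code.
import pytest
import torch
from vllm import _custom_ops as ops


@pytest.mark.parametrize("device", ["cuda:0", "cuda:1"])
def test_fp_eXmY_linear_runs_on_any_gpu(device):
    if torch.cuda.device_count() < 2:
        pytest.skip("requires at least two GPUs")
    x = torch.randn(16, 4096, dtype=torch.half, device=device)
    w = torch.randn(4096, 4096, dtype=torch.half)
    weights, scales = quantize_fp_eXmY(w, exp_bits=3, mantissa_bits=2)  # hypothetical
    weights, scales = weights.to(device), scales.to(device)
    # exponent_bits=3, mantissa_bits=2 (FP6); splitK=1 is assumed here
    out = ops.fp_eXmY_linear_forward_cuda(3, 2, x, weights, scales, 1)
    assert out.shape == (16, 4096)
    assert out.device == x.device
```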
This seems to overlap/replace the FP6/8 "deepspeedfp" quantization added in #4652.
@charai-frontend yep, I think we can remove deepspeedfp after this lands.
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
@mgoin Since CUDA graph issues are rare, I believe we can keep eager mode disabled for this quant, but with a warning that the user should enable it if they run into any issues.
This change adds a new configuration option for vllm: a path to the vllm executable. This allows you to point to an instance of `vllm` installed in a different Python virtual environment than `ilab`. This also changes the `vllm` backend to run `vllm serve` instead of the older style of specifying the OpenAI API server entrypoint manually.

As an example of how this is useful, I wanted to try the following vllm PR to run Mixtral on a host with less GPU memory than is typically required. Running this version has Python dependencies that are not compatible with the current `ilab`: vllm-project/vllm#8751

To install vllm into its own venv:

    python3.11 -m venv venv-vllm
    . venv-vllm/bin/activate
    git clone https://github.com/vllm-project/vllm
    cd vllm
    git fetch origin pull/8751/head
    git checkout FETCH_HEAD
    export CUDA_HOME=/usr/local/cuda
    export PATH=${CUDA_HOME}/bin:${PATH}
    pip install -e .

Back in my instructlab venv:

    . venv-ilab/bin/activate
    ilab model download --repository mistralai/Mixtral-8x7B-Instruct-v0.1
    ilab config edit

Edit the vllm config. Note the value of `--tensor-parallel-size` is for the number of GPUs I have available.

    serve:
      backend: vllm
      model_path: /home/ec2-user/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1
      vllm:
        vllm_args: ['--tensor-parallel-size', '4', '-q', 'fp6_weights']
        vllm_path: "/home/ec2-user/venv/bin/vllm"

This now works, as Mixtral is quantized at load time, cutting the GPU memory requirement in half.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
// NOTE(alpin): use at::cuda::getCurrentCUDAStream() instead of default
// stream (0) this fixes problem with CUDA graphs when used with
// torch.compile()
auto stream = at::cuda::getCurrentCUDAStream();
Did you try passing in `_in_feats.device()` to get the CUDA stream for the current device? Usually we do this: `at::cuda::getCurrentCUDAStream(dev)`
This change seems to break tensor parallelism with custom all-reduce:
(VllmWorkerProcess pid=2878816) INFO 10-27 21:26:21 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/austin/.cache/vllm/gpu_p2p_access_cache_for_8,9.json
Failed: Cuda error /home/austin/disk1/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /home/austin/disk1/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'
Reverting for now.
Update: doesn't seem to be caused by this.
weights = weight.data
scales = weight.scales
out_dim, in_dim = weights.shape
bsize = x.shape[0]
Should bsize actually be the number of tokens, i.e. the batch size and num tokens flattened together? For instance, for marlin we flatten the input:
vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py
Lines 275 to 276 in 4f341bd
reshaped_x = input.reshape(-1, input.shape[-1])
out_shape = input.shape[:-1] + (output_size_per_partition, )
Good point - I'll try this and get back to you (sorry for the late response, I've been busy with work)
Flattened batch size:
Processed prompts: 100%|█| 1000/1000 [01:14<00:00, 13.50it/s, est. speed input: 6909.52 toks/s, output: 701.75 tok/s
First dim only:
Processed prompts: 100%|█| 1000/1000 [01:14<00:00, 13.48it/s, est. speed input: 6899.52 toks/s, output: 700.73 tok/s
Likely within margin of error, I think we can go with this.
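For concreteness, a minimal sketch of the two interpretations being compared above (shapes are illustrative only):

```python
# Illustrative only: "first dim only" vs. marlin-style flattening of the
# activation before the matmul, for an input with extra leading dims.
import torch

x = torch.randn(4, 128, 4096, dtype=torch.half)   # [batch, seq, hidden]

bsize_first_dim = x.shape[0]                       # 4
reshaped_x = x.reshape(-1, x.shape[-1])            # [512, 4096]
bsize_flattened = reshaped_x.shape[0]              # 512 = total token count
```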
DEFAULT_FP_EXMY_EXP_BITS, VALID_FP_EXMY_METHODS)
if self.quantization is not None and self.quantization in \
        VALID_FP_EXMY_METHODS:
    fp_bits = int(self.quantization[2])
This is kind of sketchy
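For illustration only (not a suggestion taken from the PR itself), one hypothetical way to make the intent explicit rather than indexing character 2 of the method name, assuming names look like `fp6_...`:

```python
# Hypothetical sketch: extract the bit width from a method name such as
# "fp6_weights" explicitly, instead of relying on the character at index 2.
import re

def parse_fp_bits(quantization: str) -> int:
    match = re.fullmatch(r"fp(\d)\w*", quantization)
    if match is None:
        raise ValueError(f"unrecognized fp_eXmY method: {quantization!r}")
    return int(match.group(1))
```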
As mentioned offline, it would be nice to have some kernel unit tests to compare against baseline unquantized or qdq
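A hedged sketch of the kind of check being asked for: quantize-dequantize ("qdq") the weights in PyTorch and compare the kernel output against a plain FP16 matmul on the dequantized weights. `quantize_fp_eXmY` and `dequantize_fp_eXmY` are hypothetical helpers standing in for this PR's packing utilities, and the tolerances are guesses:

```python
# Hedged sketch only; helper names are hypothetical, tolerances are guesses.
import torch
from vllm import _custom_ops as ops


def check_against_qdq(x, w, exp_bits=3, mantissa_bits=2, splitK=1):
    weights, scales = quantize_fp_eXmY(w, exp_bits, mantissa_bits)        # hypothetical
    w_dq = dequantize_fp_eXmY(weights, scales, exp_bits, mantissa_bits)   # hypothetical
    out = ops.fp_eXmY_linear_forward_cuda(exp_bits, mantissa_bits,
                                          x, weights, scales, splitK)
    ref = x @ w_dq.t()
    torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)
```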
I tested this PR with the recently added GLM-4V support (multimodal model) using the following command:
And was met with the following errors:
I attempted to step through the debugger to see if I could parse what the issue is, but it is above my pay grade. Here is the vision config for GLM-4V:
Does this appear to be an issue with the quant. implementation, or perhaps something else?
@iamthemulti I have not tested this PR against VLMs so I'm unable to comment on that. I will take a look later. cc @Isotr0py
@iamthemulti @AlpinDale That's because pixel_values inputs for the vision encoder usually have an extra batch_size dim, with a shape like ...
if bias is None:
    return ops.fp_eXmY_linear_forward_cuda(
        self.quant_config.exponent_bits,
        self.quant_config.mantissa_bits, x, weights, scales, splitK)
else:
    return ops.fp_eXmY_linear_forward_cuda(
        self.quant_config.exponent_bits,
        self.quant_config.mantissa_bits, x, weights, scales,
        splitK) + bias
Suggested change:

reshape_after_matmul = False
if x.ndim > 2:
    original_shape = x.shape
    x = x.reshape(-1, x.size(-1))
    reshape_after_matmul = True
out = ops.fp_eXmY_linear_forward_cuda(
    self.quant_config.exponent_bits,
    self.quant_config.mantissa_bits, x, weights, scales, splitK)
if reshape_after_matmul:
    out = out.view(*original_shape[:-1], out.size(-1))
if bias is not None:
    out += bias
return out
FYI, I can run glm-4v inference with this change. :)
This pull request has merge conflicts that must be resolved before it can be merged.
Could you please let me know if this feature will continue to be developed? In what ways is it currently not meeting expectations?
This PR adds support for loading FP16 models in custom Floating-Point formats at runtime. It currently supports FP4, FP5, FP6, and FP7.
Some useful features:
The method is based on FP6-LLM, and ported from aphrodite-engine/aphrodite-engine#755
FIX #8716
FIX #4515
Usage
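As a hedged sketch only, here is how this could be used from the Python API, based on the `-q fp6_weights` argument shown in the instructlab commit message earlier in this thread; the final method names accepted by this PR may differ:

```python
# Hedged usage sketch. The method name "fp6_weights" is taken from the
# instructlab commit message quoted in this thread and may not match the
# final flag names; FP4/FP5/FP6/FP7 variants are described above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization="fp6_weights",   # FP16 weights quantized to FP6 at load time
    tensor_parallel_size=4,
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```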
TODO
Benchmarks
Throughput
Tested on NVIDIA H100 SXM.
(*The GPU label in the image is a typo; the hardware is an H100.)

Meta-Llama-3.1-8B
Mistral-Small-Instruct-2409 (FP8 OOM'd at TP1, had to quantize first):
Accuracy
Meta-Llama-3.1-8B-Instruct
FP4
1:15 minutes to run GSM8K eval
FP5
1:15 minutes to run GSM8K eval
FP6
1:17 minutes to run GSM8K eval
FP7
1:17 minutes to run GSM8K eval
FP8 W8A8
1:17 minutes to run GSM8K eval
BF16
1:14 minutes to run GSM8K eval
Mistral-Small-Instruct-2409
FP4
2:32 minutes to run GSM8K eval
FP5
2:46 minutes to run GSM8K eval
FP6
2:46 minutes to run GSM8K eval
FP7
2:53 minutes to run GSM8K eval
FP8 W8A8
1:20 minutes to run GSM8K eval
BF16
1:33 minutes to run GSM8K eval
cc @mgoin @robertgshaw2-neuralmagic
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Model]` for adding a new model or improving an existing model. Model name should appear in the title.
- `[Frontend]` for changes on the vLLM frontend (e.g., OpenAI API server, `LLM` class, etc.)
- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
- `[Core]` for changes in the core vLLM logic (e.g., `LLMEngine`, `AsyncLLMEngine`, `Scheduler`, etc.)
- `[Hardware][Vendor]` for hardware-specific changes. Vendor name should appear in the prefix (e.g., `[Hardware][AMD]`).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Pass all linter checks. Please use `format.sh` to format your code.
- Add documentation to `docs/source/` if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels
- Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.
- Custom operations that return `Tensors` require meta-functions. Meta-functions should be implemented and registered in Python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
- Use `torch.library.opcheck()` to test the function registration and meta-function for any registered ops. See `tests/kernels` for examples.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with `rfc-required` and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- After the review, the reviewer will put an `action-required` label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!