
[Kernel][Quantization] Custom Floating-Point Runtime Quantization #8751

Status: Open (wants to merge 24 commits into base: main)

Conversation

AlpinDale (Contributor) commented Sep 23, 2024

This PR adds support for loading FP16 models in custom floating-point formats at runtime. It currently supports FP4, FP5, FP6, and FP7.

Some useful features:

  • Loads weights directly in the specified data type, so you don't need enough memory to hold the FP16 weights (unlike the current FP8 quant in vLLM)
  • Preserves FP16 batched throughput
  • FP6 matches FP8/FP16 accuracy
  • Near-instant quantization: ~8 seconds for an 8B model

The method is based on FP6-LLM and was ported from aphrodite-engine/aphrodite-engine#755.

FIX #8716
FIX #4515
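
For intuition about what these ExMy formats do: each FP16 weight is rounded onto a small sign/exponent/mantissa grid with a per-output-channel scale, and the kernel dequantizes back to FP16 on the fly inside the GEMM while activations stay in FP16. The snippet below is only a rough quantize-dequantize (QDQ) simulation of that idea, not the PR's packing or kernel code; the helper names are hypothetical, and the exponent/mantissa splits shown are just one plausible choice (the PR configures these via exponent_bits/mantissa_bits in the quant config, and its log reports FP6_E2M3, for example).

import torch

def fp_exmy_grid(exp_bits: int, man_bits: int) -> torch.Tensor:
    """All non-negative values of a tiny ExMy float format
    (IEEE-style bias and subnormals, no inf/nan)."""
    bias = 2 ** (exp_bits - 1) - 1
    vals = {0.0}
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:   # subnormal numbers
                vals.add((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:        # normal numbers
                vals.add((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    return torch.tensor(sorted(vals))

def fake_quant(w: torch.Tensor, exp_bits: int, man_bits: int) -> torch.Tensor:
    """QDQ simulation: per-output-channel absmax scale, then round each weight
    to the nearest ExMy grid value and scale back."""
    grid = fp_exmy_grid(exp_bits, man_bits).to(w.dtype)
    scale = (w.abs().amax(dim=1, keepdim=True) / grid.max()).clamp(min=1e-6)
    x = w / scale
    idx = (x.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)  # nearest grid point
    return grid[idx] * x.sign() * scale

w = torch.randn(8, 16, dtype=torch.float16)
for e_bits, m_bits in [(2, 1), (2, 2), (2, 3), (3, 3)]:  # 4-, 5-, 6-, 7-bit incl. sign
    err = (fake_quant(w, e_bits, m_bits) - w).abs().max().item()
    print(f"E{e_bits}M{m_bits}: max abs round-off {err:.4f}")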

Usage

vllm serve NousResearch/Meta-Llama-3.1-8B-Instruct -q fp6_weights
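
The same can be done through the offline API; a minimal sketch, where the only PR-specific assumption is the "fp6_weights" quantization name (everything else is the standard vLLM Python API):

from vllm import LLM, SamplingParams

llm = LLM(
    model="NousResearch/Meta-Llama-3.1-8B-Instruct",
    quantization="fp6_weights",  # weights are quantized to FP6 at load time
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What does FP6 weight-only quantization trade off?"], params)
print(outputs[0].outputs[0].text)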

TODO

  • Investigate issues with Qwen models
  • Implement channel-wise quantization to improve FP4 performance
  • Extend support to FP8 (primarily for Ampere GPUs, as an alternative to Marlin)
  • Try to support Turing (low priority)

Benchmarks

Throughput

Tested on NVIDIA H100 SXM.

[throughput chart]

(H100; the GPU name in the image above is a typo)

[throughput chart]

Expand for detailed numbers
#!/bin/bash

quantization_modes=("fp4_weights" "fp5_weights" "fp6_weights" "fp7_weights" "fp8")
tp_sizes=(1 2 4)
models=("NousResearch/Meta-Llama-3.1-8B-Instruct" "mistralai/Mistral-Small-Instruct-2409")

for model in "${models[@]}"; do
  for quantization_mode in "${quantization_modes[@]}"; do
    for tp_size in "${tp_sizes[@]}"; do
      python benchmarks/benchmark_throughput.py \
          --model "$model" \
          --gpu-memory-utilization 0.98 \
          --input-len 512 \
          --output-len 52 \
          --quantization "$quantization_mode" \
          -tp "$tp_size" \
          --max-model-len 2048 \
          --enable-chunked-prefill
    done
  done
done

Meta-Llama-3.1-8B

| Data Type | TP Size | Request/s | Total Tokens/s | Weight Mem Usage (GiB) |
|---|---|---|---|---|
| FP4 | 1 | 24.56 | 13851.68 | 3.62 |
| FP4 | 2 | 32.00 | 18046.53 | |
| FP4 | 4 | 34.68 | 19560.06 | |
| FP5 | 1 | 23.13 | 13044.27 | 6.11 |
| FP5 | 2 | 30.69 | 17307.73 | |
| FP5 | 4 | 32.74 | 18462.69 | |
| FP6 | 1 | 23.24 | 13104.85 | 6.87 |
| FP6 | 2 | 30.91 | 17431.46 | |
| FP6 | 4 | 33.14 | 18690.69 | |
| FP7 | 1 | 22.48 | 12678.27 | 7.74 |
| FP7 | 2 | 30.24 | 17053.81 | |
| FP7 | 4 | 33.44 | 18857.93 | |
| FP8 (W8A8) | 1 | 34.27 | 19328.41 | 8.49 |
| FP8 (W8A8) | 2 | 29.61 | 16699.40 | |
| FP8 (W8A8) | 4 | 29.07 | 16392.71 | |
| BF16 | 1 | 36.29 | 20466.76 | 14.98 |
| BF16 | 2 | 30.70 | 17312.38 | |
| BF16 | 4 | 31.47 | 17751.32 | |

Mistral-Small-Instruct-2409 (FP8 OOM'd at TP1, had to quantize first):

| Data Type | TP Size | Request/s | Total Tokens/s | Weight Mem Usage (GiB) |
|---|---|---|---|---|
| FP4 | 1 | 10.39 | 5859.19 | 10.93 |
| FP4 | 2 | 16.34 | 9214.50 | |
| FP4 | 4 | 21.07 | 11885.75 | |
| FP5 | 1 | 9.51 | 5365.26 | 13.47 |
| FP5 | 2 | 15.20 | 8574.19 | |
| FP5 | 4 | 19.91 | 11227.66 | |
| FP6 | 1 | 9.62 | 5424.02 | 16.07 |
| FP6 | 2 | 15.35 | 8659.38 | |
| FP6 | 4 | 19.93 | 11237.89 | |
| FP7 | 1 | 9.17 | 5169.51 | 18.59 |
| FP7 | 2 | 14.88 | 8393.25 | |
| FP7 | 4 | 19.74 | 11133.56 | |
| FP8 (W8A8) | 1 | 23.23 | 19328.41 | 21.11 |
| FP8 (W8A8) | 2 | 19.61 | 11057.32 | |
| FP8 (W8A8) | 4 | 19.77 | 11153.04 | |
| BF16 | 1 | 18.25 | 10290.51 | 41.44 |
| BF16 | 2 | 19.31 | 10890.39 | |
| BF16 | 4 | 19.17 | 10811.15 | |

Accuracy

[GSM8K accuracy charts]

Expand for more details
lm_eval --model vllm --model_args pretrained=NousResearch/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=2,add_bos_token=True,quantization=fp4_weights --tasks gsm8k --num_fewshot 5 --batch_size auto

Meta-Llama-3.1-8B-Instruct (lm_eval gsm8k v3, 5-shot, exact_match)

| Data Type | GSM8K Eval Time | flexible-extract | strict-match |
|---|---|---|---|
| FP4 | 1:15 | 0.6619 ± 0.0130 | 0.6497 ± 0.0131 |
| FP5 | 1:15 | 0.7551 ± 0.0118 | 0.6566 ± 0.0131 |
| FP6 | 1:17 | 0.7930 ± 0.0112 | 0.7293 ± 0.0122 |
| FP7 | 1:17 | 0.7771 ± 0.0115 | 0.7187 ± 0.0124 |
| FP8 (W8A8) | 1:17 | 0.7733 ± 0.0115 | 0.6899 ± 0.0127 |
| BF16 | 1:14 | 0.7771 ± 0.0115 | 0.7096 ± 0.0125 |

Mistral-Small-Instruct-2409 (lm_eval gsm8k v3, 5-shot, exact_match)

| Data Type | GSM8K Eval Time | flexible-extract | strict-match |
|---|---|---|---|
| FP4 | 2:32 | 0.7642 ± 0.0117 | 0.7513 ± 0.0119 |
| FP5 | 2:46 | 0.8211 ± 0.0106 | 0.8097 ± 0.0108 |
| FP6 | 2:46 | 0.8203 ± 0.0106 | 0.8059 ± 0.0109 |
| FP7 | 2:53 | 0.8165 ± 0.0107 | 0.8021 ± 0.0110 |
| FP8 (W8A8) | 1:20 | 0.8249 ± 0.0105 | 0.8120 ± 0.0108 |
| BF16 | 1:33 | 0.8249 ± 0.0105 | 0.8097 ± 0.0108 |

cc @mgoin @robertgshaw2-neuralmagic


PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. This helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide status updates every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

AlpinDale and others added 2 commits September 23, 2024 15:56
Co-authored-by: intervitens <intervitens@tutanota.com>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@flozi00 flozi00 mentioned this pull request Sep 24, 2024

mgoin (Member) commented Sep 24, 2024

Comparing wheel sizes from the docker build job.

before:

[2024-09-23T23:43:23Z] #31 0.944 Wheel dist/vllm-0.6.1.post3.dev102+g17ed7583.d20240923.cu124-cp38-abi3-linux_x86_64.whl is within the allowed size (217.23 MB).

after:

[2024-09-24T14:14:40Z] #31 0.792 Wheel dist/vllm-0.6.1.post2+cu124-cp38-abi3-linux_x86_64.whl is within the allowed size (221.51 MB).

This seems suspiciously small, but I'll take that as a positive signal!

// stream (0) this fixes problem with CUDA graphs when used with
// torch.compile()
auto stream = at::cuda::getCurrentCUDAStream();

Collaborator

Similar to @mgoin's suggestion above, could you add the following to make sure the kernel runs on the right GPU?

at::cuda::OptionalCUDAGuard const device_guard(device_of(_in_feats));

And when you add unit tests, it would be nice to add a test for running on a GPU other than GPU 0. (We do this in tests/kernels/test_cutlass.py for example)

AlpinDale (Contributor Author)

Tried this with little luck, but maybe I did it wrong. Will try again.

ghost commented Sep 25, 2024

This seems to overlap/replace the FP6/8 "deepspeedfp" quantization added in #4652.

mgoin (Member) commented Sep 25, 2024

@charai-frontend yep I think we can remove deepspeedfp after this lands

@AlpinDale AlpinDale mentioned this pull request Sep 25, 2024
AlpinDale (Contributor Author)

@mgoin Since CUDA graph issues are rare, I believe we can keep eager mode disabled for this quant, but with a warning that the user should enable it if they run into any issues.

russellb added a commit to russellb/instructlab that referenced this pull request Sep 25, 2024
This change adds a new configuration option for vllm: a path to a
Python virtual environment to use when running vllm. This allows you
to run a custom version of vllm for development and testing purposes,
even if that version of vllm has incompatible python dependencies with
the current version of instructlab.

As an example of how this is useful, I wanted to try the following
vllm PR to run mixtral on a host with less GPU memory than is
typically required:

    vllm-project/vllm#8751

To install vllm into its own venv:

    python3.11 -m venv venv-vllm
    . venv-vllm/bin/activate
    git clone https://github.com/vllm-project/vllm
    cd vllm
    git fetch origin pull/8751/head
    git checkout FETCH_HEAD
    export CUDA_HOME=/usr/local/cuda
    export PATH=${CUDA_HOME}/bin:${PATH}
    pip install -e .

Back in my instructlab venv ...

    . venv-ilab/bin/activate
    ilab model download --repository mistralai/Mixtral-8x7B-Instruct-v0.1
    ilab config edit

Edit vllm config. Note the value of `--tensor-parallel-size` is for
the number of GPUs I have available.

    serve:
      backend: vllm
      model_path: /home/ec2-user/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1
      vllm:
        vllm_args: ['--tensor-parallel-size', '4', '-q', 'fp6_weights']
        vllm_venv: "/home/ec2-user/venv"

This now works as mixtral is quantized at load-time, cutting the gpu
memory requirement in half.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
AlpinDale (Contributor Author)

Needle test results for the Magnum-v2-72B model (a Qwen2-72B finetune) over 16,384 tokens of context. FP6 seems to slightly outperform BF16.

magnum-v2-72b-16K.png

magnum-v2-72b-fp6-16K.png

russellb added a commit to russellb/instructlab that referenced this pull request Sep 27, 2024
This change adds a new configuration option for vllm: a path to the
vllm executable. This allows you to point to an instance of `vllm`
installed in a different Python virtual environment than `ilab`.

This also changes the `vllm` backend to run `vllm serve` instead of an
older style of specifying the openai api server entrypoint manually.

As an example of how this is useful, I wanted to try the following
vllm PR to run mixtral on a host with less GPU memory than is
typically required. Running this version has Python dependencies not
compatible with current `ilab`.

    vllm-project/vllm#8751

To install vllm into its own venv:

    python3.11 -m venv venv-vllm
    . venv-vllm/bin/activate
    git clone https://github.com/vllm-project/vllm
    cd vllm
    git fetch origin pull/8751/head
    git checkout FETCH_HEAD
    export CUDA_HOME=/usr/local/cuda
    export PATH=${CUDA_HOME}/bin:${PATH}
    pip install -e .

Back in my instructlab venv ...

    . venv-ilab/bin/activate
    ilab model download --repository mistralai/Mixtral-8x7B-Instruct-v0.1
    ilab config edit

Edit vllm config. Note the value of `--tensor-parallel-size` is for
the number of GPUs I have available.

    serve:
      backend: vllm
      model_path: /home/ec2-user/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1
      vllm:
        vllm_args: ['--tensor-parallel-size', '4', '-q', 'fp6_weights']
        vllm_path: "/home/ec2-user/venv/bin/vllm"

This now works as mixtral is quantized at load-time, cutting the gpu
memory requirement in half.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Comment on lines 230 to 233
// NOTE(alpin): use at::cuda::getCurrentCUDAStream() instead of default
// stream (0) this fixes problem with CUDA graphs when used with
// torch.compile()
auto stream = at::cuda::getCurrentCUDAStream();
Member

Did you try passing in _in_feats.device() to get the CUDA stream for the current device? Usually we do this: at::cuda::getCurrentCUDAStream(dev)

AlpinDale (Contributor Author)

This change seems to break tensor parallelism with custom all-reduce:

(VllmWorkerProcess pid=2878816) INFO 10-27 21:26:21 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/austin/.cache/vllm/gpu_p2p_access_cache_for_8,9.json
Failed: Cuda error /home/austin/disk1/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /home/austin/disk1/vllm/csrc/custom_all_reduce.cuh:336 'invalid argument'

Reverting for now.

AlpinDale (Contributor Author)

Update: doesn't seem to be caused by this.

weights = weight.data
scales = weight.scales
out_dim, in_dim = weights.shape
bsize = x.shape[0]
Member

Should bsize actually be the number of tokens, i.e. the batch size and sequence length flattened together? For instance, for Marlin we flatten the input:

reshaped_x = input.reshape(-1, input.shape[-1])
out_shape = input.shape[:-1] + (output_size_per_partition, )

AlpinDale (Contributor Author)

Good point - I'll try this and get back to you (sorry for the late response, I've been busy with work)

AlpinDale (Contributor Author)

Flattened batch size:

Processed prompts: 100%|█| 1000/1000 [01:14<00:00, 13.50it/s, est. speed input: 6909.52 toks/s, output: 701.75 tok/s

First dim only:

Processed prompts: 100%|█| 1000/1000 [01:14<00:00, 13.48it/s, est. speed input: 6899.52 toks/s, output: 700.73 tok/s

Likely within the margin of error; I think we can go with this.

    DEFAULT_FP_EXMY_EXP_BITS, VALID_FP_EXMY_METHODS)
if self.quantization is not None and self.quantization in \
        VALID_FP_EXMY_METHODS:
    fp_bits = int(self.quantization[2])
Member

This is kind of sketchy
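
(Not from the PR; just a hypothetical, more explicit way to recover the bit width than slicing the method name, assuming the fp4_weights..fp7_weights names used elsewhere in this PR:)

import re

VALID_FP_EXMY_METHODS = ["fp4_weights", "fp5_weights", "fp6_weights", "fp7_weights"]

def parse_fp_exmy_bits(quantization: str) -> int:
    """Extract the total bit width from a method name like 'fp6_weights'."""
    match = re.fullmatch(r"fp([4-7])_weights", quantization)
    if match is None or quantization not in VALID_FP_EXMY_METHODS:
        raise ValueError(f"Unsupported fp_eXmY method: {quantization!r}")
    return int(match.group(1))

assert parse_fp_exmy_bits("fp6_weights") == 6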

mgoin (Member) left a comment

As mentioned offline, it would be nice to have some kernel unit tests to compare against baseline unquantized or qdq

iamthemulti commented:

I tested this PR with the recently added GLM-4V support (multimodal model) using the following command:

python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4v-9b --quantization fp6_weights --trust_remote_code --max-model-len 8192 --gpu-memory-utilization 0.85 --enforce-eager

And was met with the following errors:

/home/user/PycharmProjects/vllm/.venv/bin/python -m vllm.entrypoints.openai.api_server --model THUDM/glm-4v-9b --quantization fp6_weights --trust_remote_code --max-model-len 8192 --gpu-memory-utilization 0.85 --enforce-eager 
INFO 10-14 12:32:15 api_server.py:528] vLLM API server version 0.1.dev2997+g50f3245
INFO 10-14 12:32:15 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='THUDM/glm-4v-9b', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.85, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='fp6_weights', rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-14 12:32:15 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/6127757f-aec4-4061-88e2-6fe8713c6577 for IPC Path.
INFO 10-14 12:32:15 api_server.py:179] Started engine process with PID 25188
INFO 10-14 12:32:27 config.py:304] torch.bfloat16 data type is not supported for fp6 quantization. Using float16 instead.
WARNING 10-14 12:32:27 config.py:310] CUDA Graph execution may not work with fp6 quantization. You can try disabling it with `enforce_eager=True` if you run into issues.
WARNING 10-14 12:32:27 config.py:405] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-14 12:32:27 llm_engine.py:237] Initializing an LLM engine (v0.1.dev2997+g50f3245) with config: model='THUDM/glm-4v-9b', speculative_config=None, tokenizer='THUDM/glm-4v-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp6_weights, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=THUDM/glm-4v-9b, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-14 12:32:28 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 10-14 12:32:29 config.py:304] torch.bfloat16 data type is not supported for fp6 quantization. Using float16 instead.
WARNING 10-14 12:32:29 config.py:310] CUDA Graph execution may not work with fp6 quantization. You can try disabling it with `enforce_eager=True` if you run into issues.
WARNING 10-14 12:32:29 config.py:405] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-14 12:32:29 model_runner.py:1045] Starting to load model THUDM/glm-4v-9b...
INFO 10-14 12:32:29 fp_eXmY.py:60] Loading model in FP6_E2M3 format.
INFO 10-14 12:32:29 weight_utils.py:243] Using model weights format ['*.safetensors']
WARNING 10-14 12:32:30 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:00<00:13,  1.06it/s]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:01<00:11,  1.09it/s]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:02<00:11,  1.09it/s]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:03<00:09,  1.14it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:04<00:08,  1.16it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:05<00:07,  1.17it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:06<00:06,  1.18it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:06<00:06,  1.17it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:07<00:05,  1.15it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:08<00:03,  1.45it/s]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [00:08<00:02,  1.40it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:09<00:02,  1.41it/s]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [00:10<00:01,  1.33it/s]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:11<00:00,  1.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:12<00:00,  1.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:12<00:00,  1.25it/s]

INFO 10-14 12:32:46 model_runner.py:1056] Loading model weights took 11.2750 GB
WARNING 10-14 12:32:54 tokenizer.py:169] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 10-14 12:32:54 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-123254.pkl...
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/user/PycharmProjects/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/worker/model_runner.py", line 1650, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/chatglm.py", line 613, in forward
    hidden_states = self.transformer(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/chatglm.py", line 539, in forward
    image_embeds = self.vision(pixel_values)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/glm4_vision_encoder.py", line 284, in forward
    x = self.transformer(x)
        ^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/glm4_vision_encoder.py", line 174, in forward
    hidden_states = layer_module(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/glm4_vision_encoder.py", line 151, in forward
    self.attention(attention_input))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/models/glm4_vision_encoder.py", line 84, in forward
    qkv, _ = self.query_key_value(x)  # B, L, 3 * H * D
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/layers/linear.py", line 371, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/model_executor/layers/quantization/fp_eXmY.py", line 165, in apply
    return ops.fp_eXmY_linear_forward_cuda(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/_custom_ops.py", line 45, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/_custom_ops.py", line 322, in fp_eXmY_linear_forward_cuda
    return torch.ops._C.fp_eXmY_linear_forward_cuda(EXPONENT, MANTISSA,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected in_features to be a multiple of 64, but received 6401

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/PycharmProjects/vllm/vllm/engine/multiprocessing/engine.py", line 392, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
    return cls(
           ^^^^
  File "/home/user/PycharmProjects/vllm/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/engine/llm_engine.py", line 349, in __init__
    self._initialize_kv_caches()
  File "/home/user/PycharmProjects/vllm/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/worker/model_runner.py", line 1294, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-123254.pkl): Expected in_features to be a multiple of 64, but received 6401
INFO 10-14 12:32:54 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-123254.pkl.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/PycharmProjects/vllm/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/user/PycharmProjects/vllm/.venv/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/PycharmProjects/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start

Process finished with exit code 1

I attempted to step through the debugger to see if I could parse what the issue is, but it is above my pay grade.

Here is the vision config for GLM-4V:

  "vision_config": {
    "dropout_prob": 0.0,
    "hidden_act": "gelu",
    "in_channels": 3,
    "num_hidden_layers": 63,
    "hidden_size": 1792,
    "patch_size": 14,
    "num_heads": 16,
    "intermediate_size": 15360,
    "layer_norm_eps": 1e-06,
    "num_positions": 6401,
    "image_size": 1120,
    "scaling_factor": 8
  },

Does this appear to be an issue with the quant. implementation, or perhaps something else?

@AlpinDale AlpinDale requested a review from WoosukKwon as a code owner October 27, 2024 20:41
AlpinDale (Contributor Author)

@iamthemulti I have not tested this PR against VLMs so I'm unable to comment on that. I will take a look later. cc @Isotr0py

Isotr0py (Collaborator)

@iamthemulti @AlpinDale That's because pixel_values inputs for the vision encoder usually have an extra batch_size dim, with a shape like torch.Size([2, 6401, 1792]), which causes the output dim to be indexed incorrectly.

Comment on lines +165 to +173
        if bias is None:
            return ops.fp_eXmY_linear_forward_cuda(
                self.quant_config.exponent_bits,
                self.quant_config.mantissa_bits, x, weights, scales, splitK)
        else:
            return ops.fp_eXmY_linear_forward_cuda(
                self.quant_config.exponent_bits,
                self.quant_config.mantissa_bits, x, weights, scales,
                splitK) + bias
Collaborator

Suggested change (replace the block above with):

        reshape_after_matmul = False
        if x.ndim > 2:
            original_shape = x.shape
            x = x.reshape(-1, x.size(-1))
            reshape_after_matmul = True
        out = ops.fp_eXmY_linear_forward_cuda(
            self.quant_config.exponent_bits,
            self.quant_config.mantissa_bits, x, weights, scales, splitK)
        if reshape_after_matmul:
            out = out.view(*original_shape[:-1], out.size(-1))
        if bias is not None:
            out += bias
        return out

FYI, I can run glm-4v inference with this change. :)

@tjtanaa tjtanaa mentioned this pull request Dec 10, 2024
@mergify mergify bot added the ci/build label Dec 11, 2024

mergify bot commented Dec 11, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AlpinDale.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 11, 2024
dengyingxu commented:

Could you please let me know if this feature will continue to be developed? In what ways is it currently not meeting expectations?

Successfully merging this pull request may close these issues: [RFC]: quant llm from alpindale; [Feature]: FP6.