
[VLM] Implement merged multimodal processor and V1 support for idefics3 #12660

Merged
merged 18 commits into vllm-project:main on Feb 4, 2025

Conversation

Isotr0py (Collaborator) commented Feb 2, 2025

Continues #12509; I used an incorrect command when force-pushing for the DCO sign-off by mistake before 😅

TODO

github-actions bot commented Feb 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of it by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the documentation label (Improvements or additions to documentation) Feb 4, 2025
@Isotr0py Isotr0py marked this pull request as ready for review February 4, 2025 07:42
Isotr0py and others added 5 commits February 4, 2025 16:04
DarkLight1337 (Member) commented

I'm unable to run the example script: python examples/offline_inference/vision_language.py -m idefics3 even after the typo fix.


mergify bot commented Feb 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 4, 2025
Signed-off-by: Isotr0py <2037008807@qq.com>
@mergify mergify bot removed the needs-rebase label Feb 4, 2025
Signed-off-by: Isotr0py <2037008807@qq.com>
Isotr0py (Collaborator, Author) commented Feb 4, 2025

The example script should work now:

$ python examples/offline_inference/vision_language.py -m idefics3
INFO 02-04 09:20:13 __init__.py:186] Automatically detected platform cuda.
WARNING 02-04 09:20:14 config.py:2387] Casting torch.bfloat16 to torch.float16.
INFO 02-04 09:20:21 config.py:542] This model supports multiple tasks: {'reward', 'embed', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
INFO 02-04 09:20:21 config.py:1402] Defaulting to use mp for distributed inference
WARNING 02-04 09:20:21 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-04 09:20:21 config.py:678] Async output processing is not supported on the current platform type cuda.
INFO 02-04 09:20:21 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev4406+g1298a40) with config: model='HuggingFaceM4/Idefics3-8B-Llama3', speculative_config=None, tokenizer='HuggingFaceM4/Idefics3-8B-Llama3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=HuggingFaceM4/Idefics3-8B-Llama3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs={'size': {'longest_edge': 1092}}, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
WARNING 02-04 09:20:22 multiproc_worker_utils.py:300] Reducing Torch parallelism from 2 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 02-04 09:20:22 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:22 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
INFO 02-04 09:20:23 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-04 09:20:23 cuda.py:227] Using XFormers backend.
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:23 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:23 cuda.py:227] Using XFormers backend.
INFO 02-04 09:20:34 utils.py:940] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 utils.py:940] Found nccl from library libnccl.so.2
INFO 02-04 09:20:34 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 02-04 09:20:34 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 02-04 09:20:34 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_88740f79'), local_subscribe_port=58099, remote_subscribe_port=None)
INFO 02-04 09:20:34 model_runner.py:1113] Starting to load model HuggingFaceM4/Idefics3-8B-Llama3...
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 model_runner.py:1113] Starting to load model HuggingFaceM4/Idefics3-8B-Llama3...
INFO 02-04 09:20:34 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-04 09:20:34 cuda.py:227] Using XFormers backend.
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 cuda.py:227] Using XFormers backend.
INFO 02-04 09:20:34 config.py:2993] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 config.py:2993] cudagraph sizes specified by model runner [] is overridden by config []
INFO 02-04 09:20:35 weight_utils.py:252] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:35 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:05,  1.72s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:06,  3.18s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:09<00:03,  3.53s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:13<00:00,  3.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:13<00:00,  3.41s/it]

(VllmWorkerProcess pid=40920) INFO 02-04 09:20:49 model_runner.py:1118] Loading model weights took 7.9459 GB
INFO 02-04 09:20:49 model_runner.py:1118] Loading model weights took 7.9459 GB
(VllmWorkerProcess pid=40920) INFO 02-04 09:21:01 worker.py:267] Memory profiling takes 11.91 seconds
(VllmWorkerProcess pid=40920) INFO 02-04 09:21:01 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
(VllmWorkerProcess pid=40920) INFO 02-04 09:21:01 worker.py:267] model weights take 7.95GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 0.60GiB; the rest of the memory reserved for KV Cache is 4.60GiB.
INFO 02-04 09:21:01 worker.py:267] Memory profiling takes 12.13 seconds
INFO 02-04 09:21:01 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 02-04 09:21:01 worker.py:267] model weights take 7.95GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 0.60GiB; the rest of the memory reserved for KV Cache is 4.60GiB.
INFO 02-04 09:21:01 executor_base.py:110] # CUDA blocks: 4709, # CPU blocks: 4096
INFO 02-04 09:21:01 executor_base.py:115] Maximum concurrency for 8192 tokens per request: 9.20x
INFO 02-04 09:21:05 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 16.79 seconds
WARNING 02-04 09:21:08 utils.py:1464] The following intended overrides are not keyword-only args and and will be dropped: {'size'}
WARNING 02-04 09:21:08 utils.py:1464] The following intended overrides are not keyword-only args and and will be dropped: {'size'}
WARNING 02-04 09:21:08 utils.py:1464] The following intended overrides are not keyword-only args and and will be dropped: {'size'}
WARNING 02-04 09:21:08 utils.py:1464] The following intended overrides are not keyword-only args and and will be dropped: {'size'}
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:08<00:00,  2.13s/it, est. speed input: 593.03 toks/s, output: 30.10 toks/s]
 The image depicts a scene featuring a tall, slender tower in the background, which is partially obscured by a cluster of cherry blossom trees. The tower is white and appears to be a modern structure, possibly a skyscraper or a tower building, with a cylindrical shape and multiple floors. The tower is situated against a clear blue
 The image depicts a scene featuring a tall, white tower with a modern architectural design. The tower is partially obscured by a backdrop of delicate, pink cherry blossoms. The blossoms are in full bloom, creating a beautiful contrast against the clear blue sky. The blossoms are clustered on slender branches, and their petals are
 The image depicts a scene featuring a tall, slender tower structure in the background, which appears to be the Tokyo Tower, identifiable by its distinctive design and height. The tower is primarily white and has a lattice-like structure, characteristic of its architectural style. The sky above the tower is clear and blue, suggesting a bright and
 The image depicts a scene featuring a cherry blossom tree in full bloom against a clear blue sky. The cherry blossoms are vibrant and numerous, with delicate pink petals that are slightly open, giving the tree a lush and vibrant appearance. The tree's branches are visible, extending upwards and outwards, creating a natural frame for
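
(Aside: the "Maximum concurrency for 8192 tokens per request: 9.20x" figure in the log follows directly from the KV cache numbers: 4709 CUDA blocks × 16 tokens per block = 75,344 cacheable tokens, and 75,344 / 8,192 ≈ 9.20, assuming vLLM's default block size of 16.)

For readers who want the gist without opening the example script, here is a minimal sketch of what it exercises through vLLM's public offline-inference API, with the engine options mirroring the config line in the log above. The prompt template, question, and image path are illustrative assumptions, not code lifted from vision_language.py:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Engine options mirror the logged config: float16, 8192-token context,
# tensor_parallel_size=2, eager mode, and the Idefics3 longest-edge resize.
llm = LLM(
    model="HuggingFaceM4/Idefics3-8B-Llama3",
    dtype="float16",
    max_model_len=8192,
    tensor_parallel_size=2,
    enforce_eager=True,
    mm_processor_kwargs={"size": {"longest_edge": 1092}},
)

# Idefics3-style chat prompt with an <image> placeholder (assumed template).
question = "What is the content of this image?"
prompt = f"<|begin_of_text|>User:<image>{question}<end_of_utterance>\nAssistant:"

image = Image.open("cherry_blossom.jpg")  # hypothetical local image file

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```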

DarkLight1337 (Member) commented

Can confirm.

V0 output:

# python examples/offline_inference/vision_language.py -m idefics3
 The image depicts a scene that combines natural beauty with urban architecture. The primary focus of the image is a tall, white tower that stands prominently in the background. The tower has a modern architectural design, characterized by its sleek, cylindrical shape and numerous windows. The tower is likely a skyscraper, possibly a well-known landmark
 The image depicts a scene dominated by a tall, white tower in the background, which appears to be the Tokyo Tower, a well-known landmark in Japan. The tower is partially obscured by a cluster of cherry blossom trees in full bloom. The cherry blossoms are vibrant with delicate pink and white petals, creating a beautiful contrast
 The image depicts a scene featuring a cherry blossom tree in full bloom against a clear blue sky. The tree is positioned in the foreground, with its branches extending towards the viewer. The blossoms are vibrant pink and appear to be in full bloom, creating a dense and lush canopy of flowers. The delicate petals of the bloss
 The image depicts a serene and picturesque scene featuring a cherry blossom tree in full bloom against a clear blue sky. The cherry blossoms are vibrant and numerous, with delicate pink petals that appear to be gently swaying in the breeze. The tree's branches are adorned with clusters of blossoms, creating a beautiful and intricate pattern

V1 output

# VLLM_USE_V1=1 python examples/offline_inference/vision_language.py -m idefics3
 The image depicts a scene that combines natural beauty with urban architecture. The primary focus is on a tree with delicate, pink blossoms that are in full bloom. The tree's branches are adorned with numerous small, light pink flowers, creating a soft, almost ethereal appearance. The blossoms are clustered together, forming a
 The image depicts a scene featuring a tall, white tower in the background, partially obscured by a cluster of delicate, pink cherry blossoms. The tower appears to be a modern structure, possibly a skyscraper or a monument, characterized by its sleek, metallic design and numerous windows. The tower is set against a clear blue
 The image depicts a scene featuring a cherry blossom tree in full bloom against a clear blue sky. The tree is positioned in the foreground, with its branches extending towards the viewer. The blossoms are vibrant pink and appear to be in full bloom, creating a dense and lush canopy of flowers. The delicate petals of the bloss
 The image depicts a scene featuring a tall, white tower in the background, partially obscured by a cluster of delicate, pink cherry blossoms. The cherry blossoms are in full bloom, with their petals hanging gracefully from the branches of the tree. The blossoms are predominantly light pink, with some of them displaying a slightly
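
(The VLLM_USE_V1=1 toggle above is just an environment variable; a minimal sketch of the equivalent in a Python session, assuming the variable is still read at engine construction time, so the safe order is to set it before importing vllm:)

```python
import os

# Opt into the V1 engine; set before importing vllm so engine selection sees it.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM  # noqa: E402
```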

DarkLight1337 (Member) left a comment


The processor test passes locally, so it should be good to go. Thanks for helping out with the refactoring effort!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 4, 2025 09:31
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 4, 2025
DarkLight1337 (Member) commented

It looks like #12553 broke the basic tests, but the problem went unnoticed because it was masked by the HF connection error.

DarkLight1337 (Member) commented

I'll fix this in another PR

@youkaichao youkaichao disabled auto-merge February 4, 2025 12:00
@youkaichao youkaichao merged commit 815079d into vllm-project:main Feb 4, 2025
45 of 50 checks passed
@Isotr0py Isotr0py deleted the v1-idefics3-fix branch February 4, 2025 12:09
fxmarty-amd pushed a commit to fxmarty-amd/vllm that referenced this pull request Feb 7, 2025
ShangmingCai pushed a commit to ShangmingCai/vllm that referenced this pull request Feb 10, 2025
panf2333 pushed a commit to yottalabsai/vllm that referenced this pull request Feb 18, 2025
kerthcet pushed a commit to kerthcet/vllm that referenced this pull request Feb 21, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Mar 5, 2025
Said-Akbar pushed a commit to Said-Akbar/vllm-rocm that referenced this pull request Mar 7, 2025
qscqesze pushed a commit to ZZBoom/vllm that referenced this pull request Mar 13, 2025
Labels
documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed)