
[VLM] Implement merged multimodal processor and V1 support for idefics3 #12660

Merged
merged 18 commits into vllm-project:main on Feb 4, 2025

Conversation

Isotr0py (Collaborator) commented Feb 2, 2025

Continues #12509; I used an incorrect command when force-pushing for the DCO sign-off by mistake before 😅

TODO

github-actions bot commented Feb 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of it by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the documentation label (Improvements or additions to documentation) Feb 4, 2025
@Isotr0py Isotr0py marked this pull request as ready for review February 4, 2025 07:42
Isotr0py and others added 5 commits February 4, 2025 16:04
DarkLight1337 (Member) commented

I'm unable to run the example script: python examples/offline_inference/vision_language.py -m idefics3 even after the typo fix.


mergify bot commented Feb 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 4, 2025
Signed-off-by: Isotr0py <2037008807@qq.com>
@mergify mergify bot removed the needs-rebase label Feb 4, 2025
Signed-off-by: Isotr0py <2037008807@qq.com>
Isotr0py (Collaborator, Author) commented Feb 4, 2025

The example script should work now:

$ python examples/offline_inference/vision_language.py -m idefics3
INFO 02-04 09:20:13 __init__.py:186] Automatically detected platform cuda.
WARNING 02-04 09:20:14 config.py:2387] Casting torch.bfloat16 to torch.float16.
INFO 02-04 09:20:21 config.py:542] This model supports multiple tasks: {'reward', 'embed', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
INFO 02-04 09:20:21 config.py:1402] Defaulting to use mp for distributed inference
WARNING 02-04 09:20:21 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-04 09:20:21 config.py:678] Async output processing is not supported on the current platform type cuda.
INFO 02-04 09:20:21 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev4406+g1298a40) with config: model='HuggingFaceM4/Idefics3-8B-Llama3', speculative_config=None, tokenizer='HuggingFaceM4/Idefics3-8B-Llama3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=HuggingFaceM4/Idefics3-8B-Llama3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs={'size': {'longest_edge': 1092}}, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
WARNING 02-04 09:20:22 multiproc_worker_utils.py:300] Reducing Torch parallelism from 2 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 02-04 09:20:22 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:22 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
INFO 02-04 09:20:23 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-04 09:20:23 cuda.py:227] Using XFormers backend.
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:23 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:23 cuda.py:227] Using XFormers backend.
INFO 02-04 09:20:34 utils.py:940] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 utils.py:940] Found nccl from library libnccl.so.2
INFO 02-04 09:20:34 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 pynccl.py:69] vLLM is using nccl==2.21.5
INFO 02-04 09:20:34 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 02-04 09:20:34 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_88740f79'), local_subscribe_port=58099, remote_subscribe_port=None)
INFO 02-04 09:20:34 model_runner.py:1113] Starting to load model HuggingFaceM4/Idefics3-8B-Llama3...
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 model_runner.py:1113] Starting to load model HuggingFaceM4/Idefics3-8B-Llama3...
INFO 02-04 09:20:34 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-04 09:20:34 cuda.py:227] Using XFormers backend.
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 cuda.py:227] Using XFormers backend.
INFO 02-04 09:20:34 config.py:2993] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:34 config.py:2993] cudagraph sizes specified by model runner [] is overridden by config []
INFO 02-04 09:20:35 weight_utils.py:252] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=40920) INFO 02-04 09:20:35 weight_utils.py:252] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:05,  1.72s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:05<00:06,  3.18s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:09<00:03,  3.53s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:13<00:00,  3.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:13<00:00,  3.41s/it]

(VllmWorkerProcess pid=40920) INFO 02-04 09:20:49 model_runner.py:1118] Loading model weights took 7.9459 GB
INFO 02-04 09:20:49 model_runner.py:1118] Loading model weights took 7.9459 GB
(VllmWorkerProcess pid=40920) INFO 02-04 09:21:01 worker.py:267] Memory profiling takes 11.91 seconds
(VllmWorkerProcess pid=40920) INFO 02-04 09:21:01 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
(VllmWorkerProcess pid=40920) INFO 02-04 09:21:01 worker.py:267] model weights take 7.95GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 0.60GiB; the rest of the memory reserved for KV Cache is 4.60GiB.
INFO 02-04 09:21:01 worker.py:267] Memory profiling takes 12.13 seconds
INFO 02-04 09:21:01 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 02-04 09:21:01 worker.py:267] model weights take 7.95GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 0.60GiB; the rest of the memory reserved for KV Cache is 4.60GiB.
INFO 02-04 09:21:01 executor_base.py:110] # CUDA blocks: 4709, # CPU blocks: 4096
INFO 02-04 09:21:01 executor_base.py:115] Maximum concurrency for 8192 tokens per request: 9.20x
INFO 02-04 09:21:05 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 16.79 seconds
WARNING 02-04 09:21:08 utils.py:1464] The following intended overrides are not keyword-only args and and will be dropped: {'size'}
WARNING 02-04 09:21:08 utils.py:1464] The following intended overrides are not keyword-only args and and will be dropped: {'size'}
WARNING 02-04 09:21:08 utils.py:1464] The following intended overrides are not keyword-only args and and will be dropped: {'size'}
WARNING 02-04 09:21:08 utils.py:1464] The following intended overrides are not keyword-only args and and will be dropped: {'size'}
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:08<00:00,  2.13s/it, est. speed input: 593.03 toks/s, output: 30.10 toks/s]
 The image depicts a scene featuring a tall, slender tower in the background, which is partially obscured by a cluster of cherry blossom trees. The tower is white and appears to be a modern structure, possibly a skyscraper or a tower building, with a cylindrical shape and multiple floors. The tower is situated against a clear blue
 The image depicts a scene featuring a tall, white tower with a modern architectural design. The tower is partially obscured by a backdrop of delicate, pink cherry blossoms. The blossoms are in full bloom, creating a beautiful contrast against the clear blue sky. The blossoms are clustered on slender branches, and their petals are
 The image depicts a scene featuring a tall, slender tower structure in the background, which appears to be the Tokyo Tower, identifiable by its distinctive design and height. The tower is primarily white and has a lattice-like structure, characteristic of its architectural style. The sky above the tower is clear and blue, suggesting a bright and
 The image depicts a scene featuring a cherry blossom tree in full bloom against a clear blue sky. The cherry blossoms are vibrant and numerous, with delicate pink petals that are slightly open, giving the tree a lush and vibrant appearance. The tree's branches are visible, extending upwards and outwards, creating a natural frame for
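
(Aside: the "Maximum concurrency for 8192 tokens per request: 9.20x" figure in the log follows directly from the KV cache numbers: 4709 CUDA blocks × 16 tokens per block = 75,344 cacheable tokens, and 75,344 / 8,192 ≈ 9.20, assuming vLLM's default block size of 16.)

For readers who want the gist without opening the example script, here is a minimal sketch of what it exercises through vLLM's public offline-inference API, with the engine options mirroring the config line in the log above. The prompt template, question, and image path are illustrative assumptions, not code lifted from vision_language.py:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Engine options mirror the logged config: float16, 8192-token context,
# tensor_parallel_size=2, eager mode, and the Idefics3 longest-edge resize.
llm = LLM(
    model="HuggingFaceM4/Idefics3-8B-Llama3",
    dtype="float16",
    max_model_len=8192,
    tensor_parallel_size=2,
    enforce_eager=True,
    mm_processor_kwargs={"size": {"longest_edge": 1092}},
)

# Idefics3-style chat prompt with an <image> placeholder (assumed template).
question = "What is the content of this image?"
prompt = f"<|begin_of_text|>User:<image>{question}<end_of_utterance>\nAssistant:"

image = Image.open("cherry_blossom.jpg")  # hypothetical local image file

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```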

DarkLight1337 (Member) commented

Can confirm.

V0 output:

# python examples/offline_inference/vision_language.py -m idefics3
 The image depicts a scene that combines natural beauty with urban architecture. The primary focus of the image is a tall, white tower that stands prominently in the background. The tower has a modern architectural design, characterized by its sleek, cylindrical shape and numerous windows. The tower is likely a skyscraper, possibly a well-known landmark
 The image depicts a scene dominated by a tall, white tower in the background, which appears to be the Tokyo Tower, a well-known landmark in Japan. The tower is partially obscured by a cluster of cherry blossom trees in full bloom. The cherry blossoms are vibrant with delicate pink and white petals, creating a beautiful contrast
 The image depicts a scene featuring a cherry blossom tree in full bloom against a clear blue sky. The tree is positioned in the foreground, with its branches extending towards the viewer. The blossoms are vibrant pink and appear to be in full bloom, creating a dense and lush canopy of flowers. The delicate petals of the bloss
 The image depicts a serene and picturesque scene featuring a cherry blossom tree in full bloom against a clear blue sky. The cherry blossoms are vibrant and numerous, with delicate pink petals that appear to be gently swaying in the breeze. The tree's branches are adorned with clusters of blossoms, creating a beautiful and intricate pattern

V1 output

# VLLM_USE_V1=1 python examples/offline_inference/vision_language.py -m idefics3
 The image depicts a scene that combines natural beauty with urban architecture. The primary focus is on a tree with delicate, pink blossoms that are in full bloom. The tree's branches are adorned with numerous small, light pink flowers, creating a soft, almost ethereal appearance. The blossoms are clustered together, forming a
 The image depicts a scene featuring a tall, white tower in the background, partially obscured by a cluster of delicate, pink cherry blossoms. The tower appears to be a modern structure, possibly a skyscraper or a monument, characterized by its sleek, metallic design and numerous windows. The tower is set against a clear blue
 The image depicts a scene featuring a cherry blossom tree in full bloom against a clear blue sky. The tree is positioned in the foreground, with its branches extending towards the viewer. The blossoms are vibrant pink and appear to be in full bloom, creating a dense and lush canopy of flowers. The delicate petals of the bloss
 The image depicts a scene featuring a tall, white tower in the background, partially obscured by a cluster of delicate, pink cherry blossoms. The cherry blossoms are in full bloom, with their petals hanging gracefully from the branches of the tree. The blossoms are predominantly light pink, with some of them displaying a slightly
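
(The VLLM_USE_V1=1 toggle above is just an environment variable; a minimal sketch of the equivalent in a Python session, assuming the variable is still read at engine construction time, so the safe order is to set it before importing vllm:)

```python
import os

# Opt into the V1 engine; set before importing vllm so engine selection sees it.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM  # noqa: E402
```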

DarkLight1337 (Member) left a comment


The processor test passes locally, so it should be good to go. Thanks for helping out with the refactoring effort!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 4, 2025 09:31
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 4, 2025
DarkLight1337 (Member) commented

It looks like #12553 broke the basic tests, but the problem went unnoticed because it was masked by the HF connection error.

DarkLight1337 (Member) commented

I'll fix this in another PR

@youkaichao youkaichao disabled auto-merge February 4, 2025 12:00
@youkaichao youkaichao merged commit 815079d into vllm-project:main Feb 4, 2025
45 of 50 checks passed
@Isotr0py Isotr0py deleted the v1-idefics3-fix branch February 4, 2025 12:09
fxmarty-amd pushed a commit to fxmarty-amd/vllm that referenced this pull request Feb 7, 2025
ShangmingCai pushed a commit to ShangmingCai/vllm that referenced this pull request Feb 10, 2025
panf2333 pushed a commit to yottalabsai/vllm that referenced this pull request Feb 18, 2025
kerthcet pushed a commit to kerthcet/vllm that referenced this pull request Feb 21, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Mar 5, 2025
Said-Akbar pushed a commit to Said-Akbar/vllm-rocm that referenced this pull request Mar 7, 2025
qscqesze pushed a commit to ZZBoom/vllm that referenced this pull request Mar 13, 2025
Labels
documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed)