Can't use the '--limit-mm-per-prompt' parameter #3054

Open
liuxingbo12138 opened this issue Mar 13, 2025 · 3 comments
System Info

CUDA: 12.4
transformers: 4.49.0
llama-cpp-python: 0.3.7
vllm: 0.7.2
Python: 3.10.14
Ubuntu 22.04

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

xinference, version 1.3.1.post1

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 42131

Reproduction

2025-03-12 09:24:24,862 xinference.core.supervisor 822175 INFO Xinference supervisor 0.0.0.0:18281 started
2025-03-12 09:24:24,891 xinference.core.worker 822175 INFO Starting metrics export server at 0.0.0.0:None
2025-03-12 09:24:24,896 xinference.core.worker 822175 INFO Checking metrics export server...
2025-03-12 09:24:26,820 xinference.core.worker 822175 INFO Metrics server is started at: http://0.0.0.0:40277
2025-03-12 09:24:26,820 xinference.core.worker 822175 INFO Purge cache directory: /feng/code/xinference_home/cache
2025-03-12 09:24:26,821 xinference.core.worker 822175 INFO Connected to supervisor as a fresh worker
2025-03-12 09:24:26,840 xinference.core.worker 822175 INFO Xinference worker 0.0.0.0:18281 started
2025-03-12 09:24:31,639 xinference.api.restful_api 822039 INFO Starting Xinference at endpoint: http://0.0.0.0:42131
2025-03-12 09:24:31,788 uvicorn.error 822039 INFO Uvicorn running on http://0.0.0.0:42131 (Press CTRL+C to quit)
2025-03-12 09:25:46,699 xinference.core.worker 822175 INFO [request fab6af8a-ff23-11ef-abd8-8a8cd53b6db9] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7fa261d32340>, kwargs: model_uid=Qwen2.5-VL-72B-Instruct-AWQ-0,model_name=Qwen2.5-VL-72B-Instruct-AWQ,model_size_in_billions=72,model_format=awq,quantization=INT4,model_engine=vLLM,model_type=LLM,n_gpu=auto,request_limits=None,peft_model_config=None,gpu_idx=[6],download_hub=None,model_path=None,xavier_config=None
2025-03-12 09:25:46,700 xinference.core.worker 822175 INFO You specify to launch the model: Qwen2.5-VL-72B-Instruct-AWQ on GPU index: [6] of the worker: 0.0.0.0:18281, xinference will automatically ignore the n_gpu option.
2025-03-12 09:25:47,390 xinference.model.llm.llm_family 822175 INFO Caching from URI: /root/.cache/huggingface/hub/Qwen2-VL-72B-Instruct-AWQ
INFO 03-12 09:25:52 __init__.py:190] Automatically detected platform cuda.
2025-03-12 09:25:53,322 xinference.core.model 822191 INFO Start requests handler.
2025-03-12 09:25:53,328 xinference.model.llm.vllm.core 822191 INFO Loading Qwen2.5-VL-72B-Instruct-AWQ with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': None, 'guided_decoding_backend': 'outlines', 'scheduling_policy': 'fcfs', 'limit_mm_per_prompt': {'image': 2}}Enable lora: False. Lora count: 0.
2025-03-12 09:25:53,339 transformers.configuration_utils 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/config.json
2025-03-12 09:25:53,342 transformers.configuration_utils 822191 INFO Model config Qwen2VLConfig {
"_name_or_path": "/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b",
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 8192,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 29696,
"max_position_embeddings": 32768,
"max_window_layers": 80,
"model_type": "qwen2_vl",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 8,
"quantization_config": {
"bits": 4,
"group_size": 128,
"modules_to_not_convert": [
"visual"
],
"quant_method": "awq",
"version": "gemm",
"zero_point": true
},
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.49.0",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"hidden_size": 8192,
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 152064
}

INFO 03-12 09:26:00 config.py:542] This model supports multiple tasks: {'reward', 'generate', 'embed', 'score', 'classify'}. Defaulting to 'generate'.
INFO 03-12 09:26:01 awq_marlin.py:111] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 03-12 09:26:01 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b', speculative_config=None, tokenizer='/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
2025-03-12 09:26:01,210 transformers.tokenization_utils_base 822191 INFO loading file vocab.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file merges.txt
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file tokenizer.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file added_tokens.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file special_tokens_map.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file tokenizer_config.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file chat_template.jinja
2025-03-12 09:26:01,479 transformers.tokenization_utils_base 822191 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-03-12 09:26:01,550 transformers.generation.configuration_utils 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/generation_config.json
2025-03-12 09:26:01,551 transformers.generation.configuration_utils 822191 INFO Generate config GenerationConfig {
"chat_format": "chatml",
"do_sample": true,
"eos_token_id": 151643,
"max_new_tokens": 512,
"max_window_size": 6144,
"pad_token_id": 151643,
"top_k": 0,
"top_p": 0.01
}

INFO 03-12 09:26:01 cuda.py:230] Using Flash Attention backend.
INFO 03-12 09:26:02 model_runner.py:1110] Starting to load model /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b...
WARNING 03-12 09:26:02 vision.py:94] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
INFO 03-12 09:26:02 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 9% Completed | 1/11 [00:00<00:06, 1.54it/s]
Loading safetensors checkpoint shards: 18% Completed | 2/11 [00:01<00:05, 1.56it/s]
Loading safetensors checkpoint shards: 27% Completed | 3/11 [00:01<00:05, 1.55it/s]
Loading safetensors checkpoint shards: 36% Completed | 4/11 [00:02<00:04, 1.55it/s]
Loading safetensors checkpoint shards: 45% Completed | 5/11 [00:03<00:03, 1.56it/s]
Loading safetensors checkpoint shards: 55% Completed | 6/11 [00:03<00:03, 1.57it/s]
Loading safetensors checkpoint shards: 64% Completed | 7/11 [00:04<00:02, 1.56it/s]
Loading safetensors checkpoint shards: 73% Completed | 8/11 [00:05<00:01, 1.63it/s]
Loading safetensors checkpoint shards: 82% Completed | 9/11 [00:05<00:01, 1.61it/s]
Loading safetensors checkpoint shards: 91% Completed | 10/11 [00:06<00:00, 1.59it/s]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:06<00:00, 1.57it/s]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:06<00:00, 1.57it/s]

INFO 03-12 09:26:14 model_runner.py:1115] Loading model weights took 40.0960 GB
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file vocab.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file merges.txt
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file tokenizer.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file added_tokens.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file special_tokens_map.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file tokenizer_config.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file chat_template.jinja
2025-03-12 09:26:14,552 transformers.tokenization_utils_base 822191 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-03-12 09:26:14,624 transformers.image_processing_base 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/preprocessor_config.json
2025-03-12 09:26:14,624 transformers.models.auto.image_processing_auto 822191 WARNING Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
2025-03-12 09:26:14,624 transformers.image_processing_base 822191 INFO Image processor Qwen2VLImageProcessor {
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "Qwen2VLImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_pixels": 1003520,
"merge_size": 2,
"min_pixels": 3136,
"patch_size": 14,
"processor_class": "Qwen2VLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"longest_edge": 1003520,
"shortest_edge": 3136
},
"temporal_patch_size": 2,
"vision_token_id": 151654
}

2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file vocab.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file merges.txt
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file tokenizer.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file added_tokens.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file special_tokens_map.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file tokenizer_config.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file chat_template.jinja
2025-03-12 09:26:15,264 transformers.tokenization_utils_base 822191 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-03-12 09:26:15,720 transformers.processing_utils 822191 INFO Processor Qwen2VLProcessor:

  • image_processor: Qwen2VLImageProcessor {
    "do_convert_rgb": true,
    "do_normalize": true,
    "do_rescale": true,
    "do_resize": true,
    "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
    ],
    "image_processor_type": "Qwen2VLImageProcessor",
    "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
    ],
    "max_pixels": 1003520,
    "merge_size": 2,
    "min_pixels": 3136,
    "patch_size": 14,
    "processor_class": "Qwen2VLProcessor",
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": {
    "longest_edge": 1003520,
    "shortest_edge": 3136
    },
    "temporal_patch_size": 2,
    "vision_token_id": 151654
    }

  • tokenizer: CachedQwen2TokenizerFast(name_or_path='/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='left', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
    151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    }
    )

{
"processor_class": "Qwen2VLProcessor"
}

2025-03-12 09:26:16,344 transformers.models.qwen2_vl.image_processing_qwen2_vl 822191 WARNING It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set do_rescale=False to avoid rescaling them again.
INFO 03-12 09:26:37 worker.py:267] Memory profiling takes 23.33 seconds
INFO 03-12 09:26:37 worker.py:267] the current vLLM instance can use total_gpu_memory (79.33GiB) x gpu_memory_utilization (0.90) = 71.39GiB
INFO 03-12 09:26:37 worker.py:267] model weights take 40.10GiB; non_torch_memory takes 0.16GiB; PyTorch activation peak memory takes 10.17GiB; the rest of the memory reserved for KV Cache is 20.96GiB.
INFO 03-12 09:26:37 executor_base.py:110] # CUDA blocks: 4293, # CPU blocks: 819
INFO 03-12 09:26:37 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 2.10x
INFO 03-12 09:26:39 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:19<00:00, 1.79it/s]
INFO 03-12 09:26:59 model_runner.py:1562] Graph capturing finished in 20 secs, took 1.12 GiB
INFO 03-12 09:26:59 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 45.26 seconds
2025-03-12 09:26:59,398 xinference.core.model 822191 INFO ModelActor(Qwen2.5-VL-72B-Instruct-AWQ-0) loaded
2025-03-12 09:26:59,399 xinference.core.worker 822175 INFO [request fab6af8a-ff23-11ef-abd8-8a8cd53b6db9] Leave launch_builtin_model, elapsed time: 72 s
/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/components/chatbot.py:285: UserWarning: You have not specified a value for the type parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
warnings.warn(
2025-03-12 09:29:12,287 transformers.generation.configuration_utils 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/generation_config.json
2025-03-12 09:29:12,288 transformers.generation.configuration_utils 822191 INFO Generate config GenerationConfig {
"chat_format": "chatml",
"do_sample": true,
"eos_token_id": 151643,
"max_new_tokens": 512,
"max_window_size": 6144,
"pad_token_id": 151643,
"top_k": 0,
"top_p": 0.01
}

INFO 03-12 09:29:12 async_llm_engine.py:211] Added request 75337c34-ff24-11ef-b6e2-8a8cd53b6db9.
INFO 03-12 09:29:12 async_llm_engine.py:188] Finished request 75337c34-ff24-11ef-b6e2-8a8cd53b6db9.
2025-03-12 09:29:12,291 xinference.core.model 822191 ERROR [request 753373ba-ff24-11ef-b6e2-8a8cd53b6db9] Leave chat, error: You set image=2 (or defaulted to 1) in --limit-mm-per-prompt, but passed 3 image items in the same prompt., elapsed time: 0 s

When using the Qwen2.5-VL model I hit this error: ValueError: [address=0.0.0.0:43315, pid=822191] You set image=2 (or defaulted to 1) in '--limit-mm-per-prompt', but passed 3 image items in the same prompt. However, I don't see any --limit-mm-per-prompt option among the vLLM parameters that Xinference exposes.

Expected behavior

I should be able to set this vLLM parameter myself, or Xinference should not enforce the check.
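
For reference, the worker log above already shows Xinference forwarding 'limit_mm_per_prompt': {'image': 2} into the vLLM model config, so the limit does reach the engine; the problem is that there is no documented way to raise it at launch time. A minimal sketch of what I would expect to work through the Python client (not verified against this Xinference version; whether the value must be a dict or a JSON string is exactly what this issue is about):

from xinference.client import Client

client = Client("http://0.0.0.0:42131")
model_uid = client.launch_model(
    model_name="Qwen2.5-VL-72B-Instruct-AWQ",
    model_engine="vLLM",
    model_format="awq",
    model_size_in_billions=72,
    quantization="INT4",
    # Hypothetical: extra keyword arguments are passed through to the vLLM
    # engine config, so this value would need to arrive as {"image": 5}
    # (i.e. more than the current limit of 2 images per prompt).
    limit_mm_per_prompt='{"image": 5}',
)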

@XprobeBot XprobeBot added the gpu label Mar 13, 2025
@XprobeBot XprobeBot added this to the v1.x milestone Mar 13, 2025
@liuxingbo12138 (Author)

When I run xinference launch --model-name Qwen2.5-VL-72B-Instruct-AWQ --model-type LLM --model-engine vLLM --model-format awq --size-in-billions 72 --quantization INT4 --n-gpu auto --replica 1 --n-worker 1 --gpu-idx 6 --limit-mm-per-prompt 5, it reports another error: AsyncEngineArgs.__init__() got an unexpected keyword argument 'limit-mm-per-prompt'. So I changed the command to xinference launch --model-name Qwen2.5-VL-72B-Instruct-AWQ --model-type LLM --model-engine vLLM --model-format awq --size-in-billions 72 --quantization INT4 --n-gpu auto --replica 1 --n-worker 1 --gpu-idx 6 --limit_mm_per_prompt 5, but it still doesn't work; the error is:

Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1002, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1190, in launch_builtin_model
    await _launch_model()
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1125, in _launch_model
    subpool_address = await _launch_one_model(
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1083, in _launch_one_model
    subpool_address = await worker_ref.launch_builtin_model(
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 93, in wrapped
    ret = await func(*args, **kwargs)
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/model.py", line 466, in load
    self._model.load()
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 274, in load
    self._model_config = self._sanitize_model_config(self._model_config)
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 916, in _sanitize_model_config
    json.loads(model_config.get("limit_mm_per_prompt"))  # type: ignore
  File "/usr/local/miniconda3/envs/xinference/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: [address=0.0.0.0:36033, pid=853310] the JSON object must be str, bytes or bytearray, not int
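
Judging from the traceback, _sanitize_model_config in xinference/model/llm/vllm/core.py calls json.loads on the limit_mm_per_prompt value, so it apparently expects a JSON string that maps each modality to a count, not a bare integer like 5. A rough sketch of what the sanitizer seems to be doing (my guess from the json.loads call above, not the actual source):

import json

raw = '{"image": 5}'                     # JSON string, not the bare integer 5
limit_mm_per_prompt = json.loads(raw)    # -> {"image": 5}, the dict form vLLM accepts

If that reading is right, passing something like --limit_mm_per_prompt '{"image": 5}' might get past this TypeError, though I haven't confirmed that the CLI forwards the value unchanged.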

@pyaaaa commented Mar 13, 2025

I have the same problem: only two images per prompt are supported, and there is no parameter to allow more.

@pyaaaa commented Mar 13, 2025

I tried --limit-mm-per-prompt image=5, and the model still fails to start.
