Can't use '--limit-mm-per-prompt' parameter #3054
When I run with
I have the same problem: only two images are supported, and there is no parameter to allow more.
I tried --limit-mm-per-prompt image=5, but the model still fails to start.
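For reference, in vLLM itself this limit is an ordinary engine argument, so image=5 corresponds roughly to the following at the engine level. This is a minimal sketch; the model path and max_model_len are placeholders, not values taken from this report.

```python
# Sketch: what "--limit-mm-per-prompt image=5" means at the vLLM engine level.
# The model path and max_model_len below are placeholders.
from vllm import LLM

llm = LLM(
    model="/path/to/Qwen2.5-VL-72B-Instruct-AWQ",
    quantization="awq",
    max_model_len=32768,
    limit_mm_per_prompt={"image": 5},  # allow up to 5 images per prompt
)
```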
System Info / 系統信息
CUDA: 12.4
transformers: 4.49.0
llama-cpp-python: 0.3.7
vllm: 0.7.2
Python: 3.10.14
Ubuntu 22.04
Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?
Version info / 版本信息
xinference, version 1.3.1.post1
The command used to start Xinference / 用以启动 xinference 的命令
xinference-local --host 0.0.0.0 --port 42131
Reproduction / 复现过程
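The logs below come from launching Qwen2.5-VL-72B-Instruct-AWQ with the vLLM engine on the server started above. A rough client-side sketch of that launch follows; the launch in this report may actually have been done through the web UI, and the launch_model signature is assumed from the standard Xinference Python client, with parameter values copied from the worker log further down.

```python
# Sketch of the model launch behind the worker log below.
# Assumes the standard Xinference Python client; the launch may instead
# have been performed from the web UI. Parameter values mirror the log.
from xinference.client import Client

client = Client("http://0.0.0.0:42131")
model_uid = client.launch_model(
    model_name="Qwen2.5-VL-72B-Instruct-AWQ",
    model_engine="vLLM",
    model_format="awq",
    model_size_in_billions=72,
    quantization="INT4",
    gpu_idx=[6],
)
```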
2025-03-12 09:24:24,862 xinference.core.supervisor 822175 INFO Xinference supervisor 0.0.0.0:18281 started
2025-03-12 09:24:24,891 xinference.core.worker 822175 INFO Starting metrics export server at 0.0.0.0:None
2025-03-12 09:24:24,896 xinference.core.worker 822175 INFO Checking metrics export server...
2025-03-12 09:24:26,820 xinference.core.worker 822175 INFO Metrics server is started at: http://0.0.0.0:40277
2025-03-12 09:24:26,820 xinference.core.worker 822175 INFO Purge cache directory: /feng/code/xinference_home/cache
2025-03-12 09:24:26,821 xinference.core.worker 822175 INFO Connected to supervisor as a fresh worker
2025-03-12 09:24:26,840 xinference.core.worker 822175 INFO Xinference worker 0.0.0.0:18281 started
2025-03-12 09:24:31,639 xinference.api.restful_api 822039 INFO Starting Xinference at endpoint: http://0.0.0.0:42131
2025-03-12 09:24:31,788 uvicorn.error 822039 INFO Uvicorn running on http://0.0.0.0:42131 (Press CTRL+C to quit)
2025-03-12 09:25:46,699 xinference.core.worker 822175 INFO [request fab6af8a-ff23-11ef-abd8-8a8cd53b6db9] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7fa261d32340>, kwargs: model_uid=Qwen2.5-VL-72B-Instruct-AWQ-0,model_name=Qwen2.5-VL-72B-Instruct-AWQ,model_size_in_billions=72,model_format=awq,quantization=INT4,model_engine=vLLM,model_type=LLM,n_gpu=auto,request_limits=None,peft_model_config=None,gpu_idx=[6],download_hub=None,model_path=None,xavier_config=None
2025-03-12 09:25:46,700 xinference.core.worker 822175 INFO You specify to launch the model: Qwen2.5-VL-72B-Instruct-AWQ on GPU index: [6] of the worker: 0.0.0.0:18281, xinference will automatically ignore the n_gpu option.
2025-03-12 09:25:47,390 xinference.model.llm.llm_family 822175 INFO Caching from URI: /root/.cache/huggingface/hub/Qwen2-VL-72B-Instruct-AWQ
INFO 03-12 09:25:52 init.py:190] Automatically detected platform cuda.
2025-03-12 09:25:53,322 xinference.core.model 822191 INFO Start requests handler.
2025-03-12 09:25:53,328 xinference.model.llm.vllm.core 822191 INFO Loading Qwen2.5-VL-72B-Instruct-AWQ with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': None, 'guided_decoding_backend': 'outlines', 'scheduling_policy': 'fcfs', 'limit_mm_per_prompt': {'image': 2}}
Enable lora: False. Lora count: 0.
2025-03-12 09:25:53,339 transformers.configuration_utils 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/config.json
2025-03-12 09:25:53,339 transformers.configuration_utils 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/config.json
2025-03-12 09:25:53,342 transformers.configuration_utils 822191 INFO Model config Qwen2VLConfig {
"_name_or_path": "/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b",
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 8192,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 29696,
"max_position_embeddings": 32768,
"max_window_layers": 80,
"model_type": "qwen2_vl",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"num_key_value_heads": 8,
"quantization_config": {
"bits": 4,
"group_size": 128,
"modules_to_not_convert": [
"visual"
],
"quant_method": "awq",
"version": "gemm",
"zero_point": true
},
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.49.0",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"hidden_size": 8192,
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 152064
}
INFO 03-12 09:26:00 config.py:542] This model supports multiple tasks: {'reward', 'generate', 'embed', 'score', 'classify'}. Defaulting to 'generate'.
INFO 03-12 09:26:01 awq_marlin.py:111] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 03-12 09:26:01 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b', speculative_config=None, tokenizer='/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
2025-03-12 09:26:01,210 transformers.tokenization_utils_base 822191 INFO loading file vocab.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file merges.txt
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file tokenizer.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file added_tokens.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file special_tokens_map.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file tokenizer_config.json
2025-03-12 09:26:01,211 transformers.tokenization_utils_base 822191 INFO loading file chat_template.jinja
2025-03-12 09:26:01,479 transformers.tokenization_utils_base 822191 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-03-12 09:26:01,550 transformers.generation.configuration_utils 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/generation_config.json
2025-03-12 09:26:01,551 transformers.generation.configuration_utils 822191 INFO Generate config GenerationConfig {
"chat_format": "chatml",
"do_sample": true,
"eos_token_id": 151643,
"max_new_tokens": 512,
"max_window_size": 6144,
"pad_token_id": 151643,
"top_k": 0,
"top_p": 0.01
}
INFO 03-12 09:26:01 cuda.py:230] Using Flash Attention backend.
INFO 03-12 09:26:02 model_runner.py:1110] Starting to load model /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b...
WARNING 03-12 09:26:02 vision.py:94] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
INFO 03-12 09:26:02 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 9% Completed | 1/11 [00:00<00:06, 1.54it/s]
Loading safetensors checkpoint shards: 18% Completed | 2/11 [00:01<00:05, 1.56it/s]
Loading safetensors checkpoint shards: 27% Completed | 3/11 [00:01<00:05, 1.55it/s]
Loading safetensors checkpoint shards: 36% Completed | 4/11 [00:02<00:04, 1.55it/s]
Loading safetensors checkpoint shards: 45% Completed | 5/11 [00:03<00:03, 1.56it/s]
Loading safetensors checkpoint shards: 55% Completed | 6/11 [00:03<00:03, 1.57it/s]
Loading safetensors checkpoint shards: 64% Completed | 7/11 [00:04<00:02, 1.56it/s]
Loading safetensors checkpoint shards: 73% Completed | 8/11 [00:05<00:01, 1.63it/s]
Loading safetensors checkpoint shards: 82% Completed | 9/11 [00:05<00:01, 1.61it/s]
Loading safetensors checkpoint shards: 91% Completed | 10/11 [00:06<00:00, 1.59it/s]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:06<00:00, 1.57it/s]
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:06<00:00, 1.57it/s]
INFO 03-12 09:26:14 model_runner.py:1115] Loading model weights took 40.0960 GB
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file vocab.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file merges.txt
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file tokenizer.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file added_tokens.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file special_tokens_map.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file tokenizer_config.json
2025-03-12 09:26:14,298 transformers.tokenization_utils_base 822191 INFO loading file chat_template.jinja
2025-03-12 09:26:14,552 transformers.tokenization_utils_base 822191 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-03-12 09:26:14,624 transformers.image_processing_base 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/preprocessor_config.json
2025-03-12 09:26:14,624 transformers.models.auto.image_processing_auto 822191 WARNING Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
2025-03-12 09:26:14,624 transformers.image_processing_base 822191 INFO Image processor Qwen2VLImageProcessor {
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "Qwen2VLImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_pixels": 1003520,
"merge_size": 2,
"min_pixels": 3136,
"patch_size": 14,
"processor_class": "Qwen2VLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"longest_edge": 1003520,
"shortest_edge": 3136
},
"temporal_patch_size": 2,
"vision_token_id": 151654
}
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file vocab.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file merges.txt
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file tokenizer.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file added_tokens.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file special_tokens_map.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file tokenizer_config.json
2025-03-12 09:26:14,625 transformers.tokenization_utils_base 822191 INFO loading file chat_template.jinja
2025-03-12 09:26:15,264 transformers.tokenization_utils_base 822191 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2025-03-12 09:26:15,720 transformers.processing_utils 822191 INFO Processor Qwen2VLProcessor:
image_processor: Qwen2VLImageProcessor {
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "Qwen2VLImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_pixels": 1003520,
"merge_size": 2,
"min_pixels": 3136,
"patch_size": 14,
"processor_class": "Qwen2VLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"longest_edge": 1003520,
"shortest_edge": 3136
},
"temporal_patch_size": 2,
"vision_token_id": 151654
}
tokenizer: CachedQwen2TokenizerFast(name_or_path='/feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='left', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
{
"processor_class": "Qwen2VLProcessor"
}
2025-03-12 09:26:16,344 transformers.models.qwen2_vl.image_processing_qwen2_vl 822191 WARNING It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set do_rescale=False to avoid rescaling them again.
INFO 03-12 09:26:37 worker.py:267] Memory profiling takes 23.33 seconds
INFO 03-12 09:26:37 worker.py:267] the current vLLM instance can use total_gpu_memory (79.33GiB) x gpu_memory_utilization (0.90) = 71.39GiB
INFO 03-12 09:26:37 worker.py:267] model weights take 40.10GiB; non_torch_memory takes 0.16GiB; PyTorch activation peak memory takes 10.17GiB; the rest of the memory reserved for KV Cache is 20.96GiB.
INFO 03-12 09:26:37 executor_base.py:110] # CUDA blocks: 4293, # CPU blocks: 819
INFO 03-12 09:26:37 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 2.10x
INFO 03-12 09:26:39 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:19<00:00, 1.79it/s]
INFO 03-12 09:26:59 model_runner.py:1562] Graph capturing finished in 20 secs, took 1.12 GiB
INFO 03-12 09:26:59 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 45.26 seconds
2025-03-12 09:26:59,398 xinference.core.model 822191 INFO ModelActor(Qwen2.5-VL-72B-Instruct-AWQ-0) loaded
2025-03-12 09:26:59,399 xinference.core.worker 822175 INFO [request fab6af8a-ff23-11ef-abd8-8a8cd53b6db9] Leave launch_builtin_model, elapsed time: 72 s
/usr/local/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/components/chatbot.py:285: UserWarning: You have not specified a value for the type parameter. Defaulting to the 'tuples' format for chatbot messages, but this is deprecated and will be removed in a future version of Gradio. Please set type='messages' instead, which uses openai-style dictionaries with 'role' and 'content' keys.
  warnings.warn(
2025-03-12 09:29:12,287 transformers.generation.configuration_utils 822191 INFO loading configuration file /feng/code/xinference_home/cache/Qwen2.5-VL-72B-Instruct-AWQ-awq-72b/generation_config.json
2025-03-12 09:29:12,288 transformers.generation.configuration_utils 822191 INFO Generate config GenerationConfig {
"chat_format": "chatml",
"do_sample": true,
"eos_token_id": 151643,
"max_new_tokens": 512,
"max_window_size": 6144,
"pad_token_id": 151643,
"top_k": 0,
"top_p": 0.01
}
INFO 03-12 09:29:12 async_llm_engine.py:211] Added request 75337c34-ff24-11ef-b6e2-8a8cd53b6db9.
INFO 03-12 09:29:12 async_llm_engine.py:188] Finished request 75337c34-ff24-11ef-b6e2-8a8cd53b6db9.
2025-03-12 09:29:12,291 xinference.core.model 822191 ERROR [request 753373ba-ff24-11ef-b6e2-8a8cd53b6db9] Leave chat, error: You set image=2 (or defaulted to 1) in '--limit-mm-per-prompt', but passed 3 image items in the same prompt., elapsed time: 0 s

When I was using the qwen2.5-vl model, I encountered this error: ValueError: [address=0.0.0.0:43315, pid=822191] You set image=2 (or defaulted to 1) in '--limit-mm-per-prompt', but passed 3 image items in the same prompt. However, I don't see any --limit-mm-per-prompt option among the vLLM parameters that Xinference lets me set.
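For context, the limit is enforced per request: the error appears as soon as a single chat message carries three image items while the engine was configured with {'image': 2}. A request shaped roughly like the following reproduces it through the OpenAI-compatible endpoint; the base URL, model name, and image URLs here are placeholders, not taken from this report.

```python
# Sketch of a chat request with 3 images that trips the image=2 limit.
# base_url, model name, and image URLs are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:42131/v1", api_key="not-used")
resp = client.chat.completions.create(
    model="Qwen2.5-VL-72B-Instruct-AWQ",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these three images."},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/c.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```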
Expected behavior / 期待表现
I should be able to set vLLM engine parameters such as limit_mm_per_prompt myself when launching the model, rather than having Xinference fix the value and reject requests with more images.
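What this asks for is, roughly, something like the sketch below, where the limit is supplied at launch time and forwarded to vLLM's engine arguments. Whether an extra keyword like this is actually passed through (and in what format) is exactly what this issue is about, so treat it as the desired behavior rather than something that works in the reported version.

```python
# Desired behavior (sketch): raise the vLLM multimodal limit at launch time.
# In the reported version the engine config appears fixed at {'image': 2}.
from xinference.client import Client

client = Client("http://0.0.0.0:42131")
client.launch_model(
    model_name="Qwen2.5-VL-72B-Instruct-AWQ",
    model_engine="vLLM",
    model_format="awq",
    model_size_in_billions=72,
    quantization="INT4",
    limit_mm_per_prompt={"image": 5},  # hoped to reach vLLM's EngineArgs
)
```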