
Adding VLM pipeline #234


Closed

wants to merge 2 commits into from

Conversation

Contributor

qcdipankar commented Jan 21, 2025

We were able to create an AutoModel class for VLMs, specifically for phi3-vision.

Supported features:

  1. ONNX export, both for a single layer and for the full model
  2. generate
  3. compile
  4. Inference on PyTorch and on AI 100 produces correct outputs

TODO
The attention implementation of the CLIP transformer model needs to be set to eager manually for test_vlm_model to run; a minimal sketch of that workaround follows.
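For reference, a minimal sketch of that manual workaround, assuming the Hugging Face phi3-vision checkpoint name below (the flags mirror the ones used in the test code later in this PR):

```python
from transformers import AutoModelForCausalLM

# Assumed checkpoint name for phi3-vision; adjust to the model under test.
model_name = "microsoft/Phi-3-vision-128k-instruct"

# Force eager attention so the CLIP vision tower does not pick SDPA/flash attention,
# which test_vlm_model currently cannot handle.
model_hf = AutoModelForCausalLM.from_pretrained(
    model_name,
    _attn_implementation="eager",
    trust_remote_code=True,
)
```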

Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
-SEQ_LEN = 32
-CTX_LEN = 32
+SEQ_LEN = 1024
+CTX_LEN = 1280
Contributor

Changing this here will affect the causal_lm models; please define a separate set of constants for the VLM models.
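A sketch of what this comment is asking for, with the causal LM defaults left untouched and new VLM-specific names added alongside them (the module path and the VLM_* names are illustrative, not necessarily what the repo ends up using):

```python
# QEfficient/utils/constants.py (sketch)

# Existing causal LM defaults stay as they are.
SEQ_LEN = 32
CTX_LEN = 32

# VLM-specific defaults live under their own names, so changing them
# cannot affect the causal LM export and compile paths.
VLM_SEQ_LEN = 1024
VLM_CTX_LEN = 1280
```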

Contributor

Since the above seq_len/ctx_len values are not generalized across other VLMs, I think it is better to keep them inside the model-specific function.

Contributor Author

resolved in new patch

@@ -45,18 +45,18 @@ def get_models_dir():
QEFF_MODELS_DIR = get_models_dir()

ONNX_EXPORT_EXAMPLE_BATCH_SIZE = 1
-ONNX_EXPORT_EXAMPLE_SEQ_LEN = 32
+ONNX_EXPORT_EXAMPLE_SEQ_LEN = 1024
Contributor

Same here; please verify and make sure the existing causal LM pipeline is not broken.

Contributor Author

resolved in new patch

Contributor

Keep name test_image_text_to_text_models

Contributor Author

resolved in new patch

# raise TypeError("missing required argument: 'full_batch_size'")

# if kv_cache_batch_size and not full_batch_size:
# raise ValueError(
Contributor

Avoid leaving any commented-out lines in the code.

Contributor Author

resolved in new patch

@quic-rishinr
Contributor

Please add documentation and an example script.
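Roughly the kind of example script being requested: a sketch assuming the QEFFAutoModelForImageTextToText class added in this PR, a phi3-vision checkpoint, and the generate signature visible in the test diff (the compile arguments and image URL are illustrative):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, TextStreamer

from QEfficient import QEFFAutoModelForImageTextToText

# Assumed checkpoint; any image-text-to-text model supported by this PR should work similarly.
model_name = "microsoft/Phi-3-vision-128k-instruct"

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = QEFFAutoModelForImageTextToText.from_pretrained(model_name, trust_remote_code=True)

# Build multimodal inputs: one image plus a text prompt containing the image placeholder.
image = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
inputs = processor(text="<|image_1|>\nDescribe this image.", images=image, return_tensors="pt")

# Export to ONNX, compile for Cloud AI 100, then stream generated tokens.
model.compile(num_cores=14)
streamer = TextStreamer(processor.tokenizer)
model.generate(inputs, streamer, device_ids=[0])
```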

quic-amitraj marked this pull request as draft January 23, 2025 10:45
quic-amitraj self-requested a review January 23, 2025 17:39
Contributor

quic-amitraj left a comment

I see that code cleanup is still pending. Please address all the comments and remove all commented-out and unnecessary lines. Also update the docstrings accordingly.

from QEfficient import QEFFAutoModelForImageTextToText
from transformers import AutoTokenizer

model_name = "llava"
Contributor

Update Docstrings.

# warnings.warn(
# "full_batch_size argument is deprecated. Use continuous_batching=True instead.", DeprecationWarning, 2
# )
# breakpoint()
Contributor

Remove the commented-out code.

from transformers import AutoTokenizer

# Initialize the model using from_pretrained similar to transformers.AutoModelForCausalLM
model_name = "gpt2"
Contributor

Update here as well.

@quic-akuruvil
Contributor

Please verify the PR against transformers version 4.46.0.

@@ -24,6 +34,8 @@
from QEfficient.transformers.quantizers.auto import QEFF_AUTO_QUANTIZATION_CONFIG_MAPPING, with_replaced_quantizers
from QEfficient.transformers.quantizers.quant_transforms import AwqToMatmulNbitsTransform, GPTQToMatmulNbitsTransform
from QEfficient.utils import constants, get_padding_shape_from_config

# from QEfficient.transformers.models.phi3_vision.modeling_phi3_vision import Phi3VModelWrapper
Contributor

Please remove unused imports

@@ -421,6 +433,485 @@ def generate(
raise NotImplementedError("Only AI_100 runtime is supported right now via generate API")


class QEFFAutoModelForImageTextToText(QEFFTransformersBase):
"""
The QEFF class is designed for manipulating any causal language model from the HuggingFace hub.
Contributor

Please update the docstring; it still refers to a causal language model.

from transformers import AutoTokenizer

model_name = "llava"
model = QEFFAutoModelForCausalLM.from_pretrained(model_name, num_hidden_layers=2)
Contributor

Why is this using QEFFAutoModelForCausalLM? Can you update it with the right example?
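Presumably the docstring example should look more like this (a sketch; the llava checkpoint name and the processor usage are assumptions):

```python
from transformers import AutoProcessor

from QEfficient import QEFFAutoModelForImageTextToText

# Use the image-text-to-text class introduced in this PR, not QEFFAutoModelForCausalLM.
model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = QEFFAutoModelForImageTextToText.from_pretrained(model_name, num_hidden_layers=2)
```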

# warnings.warn(
# "full_batch_size argument is deprecated. Use continuous_batching=True instead.", DeprecationWarning, 2
# )
# breakpoint()
Contributor

Please remove the debugging/commented lines

self.continuous_batching = continuous_batching
self.is_tlm = is_tlm
self.pad_token_id = model.config.pad_token_id
self.ctx_len = 1280
Contributor

Why is ctx_len hardcoded?
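One way to avoid the hardcoded value, sketched under the assumption that ctx_len can simply be threaded through the constructor (the parameter name and the surrounding signature are illustrative):

```python
def __init__(self, model, continuous_batching=False, is_tlm=False, ctx_len=None, **kwargs):
    self.continuous_batching = continuous_batching
    self.is_tlm = is_tlm
    self.pad_token_id = model.config.pad_token_id
    # Take ctx_len from the caller (e.g. forwarded from compile()) instead of pinning it to 1280.
    self.ctx_len = ctx_len
```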

:ctx_len (int): Maximum context length to compile the model.
:n_layers (int): Number of layers for the Model.
"""
# replace_transformers_quantizers()
Contributor

Please remove unwanted commented lines

streamer = TextStreamer(processor)
# Testing for Phi-3.5 only atm
inputs = _generate_inputs(model_hf, processor)
breakpoint()
Contributor

Remove the breakpoint.

num_hidden_layers=n_layer,
_attn_implementation="eager",
trust_remote_code=True,
# Check if this works
Contributor

please remove unwanted commented line from this method

model_hf, _ = load_vlm_model(model_config)
# Load processor instead
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
# config = model_hf.config
Contributor

remove unwanted commented line

qeff_model.generate(inputs, streamer, device_ids=[0], runtime_ai100=False)
# cloud_ai_100_tokens = exec_info[0] # Because we always run for single input and single batch size
# gen_len = ort_tokens.shape[-1]
# assert (
Contributor

Why are all the asserts commented out? The objective of this method is to test output correctness: native PyTorch output vs transformed PyTorch output vs ORT output vs AI 100 output. Please add all of those checks back.
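Roughly the check that is expected back, sketched against the variable names already visible in the diff (the exec_info indexing and the exact comparison follow the pattern used in the causal LM tests, and are assumptions here):

```python
# Compare Cloud AI 100 output against the ONNX Runtime reference, token by token.
cloud_ai_100_tokens = exec_info[0]  # single input, single batch size
gen_len = ort_tokens.shape[-1]
assert (ort_tokens == cloud_ai_100_tokens[:, :gen_len]).all(), (
    "Tokens from ONNX Runtime and Cloud AI 100 do not match"
)
```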

Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
from QEfficient.utils import hf_download


def load_vlm_model(model_config):
Contributor

Change name to image_text_to_text

)
inputs["attention_mask"] = torch.nn.functional.pad(
inputs["attention_mask"], (0, 1024 - input_ids_size), "constant", 0
)
Contributor

The hardcoded 1024 values have to be replaced with constant.seq_len.
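A sketch of the suggested change, assuming the seq_len value should come from the constants module imported elsewhere in this PR (the exact constant name is an assumption):

```python
import torch

from QEfficient.utils import constants

# Pad the attention mask up to the configured sequence length instead of a hardcoded 1024.
pad_len = constants.ONNX_EXPORT_EXAMPLE_SEQ_LEN - input_ids_size
inputs["attention_mask"] = torch.nn.functional.pad(
    inputs["attention_mask"], (0, pad_len), "constant", 0
)
```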

@quic-amitraj
Contributor

Already addressed in #267
