Adding VLM pipeline #234
Conversation
Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
-SEQ_LEN = 32
-CTX_LEN = 32
+SEQ_LEN = 1024
+CTX_LEN = 1280
Changing this here will affect the causal LM models; specify a separate set of constants for VLM models.
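For example, something along these lines would leave the causal LM defaults untouched (a minimal sketch; the VLM_SEQ_LEN and VLM_CTX_LEN names are hypothetical, not existing constants in the repo):

    # QEfficient/utils/constants.py (sketch)
    SEQ_LEN = 32        # existing causal LM default, unchanged
    CTX_LEN = 32        # existing causal LM default, unchanged

    VLM_SEQ_LEN = 1024  # hypothetical VLM-specific default
    VLM_CTX_LEN = 1280  # hypothetical VLM-specific default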
Since the above seq_len and ctx_len are not generalized values when considering other VLMs, I think keeping them inside the specific function is better.
resolved in new patch
@@ -45,18 +45,18 @@ def get_models_dir():
QEFF_MODELS_DIR = get_models_dir()

ONNX_EXPORT_EXAMPLE_BATCH_SIZE = 1
-ONNX_EXPORT_EXAMPLE_SEQ_LEN = 32
+ONNX_EXPORT_EXAMPLE_SEQ_LEN = 1024
Same here; please verify and make sure the existing CausalLM pipeline is not broken.
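One option is a separate export constant so the CausalLM default stays at 32 (a sketch; ONNX_EXPORT_EXAMPLE_SEQ_LEN_VLM is a hypothetical name):

    # QEfficient/utils/constants.py (sketch)
    ONNX_EXPORT_EXAMPLE_BATCH_SIZE = 1
    ONNX_EXPORT_EXAMPLE_SEQ_LEN = 32         # CausalLM export default, unchanged
    ONNX_EXPORT_EXAMPLE_SEQ_LEN_VLM = 1024   # hypothetical VLM-specific export default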
resolved in new patch
Keep the name test_image_text_to_text_models.
resolved in new patch
# raise TypeError("missing required argument: 'full_batch_size'")

# if kv_cache_batch_size and not full_batch_size:
# raise ValueError(
Avoid any commented-out lines in the code.
resolved in new patch
Please add documentation and an example script.
I observed that code cleanup is yet to be done. Please address all the comments and remove all the commented-out and unnecessary lines. Also update the docstrings accordingly.
from QEfficient import QEFFAutoModelForImageTextToText
from transformers import AutoTokenizer

model_name = "llava"
Update Docstrings.
# warnings.warn(
# "full_batch_size argument is deprecated. Use continuous_batching=True instead.", DeprecationWarning, 2
# )
# breakpoint()
Remove the commented-out code.
from transformers import AutoTokenizer

# Initialize the model using from_pretrained similar to transformers.AutoModelForCausalLM
model_name = "gpt2"
Update here as well.
Verify the PR on
@@ -24,6 +34,8 @@
from QEfficient.transformers.quantizers.auto import QEFF_AUTO_QUANTIZATION_CONFIG_MAPPING, with_replaced_quantizers
from QEfficient.transformers.quantizers.quant_transforms import AwqToMatmulNbitsTransform, GPTQToMatmulNbitsTransform
from QEfficient.utils import constants, get_padding_shape_from_config

# from QEfficient.transformers.models.phi3_vision.modeling_phi3_vision import Phi3VModelWrapper
Please remove unused imports
@@ -421,6 +433,485 @@ def generate(
        raise NotImplementedError("Only AI_100 runtime is supported right now via generate API")


class QEFFAutoModelForImageTextToText(QEFFTransformersBase):
    """
    The QEFF class is designed for manipulating any causal language model from the HuggingFace hub.
Please update the docstring; it currently refers to a causal language model.
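For instance, something along these lines (a wording suggestion only; adjust to the final feature set):

    class QEFFAutoModelForImageTextToText(QEFFTransformersBase):
        """
        The QEFF class is designed for manipulating image-text-to-text (vision-language)
        models from the HuggingFace hub, e.g. Phi-3-vision, mirroring the
        QEFFAutoModelForCausalLM workflow for multimodal checkpoints.
        """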
from transformers import AutoTokenizer

model_name = "llava"
model = QEFFAutoModelForCausalLM.from_pretrained(model_name, num_hidden_layers=2)
Why is it using QEFFAutoModelForCausalLM? Can you update it with the right script?
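A corrected example might look like this (a sketch; it assumes from_pretrained on QEFFAutoModelForImageTextToText mirrors the CausalLM API, and the llava-hf/llava-1.5-7b-hf checkpoint is used purely for illustration):

    from transformers import AutoProcessor
    from QEfficient import QEFFAutoModelForImageTextToText

    model_name = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint
    model = QEFFAutoModelForImageTextToText.from_pretrained(model_name, num_hidden_layers=2)
    processor = AutoProcessor.from_pretrained(model_name)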
# warnings.warn(
# "full_batch_size argument is deprecated. Use continuous_batching=True instead.", DeprecationWarning, 2
# )
# breakpoint()
Please remove the debugging/commented lines
self.continuous_batching = continuous_batching
self.is_tlm = is_tlm
self.pad_token_id = model.config.pad_token_id
self.ctx_len = 1280
Why is ctx_len hardcoded?
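One alternative is to take the value as a parameter instead (a minimal sketch; the class, parameter, and default constant shown here are illustrative, not the actual implementation):

    DEFAULT_VLM_CTX_LEN = 1280  # hypothetical constant, e.g. in QEfficient/utils/constants.py

    class ExampleImageTextToText:
        def __init__(self, model, ctx_len: int = DEFAULT_VLM_CTX_LEN):
            self.pad_token_id = model.config.pad_token_id
            self.ctx_len = ctx_len  # configurable instead of hardcoded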
:ctx_len (int): Maximum context length to compile the model.
:n_layers (int): Number of layers for the Model.
"""
# replace_transformers_quantizers()
Please remove unwanted commented lines
streamer = TextStreamer(processor)
# Testing for Phi-3.5 only atm
inputs = _generate_inputs(model_hf, processor)
breakpoint()
Remove the breakpoint.
num_hidden_layers=n_layer,
_attn_implementation="eager",
trust_remote_code=True,
# Check if this works
Please remove the unwanted commented line from this method.
model_hf, _ = load_vlm_model(model_config)
# Load processor instead
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
# config = model_hf.config
Remove the unwanted commented line.
qeff_model.generate(inputs, streamer, device_ids=[0], runtime_ai100=False)
# cloud_ai_100_tokens = exec_info[0]  # Because we always run for single input and single batch size
# gen_len = ort_tokens.shape[-1]
# assert (
Why are all the asserts commented out? The objective of this method is to test output correctness: native PyTorch model output vs. transformed PyTorch output vs. ORT output vs. AI 100 output. Please add all the tests back.
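For reference, the restored checks could look roughly like this (a sketch based on the commented-out lines; exec_info, ort_tokens, and the generate signature are assumed from the surrounding test code):

    exec_info = qeff_model.generate(inputs, streamer, device_ids=[0])
    cloud_ai_100_tokens = exec_info[0]  # single input, single batch size
    gen_len = ort_tokens.shape[-1]
    assert (ort_tokens == cloud_ai_100_tokens[:, :gen_len]).all(), (
        "Tokens don't match between the ONNXRT output and the Cloud AI 100 output."
    )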
Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
from QEfficient.utils import hf_download


def load_vlm_model(model_config):
Change the name to image_text_to_text.
)
inputs["attention_mask"] = torch.nn.functional.pad(
    inputs["attention_mask"], (0, 1024 - input_ids_size), "constant", 0
)
The hard-coded value 1024 has to be replaced with the seq_len constant.
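For example, reusing the sequence-length constant from QEfficient.utils.constants instead of the literal (a sketch replacing the lines above; it assumes ONNX_EXPORT_EXAMPLE_SEQ_LEN is the intended constant):

    import torch
    from QEfficient.utils import constants

    # Pad the attention mask up to the configured sequence length instead of a hardcoded 1024.
    pad_len = constants.ONNX_EXPORT_EXAMPLE_SEQ_LEN - input_ids_size
    inputs["attention_mask"] = torch.nn.functional.pad(
        inputs["attention_mask"], (0, pad_len), "constant", 0
    )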
Already addressed in #267
We were able to create an automodel for VLMs, specifically for Phi-3-vision.
Supported features are:
TODO:
The transformers CLIP model needs to be set to eager attention manually for test_vlm_model to run.