VLM Pipeline (Intern,LLava) #256
Conversation
Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
        return generate_func(**kwargs)

    def generate_inputs_intern(self, **kwargs):
        bs: int = constants.ONNX_EXPORT_EXAMPLE_BATCH_SIZE
Please move this inside the modeling file.
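For illustration, a minimal sketch of what colocating the input generation with the modeling code could look like. The helper name, the seq_len/img_size defaults, and the import path are assumptions for this sketch, not the repo's actual API; only ONNX_EXPORT_EXAMPLE_BATCH_SIZE comes from the diff above.

# Sketch only: would live in the InternVL modeling file rather than the pipeline class.
import torch
from QEfficient.utils import constants  # assumed import path

def get_intern_dummy_inputs(img_size: int = 448, seq_len: int = 1024):
    """Build example inputs and export metadata next to the model definition."""
    bs = constants.ONNX_EXPORT_EXAMPLE_BATCH_SIZE
    inputs = {
        "input_ids": torch.zeros((bs, seq_len), dtype=torch.int64),
        "pixel_values": torch.zeros((bs, 3, img_size, img_size), dtype=torch.float32),
    }
    dynamic_axes = {
        "input_ids": {0: "batch_size", 1: "seq_len"},
        "pixel_values": {0: "batch_size"},
    }
    output_names = ["logits"]
    return inputs, dynamic_axes, output_names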
        return inputs, dynamic_axes, output_names

    def generate_inputs_llava(self, **kwargs):
        bs: int = constants.ONNX_EXPORT_EXAMPLE_BATCH_SIZE
Move this inside the modeling file, modelling_llava.py.
# )
# num_logits_to_keep = num_speculative_tokens + 1
# if prefill_seq_len < num_logits_to_keep:
#     raise ValueError(
Remove commented lines.
generation_len = self.ctx_len - input_len.max()  # in standalone this is a tensor
assert generation_len > 0, "generation length should be greater than zero"
generated_ids = np.full((batch_size, generation_len + 1), self.processor.tokenizer.pad_token_id)
# inputs["input_ids"] = torch.nn.functional.pad(inputs["input_ids"], (0, self.seq_len_constant - inputs["input_ids"].size(1)), "constant", self.pad_token_id)
Remove.
PROMPT_LEN = 8
INPUT_STR = ["My name is"]
GB = 2**30
MAX_QPC_LIMIT = 30
MAX_RETRIES = 5  # Maximum number of retry attempts when downloading a model via huggingface_hub snapshot_download
NUM_SPECULATIVE_TOKENS = 2
CTX_LEN_VLM_LLAVA = 1280
IMG_SIZE = 336
Are you using these at the time of export to define shapes?
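If they are meant to define export shapes, a hedged sketch of how they could size the export-time example tensors; the helper name and its defaults are illustrative assumptions, not existing code.

import torch

# Illustrative only: constants like CTX_LEN_VLM_LLAVA / IMG_SIZE feed the example shapes.
def llava_export_example_inputs(batch_size=1, ctx_len=1280, img_size=336):
    return {
        "input_ids": torch.zeros((batch_size, ctx_len), dtype=torch.int64),
        "attention_mask": torch.ones((batch_size, ctx_len), dtype=torch.int64),
        "pixel_values": torch.zeros((batch_size, 3, img_size, img_size), dtype=torch.float32),
    }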
# if __name__ == "__main__":
#     # model_name = "OpenGVLab/InternVL2_5-1B"
Remove commented parts.
@@ -251,6 +252,7 @@ def _compile(
        if num_speculative_tokens:
            compile_hash.update(to_hashable({"num_speculative_tokens": num_speculative_tokens}))

        # import ipdb; ipdb.set_trace()
Remove these lines.
            if hasattr(module, "__qeff_init__"):
                module.__qeff_init__()
            transformed = True
Can we combine both if conditions?
        input_ids_size = input_ids.shape[1]
        # attention_mask = inputs["attention_mask"]
        inputs["input_ids"] = torch.nn.functional.pad(
            inputs["input_ids"], (0, 3072 - input_ids_size), "constant", self.processor.tokenizer.pad_token_id
Please avoid hardcoded value.
Make this value generic and fetch it from the QPC session (prefill_seq_len), i.e. whichever value the model was compiled for.
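A minimal sketch of fetching the compiled prefill length from the QPC session instead of the hardcoded 3072; the attribute names (allowed_shapes, binding_index_map, bindings) follow the existing text-generation session helpers and should be treated as assumptions here.

def fetch_prefill_seq_len(session):
    """Return the prefill sequence length the QPC was compiled for."""
    idx = session.binding_index_map["input_ids"]
    if session.allowed_shapes:
        # Multiple specializations: take the largest allowed seq_len for input_ids.
        return max(shape[idx][1][1] for shape in session.allowed_shapes)
    # Single specialization: read the dimension straight from the binding.
    return session.bindings[idx].dims[1]

# Usage: pad to the compiled length rather than a literal 3072.
# prefill_seq_len = fetch_prefill_seq_len(self._session)
# inputs["input_ids"] = torch.nn.functional.pad(
#     inputs["input_ids"], (0, prefill_seq_len - input_ids_size),
#     "constant", self.processor.tokenizer.pad_token_id,
# )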
        breakpoint()
        self.model.config.use_cache = True
        self.processor = processor
        self.num_layers = model.config.text_config.num_hidden_layers
Make fetching num_layers generic, and the padding shape as well. Please refer to the llava PR and use a similar helper that fetches these based on the model architecture.
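A hedged sketch of the kind of architecture-aware helper being suggested; the sub-config names probed below are common Hugging Face conventions and may not cover every VLM. A similar probe could return the hidden size or number of KV heads needed to build the past-KV padding shape generically.

def get_num_hidden_layers(config):
    """Fetch num_hidden_layers from whichever sub-config the architecture nests it in."""
    for attr in ("text_config", "llm_config", "language_config"):
        sub_config = getattr(config, attr, None)
        if sub_config is not None and hasattr(sub_config, "num_hidden_layers"):
            return sub_config.num_hidden_layers
    # Plain decoder-only models keep it on the top-level config.
    return config.num_hidden_layers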
Already addressed in #267.
Added Generic Framework to onboard and run VLMs in QEff