[docs] Model cards #11112
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 9cda27b to b31ed8e.
Looks good on first impression, I will review it in depth later today. I wanted to raise #10301 with you as it will help simplify the examples (in combination with #11130 for the quantization cases). Also, it would be cool to have the examples be configurable/update with options. To demonstrate, here's an artist's (4o) impression of what it could look like:
This is a very good start. Left some comments, let me know if they make sense.
| Checkpoint | Recommended inference dtype |
|:---:|:---:|
| [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V) | torch.bfloat16 |
| [`THUDM/CogVideoX-1.5-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-1.5-5b-I2V) | torch.bfloat16 |
[CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.
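To give a concrete picture of the easy copy/paste code this PR is aiming for, here is a minimal sketch of what the card's example could look like (the checkpoint name, prompt, and parameter values are illustrative placeholders, not the card's final example):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the text-to-video checkpoint in the recommended dtype.
pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# Placeholder prompt; 49 frames at 8 fps is the model's typical output length.
video = pipeline(
    prompt="A panda playing a guitar in a bamboo forest",
    num_frames=49,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```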
This is okay, but I would perhaps tackle the removal of the abstract section in a separate PR. Also, this does add the additional overhead of coming up with a description for the paper. I would like to avoid that for now.
I think it'd be good to also tackle this now, since for the new pipeline cards we want to have a nice and complete example of what it should look like, no?
Good point that adding a description of the paper adds additional overhead, but I think it's necessary, since we want to give users a version of the abstract that is more accessible (meaning using common everyday language) versus academic (inspired by @asomoza's comment here).
I am a bit spread thin on this one. So, I will go with what the team prefers.
```
Without torch.compile(): Average inference time: 96.89 seconds.
With torch.compile(): Average inference time: 76.27 seconds.
```
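For context on where numbers like these come from, a hedged sketch of a benchmark that compiles the transformer and averages a few timed runs (the checkpoint, prompt, and run count are assumptions for illustration, not the exact setup used above):

```python
import time

import torch
from diffusers import CogVideoXPipeline

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# Compile the transformer; the first call pays the one-time compilation cost.
pipeline.transformer = torch.compile(pipeline.transformer)

prompt = "A panda playing a guitar in a bamboo forest"
pipeline(prompt=prompt, num_frames=49)  # warmup run triggers compilation

# Time a few runs and report the average.
times = []
for _ in range(3):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipeline(prompt=prompt, num_frames=49)
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)
print(f"With torch.compile(): Average inference time: {sum(times) / len(times):.2f} seconds.")
```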
Do we want to keep this info?
I think we can remove this info since it's really only relevant to users who have that specific machine.
But don't you think it gives a ballpark of the performance improvement in %, which can be important?
You're right, I added it back and included a table with expected memory usage!
## Available models
Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.
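As a rough illustration of the techniques that guide covers, a sketch combining model CPU offloading with VAE tiling and slicing (these are existing diffusers APIs, but which of them the card should showcase is exactly what's being discussed here):

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipeline = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)

# Keep submodels on the CPU and move each to the GPU only while it runs.
pipeline.enable_model_cpu_offload()

# Decode latents in tiles and slices to lower peak VAE memory usage.
pipeline.vae.enable_tiling()
pipeline.vae.enable_slicing()
```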
Should we first make sure this guide is updated with all the latest techniques we have in the library?
Yeah check out #11385!
Thanks @hlky, those PRs look super nice for user experience and I'll update the code examples once they're merged! The configurable example is also really neat; maybe we can make a Space out of it and embed it in the docs? I'll probably have to follow up on this one in a separate PR though 😅
@stevhliu sorry for the delay on my end. The changes look nice and I responded to some of the questions/comments you had. Perhaps after #11130, we could simplify the quantization examples a bit. @a-r-r-o-w do we want to touch any other video models in this PR?
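For reference, this is roughly what the quantization examples look like before any simplification from #11130: the transformer is quantized to 4-bit with bitsandbytes while the rest of the pipeline stays in bf16 (a sketch with typical config values, not the card's final settings):

```python
import torch
from diffusers import BitsAndBytesConfig, CogVideoXPipeline, CogVideoXTransformer3DModel

quant_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
)

# Quantize only the transformer; the text encoder and VAE stay in bf16.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", transformer=transformer, torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
```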
🚧 WIP 🚧

Based on our discussions about making it easier to run video models by including some minimal code optimized for memory and inference speed, this PR refactors the model card (starting with CogVideoX, but eventually expanding to other models as well) to reflect that. This provides users with easy copy/paste code they can run.
Parallel to this effort is to also improve the generic video generation guide.