[docs] Memory optims #11385
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
AutoencoderKLWan and AsymmetricAutoencoderKL do not support tiling or slicing (the asymmetric one just has unused flags). This should most likely be mentioned.
Thanks for the initiative! I left some minor comments, let me know if they make sense.
Thanks for the feedback on the memory doc! I also updated the inference speed doc, so please feel free to check it out and leave some feedback ❤️
Thank you for the further updates. Left some further comments. LMK if they make sense.
docs/source/en/optimization/fp16.md
Outdated
prompt = "slice of delicious New York-style cheesecake topped with berries, mint, chocolate crumble"
pipeline(prompt, num_inference_steps=50).images[0]
I think we should also have an advanced guide on accelerating inference that makes use of other techniques like SAGE, multi-GPU inference (context-parallel), etc. @a-r-r-o-w WDYT?
Happy to add in a separate PR!
SGTM. Not sure how we plan to show context parallel + Sage, because the implementation is based on a PR in finetrainers. Do we want to showcase that in the docs?
Or do you want me to take just the relevant parts and create a minimal attention processor to showcase the example? I can send over a rough writeup to Steven for whichever we choose to do.
Note that the latter will only yield a small speedup compared to the current finetrainers implementation. That implementation shards the sequence dim early on instead of just for the attention, which is not possible to showcase in a diffusers-only example: it either involves rewriting the forward method OR adds the overhead of explaining model hooks to the reader and making that part of the example.
Hmm got it. Sorry for being unclear.
If it's not too much, would it be possible to send over a write-up covering both cases? We can then brainstorm further how best to disseminate the information to readers. But really no rush here. Do it whenever you get time.
Looks crisp!
docs/source/en/optimization/fp16.md
Outdated
</div>
## Distilled models

Another option for accelerating inference is to use a smaller distilled model if it's available. During distillation, many of the UNet's residual and attention blocks are discarded to reduce model size and improve latency. A distilled model is faster and uses less memory without compromising quality compared to a full-sized model.
@stevhliu I thought we were removing these sections?
My bad, thought I already had! Should be gone now :)
The `offload_type` parameter can be set to `block_level` or `leaf_level`.

- `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (2o total onloads/offloads). This drastically reduces memory requirements.
- `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (2o total onloads/offloads). This drastically reduces memory requirements.
- `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (20 total onloads/offloads). This drastically reduces memory requirements.
Lol good eyes! 👀
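For reference, the count of 20 in the suggested wording can be checked with a small sketch. This is only illustrative arithmetic, not the diffusers implementation: `count_block_groups` and `group_layers` are hypothetical helpers, and the assumption that a leftover layer forms its own smaller group is mine, not something stated in the docs under review.

```python
import math


def count_block_groups(num_layers: int, num_blocks_per_group: int) -> int:
    """Hypothetical helper: number of groups when layers are grouped in fixed-size chunks."""
    if num_layers < 0 or num_blocks_per_group < 1:
        raise ValueError("num_layers must be >= 0 and num_blocks_per_group must be >= 1")
    # Assumption: a trailing partial chunk counts as one more group.
    return math.ceil(num_layers / num_blocks_per_group)


def group_layers(layers: list, num_blocks_per_group: int) -> list:
    """Hypothetical helper: split a list of layers into consecutive chunks."""
    return [
        layers[i : i + num_blocks_per_group]
        for i in range(0, len(layers), num_blocks_per_group)
    ]


# The example from the docs text: 40 layers, 2 layers per group.
layers = [f"block_{i}" for i in range(40)]
groups = group_layers(layers, num_blocks_per_group=2)
print(len(groups), count_block_groups(40, 2))
```

Each group here holds 2 layers, so 40 layers give 20 groups, which matches the corrected "20 total onloads/offloads".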
Thanks, my earlier comments have been addressed, so LGTM apart from Sayak's remaining comments!
Refactors the memory optimization docs and combines them with the docs on working with big models (distributed setups).
Let me know if I'm missing anything!