[NVPTX] Make i16x2 a native type and add supported vec instructions #65432
Conversation
Force-pushed from 6595701 to de49dce.
Thank you for the patch.
It looks fine in general. One thing I'd suggest is to extract generic tablegen changes (adding specific parameter types) into a separate patch so they do not clutter the v2i16 changes.
…porting it On sm_90 some instructions now support i16x2, which allows the hardware to execute add, min, and max more efficiently. To support that, i16x2 needs to be a native type in the backend. This makes the necessary changes to turn i16x2 into a native type and adds support for the instructions that natively support i16x2. This caused a negative NVPTX SLP test to start passing; the test was changed to a positive one, as the IR is now correctly vectorized.
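As a hedged illustration of what the new packed instructions compute (a plain-Python model of SIMD-within-a-register semantics, not the backend code; all helper names below are made up for this sketch), the i16x2 add/min/max operate lane-wise on two signed 16-bit values held in one 32-bit register:

```python
# Model of lane-wise i16x2 arithmetic: two signed 16-bit lanes packed into
# one 32-bit word, in the spirit of the packed add/min/max on sm_90.
# Illustrative sketch only; helper names are hypothetical.

def pack_i16x2(lo, hi):
    """Pack two signed 16-bit values into a 32-bit word (wrapping)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def unpack_i16x2(word):
    """Unpack a 32-bit word back into two signed 16-bit lanes."""
    def to_signed(v):
        return v - 0x10000 if v & 0x8000 else v
    return to_signed(word & 0xFFFF), to_signed((word >> 16) & 0xFFFF)

def lanewise(op, a, b):
    """Apply a binary op independently to each 16-bit lane."""
    (a0, a1), (b0, b1) = unpack_i16x2(a), unpack_i16x2(b)
    return pack_i16x2(op(a0, b0), op(a1, b1))

x = pack_i16x2(100, -5)
y = pack_i16x2(-30, 7)
print(unpack_i16x2(lanewise(lambda p, q: p + q, x, y)))  # (70, 2)
print(unpack_i16x2(lanewise(min, x, y)))                 # (-30, -5)
print(unpack_i16x2(lanewise(max, x, y)))                 # (100, 7)
```

The point of making i16x2 native is that LLVM can keep a `<2 x i16>` value in one 32-bit register and select one packed instruction per operation instead of scalarizing into two 16-bit ops.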
Force-pushed from de49dce to 1a4acc7.
Thanks, I rebased on top of the tablegen changes and addressed the comments.
llvm/test/Transforms/SLPVectorizer/NVPTX/non-vectorizable-intrinsic-inseltpoison.ll
LGTM. Thank you!
Looks like we've missed lowering of bitcasts between v2f16 and v2i16, and it breaks XLA.
Oops, I can send a patch today unless someone else already has a fix.
Thank you @Artem-B for the breakage analysis! @ThomasRaoux I hope you don't mind, I'm going to revert this change to unbreak XLA.
Can we fix it forward? I'll prepare a patch.
We'll probably also need similar bitcast lowering for v2bf16. I suspect that now that v2i16 is available, LLVM will start picking it for various loads/stores, and bf16 will likely end up using that path too. Not sure if we'll need bitcasts between v2f16 and v2bf16 in practice, but it would not hurt to add them just in case.
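For intuition about why the missing lowering matters: a bitcast between v2f16 and v2i16 is a pure reinterpretation of the same 32 bits, so the backend only has to move (or forward) a register. A hedged sketch of that reinterpretation using Python's `struct` module (a model of the semantics, not the actual lowering fix; function names are invented here):

```python
import struct

def bitcast_v2f16_to_v2i16(h0, h1):
    """Reinterpret two IEEE half-precision floats as two unsigned
    16-bit lanes. Models the no-op bitcast; names are hypothetical."""
    raw = struct.pack('<2e', h0, h1)   # 4 bytes: two little-endian halfs
    return struct.unpack('<2H', raw)   # the same 4 bytes read as two u16

def bitcast_v2i16_to_v2f16(i0, i1):
    """The inverse reinterpretation."""
    raw = struct.pack('<2H', i0, i1)
    return struct.unpack('<2e', raw)

print(bitcast_v2f16_to_v2i16(1.0, 2.0))  # (0x3C00, 0x4000)
```

Because no bits change, a trivial register-forwarding pattern (e.g. the proxyreg idea mentioned below in this thread) is enough once the type is legal; the breakage came from the pattern simply not existing for the new native v2i16 type.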
Sure, let me add those.
The fix itself should be simple (a tablegen pattern converting the bitcast to a proxyreg?), but we'll also need tests. Should be doable, but I don't know @gribozavr's time constraints.
In any case, it probably does not matter much one way or another -- it's just one more cherry-picked revision to include in a new pull request.
Looking at it, we do have some tests for bitcast 2xhalf -> 2xi16:
Hmm. Why did they work then? Loads/stores/bitcasts for v2i16 should not depend on whether the target is sm_90 or not. Or maybe they did, if we set the lowering action to Expand on older GPUs. You may try running the tests with …
I tried with …
Another theory is that the bitcast in the tests didn't actually make it to the lowering, and we only had to deal with loads/stores.
We may need something more elaborate, like manually constructing v2i16 from an input i16.
The HEAD has been broken for us for more than a day, and we need to get to green ASAP. I pushed the revert.
Aha. This reproduces the problem: https://godbolt.org/z/K1c4PqYoP
Thanks for the repro. I'm going to land it again with the fix (2bd3ce7).
…lvm#65432) On sm_90 some instructions now support i16x2, which allows the hardware to execute add, min, and max more efficiently. To support that, i16x2 needs to be a native type in the backend. This makes the necessary changes to turn i16x2 into a native type and adds support for the instructions that natively support i16x2. This caused a negative NVPTX SLP test to start passing; the test was changed to a positive one, as the IR is now correctly vectorized.
This causes a nearly 2x regression in many Triton kernels on Ampere; we'll post a reproducer.
…lvm#65799) recommit llvm#65432 with minor bug fix for bitcasts
On sm_90 some instructions now support i16x2, which allows the hardware to execute add, min, and max more efficiently.
To support that, i16x2 needs to be a native type in the backend. This makes the necessary changes to turn i16x2 into a native type and adds support for the instructions that natively support i16x2.
This caused a negative NVPTX SLP test to start passing; the test was changed to a positive one, as the IR is now correctly vectorized.