[NVPTX] Make i16x2 a native type and add supported vec instructions #65432


Merged
merged 11 commits into llvm:main from nvptx_i16x2_2 on Sep 7, 2023

Conversation

@ThomasRaoux (Contributor)

On sm_90, some instructions now support i16x2, which allows the hardware to execute add, min, and max instructions more efficiently.

To support that, we need to make i16x2 a native type in the backend. This patch makes the necessary changes to treat i16x2 as a native type and adds support for the instructions that natively support i16x2.

This caused a negative test in the NVPTX SLP vectorizer tests to start passing. I changed the test to a positive one, as the IR is now correctly vectorized.
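(As an illustration of what this enables: with v2i16 as a native type, IR like the following hypothetical example, which is not a test from the patch, can be selected as a single packed instruction on sm_90 instead of being scalarized into two 16-bit adds.)

; Hypothetical example: a packed 16x2 add that the backend can now
; select to a single sm_90 instruction rather than two scalar adds.
define <2 x i16> @add_v2i16(<2 x i16> %a, <2 x i16> %b) {
  %r = add <2 x i16> %a, %b
  ret <2 x i16> %r
}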

@ThomasRaoux force-pushed the nvptx_i16x2_2 branch 2 times, most recently from 6595701 to de49dce on September 6, 2023 at 18:38
@Artem-B (Member) left a comment

Thank you for the patch.
It looks fine in general. One thing I'd suggest is to extract generic tablegen changes (adding specific parameter types) into a separate patch so they do not clutter the v2i16 changes.

…porting it

On sm_90, some instructions now support i16x2, which allows the hardware
to execute add, min, and max instructions more efficiently.

To support that, we need to make i16x2 a native type in the backend.
This makes the necessary changes to treat i16x2 as a native type and
adds support for the instructions that natively support i16x2.

This caused a negative test in the NVPTX SLP vectorizer tests to start
passing. Changed the test to a positive one, as the IR is now correctly
vectorized.
@ThomasRaoux (Contributor, author) left a comment

Thanks, I rebased on top of the tablegen changes and addressed the comments.

@ThomasRaoux ThomasRaoux requested a review from Artem-B September 6, 2023 21:27
@ThomasRaoux ThomasRaoux requested a review from Artem-B September 6, 2023 22:30
@ThomasRaoux ThomasRaoux requested a review from Artem-B September 6, 2023 23:11
@Artem-B (Member) left a comment

LGTM. Thank you!

@ThomasRaoux ThomasRaoux merged commit db5d845 into llvm:main Sep 7, 2023
@Artem-B (Member) commented Sep 8, 2023

Looks like we've missed lowering of bitcasts between v2f16 and v2i16 and it breaks XLA.

LLVM ERROR: Cannot select: t119: v2f16 = bitcast t118
  t118: v2i16 = or t375, t401
    t375: v2i16 = BUILD_VECTOR t374, t372
      t374: i16 = select t247, Constant:i16<8960>, t360

@ThomasRaoux (Contributor, author)

> Looks like we've missed lowering of bitcasts between v2f16 and v2i16 and it breaks XLA.
>
> LLVM ERROR: Cannot select: t119: v2f16 = bitcast t118
>   t118: v2i16 = or t375, t401
>     t375: v2i16 = BUILD_VECTOR t374, t372
>       t374: i16 = select t247, Constant:i16<8960>, t360

Oops, I can send a patch today unless someone else already has a fix.

@gribozavr (Collaborator)

Thank you @Artem-B for the breakage analysis!

@ThomasRaoux I hope you don't mind, I'm going to revert this change to unbreak XLA.

@ThomasRaoux (Contributor, author)

> Thank you @Artem-B for the breakage analysis!
>
> @ThomasRaoux I hope you don't mind, I'm going to revert this change to unbreak XLA.

Can we fix it forward? I'll prepare a patch.

@Artem-B (Member) commented Sep 8, 2023

We'll probably also need similar bitcast lowering for v2bf16. I suspect that now that v2i16 is available, we'll see LLVM picking it for various loads/stores, and bf16 will likely end up using that path, too. Not sure if we'll need bitcasts between v2f16 and v2bf16 in practice, but it would not hurt to add them just in case, too.
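(For illustration, the additional bitcasts in question would look like the following hypothetical sketch; these are not tests from the patch.)

; Hypothetical examples of the v2bf16 bitcasts being discussed; both
; are 32-bit-to-32-bit reinterpretations, like the v2f16/v2i16 case.
define <2 x bfloat> @bitcast_2xi16_to_2xbf16(<2 x i16> %a) {
  %r = bitcast <2 x i16> %a to <2 x bfloat>
  ret <2 x bfloat> %r
}

define <2 x bfloat> @bitcast_2xhalf_to_2xbf16(<2 x half> %a) {
  %r = bitcast <2 x half> %a to <2 x bfloat>
  ret <2 x bfloat> %r
}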

@ThomasRaoux (Contributor, author)

> We'll probably also need similar bitcast lowering for v2bf16. I suspect that now that v2i16 is available, we'll see LLVM picking it for various loads/stores, and bf16 will likely end up using that path, too. Not sure if we'll need bitcasts between v2f16 and v2bf16 in practice, but it would not hurt to add them just in case, too.

Sure, let me add those.

@Artem-B (Member) commented Sep 8, 2023

> Can we fix it forward? I'll prepare a patch.

The fix itself should be simple (a tablegen pattern converting the bitcast to a proxy register?) but we'll also need tests. Should be doable, but I don't know @gribozavr's time constraints.

@Artem-B (Member) commented Sep 8, 2023

In any case, it probably does not matter much one way or another -- it's just one more cherry-picked revision to include in a new pull request.

@ThomasRaoux (Contributor, author)

Looking at it, we do have some tests for bitcast 2xhalf -> 2xi16:

define <2 x i16> @test_bitcast_2xhalf_to_2xi16(<2 x half> %a) #0 {
  %r = bitcast <2 x half> %a to <2 x i16>
  ret <2 x i16> %r
}

@Artem-B (Member) commented Sep 8, 2023

Hmm. Why did they work then? Loads/stores/bitcasts for v2i16 should not have depended on whether the target is sm_90 or not. Or maybe they did, if we set the lowering action to Expand on older GPUs.

You may try running the tests with -mcpu=sm_90

@ThomasRaoux (Contributor, author)

> Hmm. Why did they work then? Loads/stores/bitcasts for v2i16 should not have depended on whether the target is sm_90 or not. Or maybe they did, if we set the lowering action to Expand on older GPUs.
>
> You may try running the tests with -mcpu=sm_90

I tried with -mcpu=sm_90 -mattr=+ptx80 and without it, and I can't reproduce the problem.

gribozavr added a commit that referenced this pull request Sep 8, 2023
…ctions (#65432)"

This reverts commit db5d845.

As per PR discussion "Looks like we've missed lowering of bitcasts
between v2f16 and v2i16 and it breaks XLA."
@Artem-B (Member) commented Sep 8, 2023

Another theory is that the bitcast in the tests didn't actually make it to the lowering and we only had to deal with loads/stores.
https://godbolt.org/z/81e31jj8a
nvptx-isel appears to convert the IR straight into:

# *** IR Dump After NVPTX DAG->DAG Pattern Instruction Selection (nvptx-isel) ***:
# Machine code for function test_bitcast_2xhalf_to_2xi16: IsSSA, TracksLiveness

bb.0 (%ir-block.0):
  %0:int32regs = LD_i32_avar 0, 4, 1, 0, 32, &test_bitcast_2xhalf_to_2xi16_param_0 :: (dereferenceable invariant load (s32) from `ptr addrspace(101) null`, addrspace 101)
  StoreRetvalI32 killed %0:int32regs, 0 :: (store (s32), align 1)
  Return

We may need something more elaborate, like manually constructing v2i16 from an input i16.
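(For example, a test shaped like the following hypothetical sketch would keep the bitcast alive through lowering, since the v2i16 value is built from scalar i16 inputs rather than loaded directly as an i32; the names here are made up.)

; Hypothetical test: building the v2i16 from scalars prevents isel from
; folding the bitcast into a plain 32-bit load/store.
define <2 x half> @test_bitcast_built_v2i16(i16 %x, i16 %y) {
  %v0 = insertelement <2 x i16> undef, i16 %x, i32 0
  %v1 = insertelement <2 x i16> %v0, i16 %y, i32 1
  %r = bitcast <2 x i16> %v1 to <2 x half>
  ret <2 x half> %r
}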

@gribozavr (Collaborator) commented Sep 8, 2023

> Can we fix it forward? I'll prepare a patch.

The HEAD has been broken for us for more than a day, and we need to get to green ASAP. I pushed the revert.

@Artem-B (Member) commented Sep 8, 2023

Aha. This reproduces the problem: https://godbolt.org/z/K1c4PqYoP
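(Judging from the crash log above, the reproducer presumably has roughly the following shape, where the or keeps the bitcast from being folded away; this is a sketch, not necessarily the exact contents of the godbolt link.)

; Sketch of a reproducer: the or forces the v2i16 -> v2f16 bitcast to
; survive into instruction selection, triggering "Cannot select".
define <2 x half> @repro(<2 x i16> %a, <2 x i16> %b) {
  %v = or <2 x i16> %a, %b
  %r = bitcast <2 x i16> %v to <2 x half>
  ret <2 x half> %r
}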

@ThomasRaoux (Contributor, author)

> Aha. This reproduces the problem: https://godbolt.org/z/K1c4PqYoP

Thanks for the repro. I'm going to land it again with the fix (2bd3ce7).

ThomasRaoux added a commit that referenced this pull request Sep 8, 2023
avillega pushed a commit to avillega/llvm-project that referenced this pull request Sep 11, 2023
…lvm#65432)

On sm_90, some instructions now support i16x2, which allows the hardware
to execute add, min, and max instructions more efficiently.

To support that, we need to make i16x2 a native type in the backend.
This makes the necessary changes to treat i16x2 as a native type and
adds support for the instructions that natively support i16x2.

This caused a negative test in the NVPTX SLP vectorizer tests to start
passing. Changed the test to a positive one, as the IR is now correctly
vectorized.
@cheshire (Contributor)

This causes a nearly 2x regression in many Triton kernels on Ampere; we'll post a reproducer.

thomasfaingnaert pushed a commit to thomasfaingnaert/llvm-project that referenced this pull request Jan 15, 2024
qihangkong pushed a commit to rvgpu/llvm that referenced this pull request Apr 18, 2024