[mlir][gpu] Deprecate gpu::Serialization* passes. #65857
Merged
Files changed (test files removed by this PR):
- Deleted: mlir/test/Conversion/GPUToCUDA/lower-nvvm-kernel-to-cubin.mlir (25 deletions)
- Deleted: mlir/test/Conversion/GPUToROCm/lower-rocdl-kernel-to-hsaco.mlir (25 deletions)
- Several other files were deleted (names not shown in this view)
---
@aartbik here's the change that keeps the Sparse pipeline working.
With respect to the tests, thank you! To be honest, I don't know if the problem is my test bed; it behaved erratically today.
---
Thank you for the extra information (with the recent change from Phabricator to PRs, I was a bit behind on all the changes). Again, your cleanup is much appreciated, and I will look into this Monday. Worst case, we will fix it "after the fact"; I don't want to hold up your process unnecessarily, and this codepath is not very critical for us (yet ;-) )
---
I looked into the test, and the

'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'

messages are indeed emitted, but they seem harmless, since they just indicate that the method is called with sizeBytes == 0. I have several places in the GPU libgen path of the sparse compiler that bump a zero size to a size of 4 just to avoid the message. But other than that, the GPU path continues calling cuSPARSE and produces the correct result:

( ( 1, 39, 52, 0, 0, 0, 45, 51 ), ( 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 16, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 25, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 36, 0, 0, 0 ), ( 0, 117, 158, 0, 0, 0, 135, 144 ), ( 0, 156, 318, 0, 0, 0, 301, 324 ), ( 0, 208, 430, 0, 0, 0, 405, 436 ) )
20

So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known? And should we just do something about that in the library wrapper?

@grypp WDYT?
---
Yes, this is a known failure.
Let's put an if statement that checks for zero size in the runtime.
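Something along these lines; a minimal sketch of the guard, with the wrapper name and signature assumed for illustration (the actual MLIR CUDA runtime wrapper may differ):

```cpp
#include <cuda.h>
#include <cstdint>

// Illustrative wrapper around cuMemAlloc; name and signature are assumed,
// not the exact MLIR runtime API.
extern "C" void *mgpuMemAlloc(uint64_t sizeBytes) {
  // cuMemAlloc rejects zero-byte requests with CUDA_ERROR_INVALID_VALUE,
  // so short-circuit and return a null pointer instead of calling it.
  if (sizeBytes == 0)
    return nullptr;
  CUdeviceptr ptr = 0;
  if (cuMemAlloc(&ptr, sizeBytes) != CUDA_SUCCESS)
    return nullptr;
  return reinterpret_cast<void *>(ptr);
}
```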
---
Also, now that the change is in, I actually see true errors (i.e. the computed result is no longer valid), so the cuMemAlloc error message was probably a red herring here. I will look into what went wrong.
---
No, for the nvptxcompiler path you need to pass a special CMake flag, so you're probably not using that path. The issue must be a different one; could you send the GPU model and the toolkit version? Those messages are CUDA ones, so there is not much we can do about them. There's an option to debug: -debug-only=serialize-to-binary; you also need -gpu-module-to-binary=opts="-v".
---
Very nice! Thank you!
With a proper NVPTX-enabled MLIR build, I can debug this much better with these flags. Note that the sparse compiler has a CUDA codegen path (CUDA threads are generated for the outermost loops of sparse code) and a CUDA libgen path (cuSPARSE, cuSPARSELt). For the former, my desktop GPU (Quadro P1000) used to be able to run the "codegen" path with the cubin pass under the given flags, even though sm_80 was a bit too high. By adjusting the test for my desktop, I see a clean run again:

Tool invocation for module: "sparse_kernels"
ptxas -arch sm_50 /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx -o /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --opt-level 2
fatbinary -64 --image3=kind=elf,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --image3=kind=ptx,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx --create /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-290575.bin
-- JIT main !
( 87360, 89440, 91520, 93600, 95680, 97760, 99840, 101920, 104000, 106080, 108160, 110240, 112320, 114400, 116480, 118560, 120640, 122720, 124800, 126880, 128960, 131040, 133120, 135200, 137280, 139360, 141440, 143520, 145600, 147680, 149760, 151840, 153920, 156000, 158080, 160160, 162240, 164320, 166400, 168480, 170560, 172640, 174720, 176800, 178880, 180960, 183040, 185120, 187200, 189280, 191360, 193440, 195520, 197600, 199680, 201760, 203840, 205920, 208000, 210080, 212160, 214240, 216320, 218400 )

Note that we still have an issue in our blaze-based setup getting the nvptxcompiler in place, but that is another story ;-)
---
Oh, I see; I think I know what's happening: the tests use something like chip=sm_80 and are running on a device with a lower compute capability (sm_70).

Background: the compute capability was never utilized at any point in --gpu-to-cubin, and it always compiled the code for the arch found at compile time; that's why the tests never gave issues before. With this new method, there are two mechanisms:

- format=bin compiles only for the specified chip, and if there's an arch mismatch then a CUDA_ERROR_NO_BINARY_FOR_GPU is thrown.
- format=fatbin generates a binary for the specified chip and also embeds the PTX, allowing the driver to JIT the code. However, something I found recently is that the driver can only JIT to a higher CC, so for example chip=sm_50 can run on sm_80, but chip=sm_80 cannot run on sm_50, and in this case one also gets CUDA_ERROR_NO_BINARY_FOR_GPU.

NOTE: fatbin is the default format.

The issue is this line: sparse-matmul-lib.mlir#L5.

How to solve it? Use the lowest arch required for the test; that's why most integration tests in trunk use sm_50.

Please let me know if the above works. I think it will, because when I ran the tests on an A100 those tests passed.

Wrt to blaze, I'm working on the JIT-only path, so stay tuned.
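For reference, a minimal sketch of the compatibility check implied above; the helper name is hypothetical, while the calls themselves are the standard CUDA driver API:

```cpp
#include <cuda.h>

// Hypothetical helper: can a fatbin built for the given chip (e.g. 50 for
// sm_50) load on device 0? With format=fatbin the driver can JIT the
// embedded PTX up to a newer CC, so loading succeeds when the device CC is
// >= the requested chip; otherwise it fails with
// CUDA_ERROR_NO_BINARY_FOR_GPU.
bool canRunChip(int requestedCC) {
  CUdevice dev;
  int major = 0, minor = 0;
  cuInit(0);
  cuDeviceGet(&dev, /*ordinal=*/0);
  cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
  cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
  // e.g. an sm_70 device with chip=sm_80 yields 70 >= 80, i.e. false.
  return major * 10 + minor >= requestedCC;
}
```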
---
Yeah, I changed exactly that line to

--sparse-compiler="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_50 gpu-features=+ptx60"

to make the test pass. The "problem" is that I lazily put all the tests under the mlir_run_cuda_sm80_tests flag, since A100s are sort of the "minimum" GPUs we are interested in. But that gave the pass-before/fail-after behavior for me (which was not really a failure, since I knew my desktop was not up to standard). So all is good!
---
Great, so thankfully the error was not a bug.