[mlir][gpu] Deprecate gpu::Serialization* passes. #65857
Conversation
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the `TargetAttr` workflow. This patch removes remaining upstream uses of the aforementioned passes, including the option to use them in `mlir-opt`. A future patch will remove these passes entirely.
@llvm/pr-subscribers-mlir

NOTE: the `Integration/Dialect/SparseTensor/GPU/CUDA/sparse-gemm-lib.mlir` test failed. However, the test failed even without the switch to the new workflow; if someone else could test and verify it, that would be appreciated. All other tests succeeded.
Thanks for working on a cleaner pipeline! Just for my understanding, what keeps the sparse pipeline working without the SerializeToCubinPass, such that we can simply remove it? I will look into the tests Monday. I am pretty sure all tests passed with the proper build before, but I will have a look.
```c++
GpuNVVMAttachTargetOptions nvvmTargetOptions;
nvvmTargetOptions.triple = options.gpuTriple;
nvvmTargetOptions.chip = options.gpuChip;
nvvmTargetOptions.features = options.gpuFeatures;
pm.addPass(createGpuNVVMAttachTarget(nvvmTargetOptions));
pm.addPass(createGpuToLLVMConversionPass());
pm.addPass(createGpuModuleToBinaryPass());
```
@aartbik here's the change that keeps the Sparse pipeline working.
With respect to the tests, thank you! To be honest, I don't know if the problem is my test bed, it behaved erratically today.
Thank you for the extra information (with the recent change from Phabricator to PR, I was a bit behind on all the changes). Again, your cleanup is much appreciated and I will look into this Monday. Worst case, we will fix it "after the fact", I don't want to hold up your process unnecessarily and this codepath is not very critical for us (yet ;-)
I looked into the test, and the
`'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'`
messages are indeed printed, but they seem harmless, since they just indicate that the method is called with sizeBytes == 0. I have several places in the GPU libgen path of the sparse compiler that bump a zero size to a size of 4 just to avoid the message.
But other than that, the GPU path continues calling cuSPARSE and producing the correct result.
```
( ( 1, 39, 52, 0, 0, 0, 45, 51 ), ( 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 16, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 25, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 36, 0, 0, 0 ), ( 0, 117, 158, 0, 0, 0, 135, 144 ), ( 0, 156, 318, 0, 0, 0, 301, 324 ), ( 0, 208, 430, 0, 0, 0, 405, 436 ) )
20
```
So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known? And should we just do something about that in the library wrapper?
@grypp WDYT?
> So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known?

Yes, this is a known failure.

> And should we just do something about that in the library wrapper?

Let's put an if statement that checks for zero size in the runtime.
Also, now that the change is in, I actually see true errors (i.e. the computed result is no longer valid), so the cuMemAlloc error message was probably a red herring here. I will look into what went wrong.
No, for the nvptxcompiler path you need to pass a special CMake flag, so you're probably not using that path. So the issue must be a different one; could you send the GPU model and the toolkit version? Those messages are CUDA ones, so there's not much we can do. There's an option to debug: `-debug-only=serialize-to-binary`; you also need `-gpu-module-to-binary=opts="-v"`.
Very nice! Thank you!
With MLIR properly built for NVPTX, I can debug this much better with these flags. Note that the sparse compiler has a CUDA codegen path (CUDA threads are generated for the outermost loops of sparse code) and a CUDA libgen path (cuSPARSE, cuSPARSELt). For the former, my desktop GPU (Quadro P1000) used to be able to run the "codegen" path with the cubin pass under the given flags (even though sm_80 was a bit too high). By adjusting the test for my desktop, I see a clean run again:
```
Tool invocation for module: "sparse_kernels"
ptxas -arch sm_50 /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx -o /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --opt-level 2
fatbinary -64 --image3=kind=elf,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --image3=kind=ptx,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx --create /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-290575.bin
-- JIT main !
( 87360, 89440, 91520, 93600, 95680, 97760, 99840, 101920, 104000, 106080, 108160, 110240, 112320, 114400, 116480, 118560, 120640, 122720, 124800, 126880, 128960, 131040, 133120, 135200, 137280, 139360, 141440, 143520, 145600, 147680, 149760, 151840, 153920, 156000, 158080, 160160, 162240, 164320, 166400, 168480, 170560, 172640, 174720, 176800, 178880, 180960, 183040, 185120, 187200, 189280, 191360, 193440, 195520, 197600, 199680, 201760, 203840, 205920, 208000, 210080, 212160, 214240, 216320, 218400 )
```
Note that we still have an issue in our blaze-based setup getting the nvptxcompiler in place, but that is another story ;-)
Oh, I see, I think I know what's happening: the tests use something like `chip=sm_80` and are running on a device with a lower compute capability (`sm_70`).
Background: the compute capability was never utilized at any point in `--gpu-to-cubin`; it always compiled the code for the arch found at compile time. That's why the tests never gave issues before.
With this new method, there are 2 mechanisms:
1. `format=bin` compiles only for the specified chip, and if there's an arch mismatch then a `CUDA_ERROR_NO_BINARY_FOR_GPU` is thrown.
2. `format=fatbin` generates a binary for the specified chip and also embeds the PTX, allowing the driver to JIT the code. However, something I found recently is that the driver can only JIT to a higher CC, so for example `chip=sm_50` can run on `sm_80`, but `chip=sm_80` cannot run on `sm_50`, and in this case one also gets `CUDA_ERROR_NO_BINARY_FOR_GPU`.

NOTE: `fatbin` is the default format.
The issue is this line: sparse-matmul-lib.mlir#L5.
How to solve? Use the lowest arch required for the test; that's why most integration tests in trunk use `sm_50`.
Please let me know if the above works. I think it will, because when I ran the tests on an A100, those tests passed.
Wrt blaze, I'm working on the JIT-only path, so stay tuned.
Yeah, I changed exactly that line to
```
--sparse-compiler="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_50 gpu-features=+ptx60"
```
to make the test pass. The "problem" is that I lazily put all the tests under the `mlir_run_cuda_sm80_tests` flag, since A100s are sort of the "minimum" GPUs we are interested in. But that gave the pass-before/fail-after behavior for me (which was not really a failure, since I knew my desktop was not up to standard). So all is good!
Great, so thankfully the error was not a bug.
Seems reasonable to me. Thank you for doing this.
Let me know if you need help here. It seems like removing the 'convert-gpu-to-nvvm' … FWIW, I'm about to land 4 more tests. I will make sure to use the new workflow.
Can you document this in https://mlir.llvm.org/deprecation/ as well?
@grypp Yes, I need help please; both tests there still refer to …
Yes, I'll change the message there from migration to deprecation. I won't push the patch today; I'll first do a PSA on Discourse on Monday and then wait for the migration of the … I'm even thinking of adding a CMake option to enable the deprecated passes, letting users do a more controlled migration.
The 'TargetAttr' workflow was recently introduced for serializing 'MLIR->LLVM->PTX'. llvm#65857 removes the previous passes (the gpu::Serialization* passes) because they are duplicates. This PR removes the use of the gpu::Serialization* passes in the SM_90 integration tests and enables the 'TargetAttr' workflow. It also moves the transform-dialect-specific test to a new folder.
No problem. Done in #65926
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the `TargetAttr` workflow. This patch removes remaining upstream uses of the aforementioned passes, including the option to use them in `mlir-opt`. A future patch will remove these passes entirely. The passes can be re-enabled in `mlir-opt` by adding the CMake flag: `-DMLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION=1`.
NOTE:
1. `Integration/Dialect/SparseTensor/GPU/CUDA/sparse-gemm-lib.mlir` failed. However, the test failed even without the switch to the new workflow; if someone else could test and verify it, that would be appreciated. All other tests succeeded, including `CUDA_SM80_LT_TESTS`.
2. The SM_90 integration tests still need to be ported to the new workflow, so this patch is dependent on that porting.