[mlir][gpu] Deprecate gpu::Serialization* passes. #65857

Merged (2 commits) on Sep 11, 2023
1 change: 1 addition & 0 deletions mlir/CMakeLists.txt
@@ -114,6 +114,7 @@ else()
 endif()
 add_definitions(-DMLIR_ROCM_CONVERSIONS_ENABLED=${MLIR_ENABLE_ROCM_CONVERSIONS})
 
+set(MLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION 0 CACHE BOOL "Enable deprecated GPU serialization passes")
 set(MLIR_ENABLE_CUDA_RUNNER 0 CACHE BOOL "Enable building the mlir CUDA runner")
 set(MLIR_ENABLE_ROCM_RUNNER 0 CACHE BOOL "Enable building the mlir ROCm runner")
 set(MLIR_ENABLE_SPIRV_CPU_RUNNER 0 CACHE BOOL "Enable building the mlir SPIR-V cpu runner")
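For downstream builds that still need the deprecated passes, the new option is flipped at configure time. A minimal configure sketch, assuming a standard llvm-project monorepo checkout (the generator, source path, and the CUDA runner flag are illustrative, not part of this patch):

  cmake -G Ninja ../llvm \
    -DLLVM_ENABLE_PROJECTS=mlir \
    -DMLIR_ENABLE_CUDA_RUNNER=ON \
    -DMLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION=ON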
4 changes: 4 additions & 0 deletions mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
@@ -134,14 +134,17 @@ class SerializeToBlobPass : public OperationPass<gpu::GPUModuleOp> {
 
 /// Register pass to serialize GPU kernel functions to a CUBIN binary
 /// annotation.
+LLVM_DEPRECATED("use Target attributes instead", "")
 void registerGpuSerializeToCubinPass();
 
 /// Register pass to serialize GPU kernel functions to a HSAco binary
 /// annotation.
+LLVM_DEPRECATED("use Target attributes instead", "")
 void registerGpuSerializeToHsacoPass();
 
 /// Create an instance of the GPU kernel function to CUBIN binary serialization
 /// pass with optLevel (default level 2).
+LLVM_DEPRECATED("use Target attributes instead", "")
 std::unique_ptr<Pass> createGpuSerializeToCubinPass(StringRef triple,
                                                     StringRef chip,
                                                     StringRef features,
@@ -150,6 +153,7 @@ std::unique_ptr<Pass> createGpuSerializeToCubinPass(StringRef triple,
 
 /// Create an instance of the GPU kernel function to HSAco binary serialization
 /// pass.
+LLVM_DEPRECATED("use Target attributes instead", "")
 std::unique_ptr<Pass> createGpuSerializeToHsacoPass(StringRef triple,
                                                     StringRef arch,
                                                     StringRef features,
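The deprecation messages point at the Target-attribute flow as the replacement. A rough command-line mapping, as a sketch only (the pass flag spellings are inferred from the factory functions used later in this patch, and the chip/features values are placeholders):

  # Deprecated flow: serialize each gpu.module to a CUBIN annotation.
  mlir-opt input.mlir ... --gpu-to-cubin
  # Replacement flow: attach an NVVM target, lower the host side, then serialize.
  mlir-opt input.mlir ... \
    --gpu-nvvm-attach-target="chip=sm_80 features=+ptx60" \
    --gpu-to-llvm \
    --gpu-module-to-binary="format=fatbin"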
2 changes: 0 additions & 2 deletions mlir/include/mlir/InitAllPasses.h
@@ -65,8 +65,6 @@ inline void registerAllPasses() {
   bufferization::registerBufferizationPasses();
   func::registerFuncPasses();
   registerGPUPasses();
-  registerGpuSerializeToCubinPass();
-  registerGpuSerializeToHsacoPass();
   registerLinalgPasses();
   registerNVGPUPasses();
   registerSparseTensorPasses();
8 changes: 0 additions & 8 deletions mlir/lib/Dialect/SparseTensor/Pipelines/CMakeLists.txt
@@ -27,11 +27,3 @@ add_mlir_dialect_library(MLIRSparseTensorPipelines
   MLIRVectorToLLVM
   MLIRVectorTransforms
 )
-
-if(MLIR_ENABLE_CUDA_RUNNER)
-  # Enable gpu-to-cubin pass.
-  target_compile_definitions(obj.MLIRSparseTensorPipelines
-    PRIVATE
-    MLIR_GPU_TO_CUBIN_PASS_ENABLE=1
-  )
-endif()
@@ -78,11 +78,13 @@ void mlir::sparse_tensor::buildSparseCompiler(
 
   // Finalize GPU code generation.
   if (gpuCodegen) {
-#if MLIR_GPU_TO_CUBIN_PASS_ENABLE
-    pm.addNestedPass<gpu::GPUModuleOp>(createGpuSerializeToCubinPass(
-        options.gpuTriple, options.gpuChip, options.gpuFeatures));
-#endif
+    GpuNVVMAttachTargetOptions nvvmTargetOptions;
+    nvvmTargetOptions.triple = options.gpuTriple;
+    nvvmTargetOptions.chip = options.gpuChip;
+    nvvmTargetOptions.features = options.gpuFeatures;
+    pm.addPass(createGpuNVVMAttachTarget(nvvmTargetOptions));
     pm.addPass(createGpuToLLVMConversionPass());
+    pm.addPass(createGpuModuleToBinaryPass());
Comment on lines +81 to +87
Contributor Author:

@aartbik here's the change that keeps the Sparse pipeline working.

With respect to the tests, thank you! To be honest, I don't know whether the problem is my test bed; it behaved erratically today.

Contributor:

Thank you for the extra information (with the recent change from Phabricator to PRs, I was a bit behind on all the changes). Again, your cleanup is much appreciated, and I will look into this on Monday. Worst case, we will fix it "after the fact"; I don't want to hold up your process unnecessarily, and this codepath is not very critical for us (yet ;-)

Contributor:

I looked into the test, and the

'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'

messages are indeed printed, but they seem harmless, since they just indicate that the method is called with sizeBytes == 0. I have several places in the GPU libgen path of the sparse compiler that bump a zero size up to 4 just to avoid the message.

But other than that, the GPU path continues calling cuSPARSE and producing the correct result.

( ( 1, 39, 52, 0, 0, 0, 45, 51 ), ( 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 16, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 25, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 36, 0, 0, 0 ), ( 0, 117, 158, 0, 0, 0, 135, 144 ), ( 0, 156, 318, 0, 0, 0, 301, 324 ), ( 0, 208, 430, 0, 0, 0, 405, 436 ) )
20

So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known? And should we just do something about that in the library wrapper?

@grypp WDYT?

Member:

> So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known?

Yes, this is a known failure.

> And should we just do something about that in the library wrapper?

Let's put an if statement that checks for zero size in the runtime.
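A minimal sketch of such a guard against the CUDA driver API; the wrapper name and signature here are hypothetical stand-ins for the actual runtime wrapper, not the change that eventually landed:

  #include <cstdint>
  #include <cstdio>
  #include <cuda.h>

  // Hypothetical allocation wrapper: skip the driver call entirely for
  // zero-byte requests, since cuMemAlloc reports CUDA_ERROR_INVALID_VALUE
  // when sizeBytes == 0.
  extern "C" void *mgpuMemAllocGuarded(uint64_t sizeBytes) {
    if (sizeBytes == 0)
      return nullptr; // Nothing to allocate; callers must tolerate null here.
    CUdeviceptr ptr = 0;
    CUresult status = cuMemAlloc(&ptr, sizeBytes);
    if (status != CUDA_SUCCESS) {
      std::fprintf(stderr, "cuMemAlloc(&ptr, %llu) failed with error %d\n",
                   static_cast<unsigned long long>(sizeBytes),
                   static_cast<int>(status));
      return nullptr;
    }
    return reinterpret_cast<void *>(ptr);
  }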

Contributor:

Also, now that the change is in, I actually see true errors (i.e. the computed result is no longer valid), so the cuMemAlloc error message was probably a red herring here. I will look into what went wrong.

Contributor Author (fabianmcg, Sep 12, 2023):

No, for the nvptxcompiler path you need to pass a special CMake flag, so you're probably not using that path. So the issue must be a different one; could you send the GPU model and the toolkit version? Those messages are CUDA ones, so there's not much we can do about them. There's an option to debug: -debug-only=serialize-to-binary; you also need -gpu-module-to-binary=opts="-v".
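Putting those two options together, a debug run would look roughly like this (a sketch: the input file and the rest of the pipeline are placeholders, and -debug-only requires an MLIR build with assertions enabled):

  mlir-opt kernels.mlir \
    --gpu-nvvm-attach-target="chip=sm_50" \
    --gpu-to-llvm \
    --gpu-module-to-binary="opts=-v" \
    --debug-only=serialize-to-binary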

Contributor:

Very nice! Thank you!

With a proper NVPTX-enabled MLIR build, I can debug this much better with these flags. Note that the sparse compiler has a CUDA codegen path (CUDA threads are generated for the outermost loops of sparse code) and a CUDA libgen path (cuSPARSE, cuSPARSElt). For the former, my desktop GPU (Quadro P1000) used to be able to run the "codegen" path with the cubin pass under the given flags (even though sm_80 was a bit too high). By adjusting the test for my desktop, I see a clean run again:

Tool invocation for module: "sparse_kernels"
ptxas -arch sm_50 /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx -o /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --opt-level 2
fatbinary -64 --image3=kind=elf,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --image3=kind=ptx,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx --create /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-290575.bin

-- JIT main !

( 87360, 89440, 91520, 93600, 95680, 97760, 99840, 101920, 104000, 106080, 108160, 110240, 112320, 114400, 116480, 118560, 120640, 122720, 124800, 126880, 128960, 131040, 133120, 135200, 137280, 139360, 141440, 143520, 145600, 147680, 149760, 151840, 153920, 156000, 158080, 160160, 162240, 164320, 166400, 168480, 170560, 172640, 174720, 176800, 178880, 180960, 183040, 185120, 187200, 189280, 191360, 193440, 195520, 197600, 199680, 201760, 203840, 205920, 208000, 210080, 212160, 214240, 216320, 218400 )

Note that we still have an issue in our blaze-based setup getting the nvptxcompiler in place, but that is another story ;-)

Contributor Author:

Oh, I see. I think I know what's happening: the tests use something like chip=sm_80 and are running on a device with a lower compute capability (sm_70).

Background: the compute capability was never utilized at any point in --gpu-to-cubin; it always compiled the code for the arch found at compile time, which is why the tests never gave issues before.

With this new method, there are two mechanisms:

  • format=bin compiles only for the specified chip; if there's an arch mismatch, a CUDA_ERROR_NO_BINARY_FOR_GPU is thrown.
  • format=fatbin generates a binary for the specified chip and also embeds the PTX, allowing the driver to JIT the code. However, something I found recently is that the driver can only JIT to a higher CC; for example, chip=sm_50 can run on sm_80, but chip=sm_80 cannot run on sm_50, and in that case one also gets CUDA_ERROR_NO_BINARY_FOR_GPU.

NOTE: fatbin is the default format.

The issue is this line: sparse-matmul-lib.mlir#L5.

How to solve it? Use the lowest arch required for the test; that's why most integration tests in trunk use sm_50.

Please let me know if the above works. I think it will, because those tests passed when I ran them on an A100.

Wrt blaze, I'm working on the JIT-only path, so stay tuned.
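Spelled out with the options discussed above (a sketch; the flag spellings follow the format=... names used in this thread, and sm_50 follows the "lowest arch" advice):

  # bin: only the sm_50 cubin is embedded; an arch mismatch fails with CUDA_ERROR_NO_BINARY_FOR_GPU.
  mlir-opt ... --gpu-nvvm-attach-target="chip=sm_50" --gpu-to-llvm --gpu-module-to-binary="format=bin"
  # fatbin (the default): embeds the cubin plus the PTX, so a newer device can still JIT it.
  mlir-opt ... --gpu-nvvm-attach-target="chip=sm_50" --gpu-to-llvm --gpu-module-to-binary="format=fatbin"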

Contributor:

Yeah, I changed exactly that line to

--sparse-compiler="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_50 gpu-features=+ptx60"

to make the test pass. The "problem" is that I lazily put all the tests under the mlir_run_cuda_sm80_tests flag, since A100s are sort of the "minimum" GPUs we are interested in. But that gave the pass-before/fail-after behavior for me (which was not really a fail since I knew my desktop was not up-to-standard). So all is good!

Contributor Author:

Great, so thankfully the error was not a bug.

   }
 
   pm.addPass(createReconcileUnrealizedCastsPass());
2 changes: 0 additions & 2 deletions mlir/test/Conversion/GPUToCUDA/lit.local.cfg

This file was deleted.

25 changes: 0 additions & 25 deletions mlir/test/Conversion/GPUToCUDA/lower-nvvm-kernel-to-cubin.mlir

This file was deleted.

2 changes: 0 additions & 2 deletions mlir/test/Conversion/GPUToROCm/lit.local.cfg

This file was deleted.

25 changes: 0 additions & 25 deletions mlir/test/Conversion/GPUToROCm/lower-rocdl-kernel-to-hsaco.mlir

This file was deleted.

11 changes: 0 additions & 11 deletions mlir/test/lib/Dialect/GPU/CMakeLists.txt
@@ -31,8 +31,6 @@ set(LIBS
 )
 
 add_mlir_library(MLIRGPUTestPasses
-  TestConvertGPUKernelToCubin.cpp
-  TestConvertGPUKernelToHsaco.cpp
   TestGpuMemoryPromotion.cpp
   TestGpuRewrite.cpp
   TestLowerToNVVM.cpp
@@ -43,12 +41,3 @@ add_mlir_library(MLIRGPUTestPasses
   ${LIBS}
 )
 
-# This is how it is defined in mlir/lib/Dialect/GPU/CMakeLists.txt
-# We probably want something better project-wide
-if(MLIR_ENABLE_CUDA_RUNNER)
-  # Enable gpu-to-cubin pass.
-  target_compile_definitions(MLIRGPUTestPasses
-    PRIVATE
-    MLIR_GPU_TO_CUBIN_PASS_ENABLE=1
-  )
-endif()
73 changes: 0 additions & 73 deletions mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToCubin.cpp

This file was deleted.

72 changes: 0 additions & 72 deletions mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToHsaco.cpp

This file was deleted.

8 changes: 8 additions & 0 deletions mlir/tools/mlir-opt/CMakeLists.txt
@@ -91,3 +91,11 @@ llvm_update_compile_flags(mlir-opt)
 
 mlir_check_all_link_libraries(mlir-opt)
 export_executable_symbols_for_plugins(mlir-opt)
+
+if(MLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION)
+  # Enable deprecated serialization passes.
+  target_compile_definitions(mlir-opt
+    PRIVATE
+    MLIR_DEPRECATED_GPU_SERIALIZATION_ENABLE=1
+  )
+endif()
10 changes: 4 additions & 6 deletions mlir/tools/mlir-opt/mlir-opt.cpp
@@ -80,8 +80,6 @@ void registerTestCallGraphPass();
 void registerTestCfAssertPass();
 void registerTestConstantFold();
 void registerTestControlFlowSink();
-void registerTestGpuSerializeToCubinPass();
-void registerTestGpuSerializeToHsacoPass();
 void registerTestDataLayoutPropagation();
 void registerTestDataLayoutQuery();
 void registerTestDeadCodeAnalysisPass();
@@ -204,11 +202,7 @@ void registerTestPasses() {
   mlir::test::registerTestDiagnosticsPass();
   mlir::test::registerTestDialectConversionPasses();
 #if MLIR_CUDA_CONVERSIONS_ENABLED
-  mlir::test::registerTestGpuSerializeToCubinPass();
   mlir::test::registerTestLowerToNVVM();
 #endif
-#if MLIR_ROCM_CONVERSIONS_ENABLED
-  mlir::test::registerTestGpuSerializeToHsacoPass();
-#endif
   mlir::test::registerTestDecomposeCallGraphTypes();
   mlir::test::registerTestDataLayoutPropagation();
@@ -270,6 +264,10 @@
 
 int main(int argc, char **argv) {
   registerAllPasses();
+#if MLIR_DEPRECATED_GPU_SERIALIZATION_ENABLE == 1
+  registerGpuSerializeToCubinPass();
+  registerGpuSerializeToHsacoPass();
+#endif
 #ifdef MLIR_INCLUDE_TESTS
   registerTestPasses();
 #endif