[mlir][gpu] Deprecate gpu::Serialization* passes. #65857
Conversation
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the `TargetAttr` workflow. This patch removes remaining upstream uses of the aforementioned passes, including the option to use them in `mlir-opt`. A future patch will remove these passes entirely.
@llvm/pr-subscribers-mlir

NOTE: the `Integration/Dialect/SparseTensor/GPU/CUDA/sparse-gemm-lib.mlir` test failed. However, the test failed even without the switch to the new workflow; if someone else could test and verify it, that would be appreciated. All other tests succeeded.
Thanks for working on a cleaner pipeline! Just for my understanding, what keeps the sparse pipeline working without the SerializeToCubinPass, such that we can simply remove it? I will look into the tests Monday. I am pretty sure all tests passed with the proper build before, but I will have a look.
```c++
GpuNVVMAttachTargetOptions nvvmTargetOptions;
nvvmTargetOptions.triple = options.gpuTriple;
nvvmTargetOptions.chip = options.gpuChip;
nvvmTargetOptions.features = options.gpuFeatures;
pm.addPass(createGpuNVVMAttachTarget(nvvmTargetOptions));
pm.addPass(createGpuToLLVMConversionPass());
pm.addPass(createGpuModuleToBinaryPass());
```
@aartbik here's the change that keeps the Sparse pipeline working.
With respect to the tests, thank you! To be honest, I don't know if the problem is my test bed, it behaved erratically today.
Thank you for the extra information (with the recent change from Phabricator to PR, I was a bit behind on all the changes). Again, your cleanup is much appreciated and I will look into this Monday. Worst case, we will fix it "after the fact", I don't want to hold up your process unnecessarily and this codepath is not very critical for us (yet ;-)
I looked into the test, and the
`'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'`
messages are indeed printed, but they seem harmless, since they just indicate that the method is called with sizeBytes == 0. I have several places in the GPU libgen path of the sparse compiler that bump a zero size to a size of 4 just to avoid the message.
But other than that, the GPU path continues calling cuSPARSE and producing the correct result.
```
( ( 1, 39, 52, 0, 0, 0, 45, 51 ), ( 0, 0, 0, 0, 0, 0, 0, 0 ), ( 0, 0, 16, 0, 0, 0, 0, 0 ), ( 0, 0, 0, 25, 0, 0, 0, 0 ), ( 0, 0, 0, 0, 36, 0, 0, 0 ), ( 0, 117, 158, 0, 0, 0, 135, 144 ), ( 0, 156, 318, 0, 0, 0, 301, 324 ), ( 0, 208, 430, 0, 0, 0, 405, 436 ) )
20
```
So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known? And should we just do something about that in the library wrapper?
@grypp WDYT?
> So, is the issue with cuMemAlloc(&ptr, sizeBytes) for zero size known?

Yes, this is a known failure.

> And should we just do something about that in the library wrapper?

Let's put an if statement that checks for zero size in the runtime.
Also, now that the change is in, I actually see true errors (i.e. the computed result is no longer valid), so the cuMemAlloc error message was probably a red herring here. I will look into what went wrong.
No, for the nvptxcompiler path you need to pass a special CMake flag, so you're probably not using that path. So the issue must be a different one; could you send the GPU model and the toolkit version? Those messages are CUDA ones, so there's not much we can do. There's an option to debug: `-debug-only=serialize-to-binary`; you also need `-gpu-module-to-binary=opts="-v"`.
Very nice! Thank you!
With MLIR properly built for NVPTX, I can debug this much better with these flags. Note that the sparse compiler has a CUDA codegen path (CUDA threads are generated for the outermost loops of sparse code) and a CUDA libgen path (cuSPARSE, cuSPARSELt). For the former, my desktop GPU (Quadro P1000) used to be able to run the "codegen" path with the cubin pass under the given flags (even though sm_80 was a bit too high). By adjusting the test for my desktop, I see a clean run again:
```
Tool invocation for module: "sparse_kernels"
ptxas -arch sm_50 /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx -o /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --opt-level 2
fatbinary -64 --image3=kind=elf,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx.cubin --image3=kind=ptx,sm=50,file=/tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-05ebde.ptx --create /tmp/mlir-sparse_kernels-nvptx64-nvidia-cuda-sm_50-290575.bin
-- JIT main !
( 87360, 89440, 91520, 93600, 95680, 97760, 99840, 101920, 104000, 106080, 108160, 110240, 112320, 114400, 116480, 118560, 120640, 122720, 124800, 126880, 128960, 131040, 133120, 135200, 137280, 139360, 141440, 143520, 145600, 147680, 149760, 151840, 153920, 156000, 158080, 160160, 162240, 164320, 166400, 168480, 170560, 172640, 174720, 176800, 178880, 180960, 183040, 185120, 187200, 189280, 191360, 193440, 195520, 197600, 199680, 201760, 203840, 205920, 208000, 210080, 212160, 214240, 216320, 218400 )
```
Note that we still have an issue in our blaze-based setup getting the nvptxcompiler in place, but that is another story ;-)
Oh, I see, I think I know what's happening: the tests use something like `chip=sm_80` and are running on a device with a lower compute capability (`sm_70`).
Background: the compute capability was never utilized at any point in `--gpu-to-cubin`; it always compiled the code for the arch found at compile time. That's why the tests never gave issues before.
With this new method, there are 2 mechanisms:
1. `format=bin` compiles only for the specified chip, and if there's an arch mismatch then a `CUDA_ERROR_NO_BINARY_FOR_GPU` is thrown.
2. `format=fatbin` generates a binary for the specified chip and also embeds the PTX, allowing the driver to JIT the code. However, something I found recently is that the driver can only JIT to a higher CC, so for example `chip=sm_50` can run on `sm_80`, but `chip=sm_80` cannot run on `sm_50`, and in this case one also gets `CUDA_ERROR_NO_BINARY_FOR_GPU`.

NOTE: `fatbin` is the default format.
The issue is this line: sparse-matmul-lib.mlir#L5.
How to solve? Use the lowest arch required for the test; that's why most integration tests in trunk use `sm_50`.
Please let me know if the above works. I think it will, because when I ran the tests on an A100, those tests passed.
Wrt blaze, I'm working on the JIT-only path, so stay tuned.
Yeah, I changed exactly that line to
```
--sparse-compiler="enable-runtime-library=false parallelization-strategy=dense-outer-loop gpu-triple=nvptx64-nvidia-cuda gpu-chip=sm_50 gpu-features=+ptx60"
```
to make the test pass. The "problem" is that I lazily put all the tests under the `mlir_run_cuda_sm80_tests` flag, since A100s are sort of the "minimum" GPUs we are interested in. But that gave the pass-before/fail-after behavior for me (which was not really a failure, since I knew my desktop was not up to standard). So all is good!
Great, so thankfully the error was not a bug.
Seems reasonable to me. Thank you for doing this.
Let me know if you need help here. It seems like removing the 'convert-gpu-to-nvvm' … FWIW, I'm about to land 4 more tests. I will make sure to use the new workflow.
Can you document this in https://mlir.llvm.org/deprecation/ as well?
@grypp Yes, I need help please; both tests there still refer to …
Yes, I'll change the message there from migration to deprecation. I won't push the patch today; I'll first do a PSA on Discourse on Monday and then wait for the migration of the … I'm even thinking of adding a CMake option to enable the deprecated passes, letting users do a more controlled migration.
The 'TargetAttr' workflow was recently introduced for serializing 'MLIR->LLVM->PTX'. llvm#65857 removes the previous passes (the gpu::Serialization* passes) because they are duplicates. This PR removes the use of the gpu::Serialization* passes in the SM_90 integration tests and enables the 'TargetAttr' workflow. It also moves the transform-dialect-specific test to a new folder.
No problem. Done in #65926
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the `TargetAttr` workflow. This patch removes remaining upstream uses of the aforementioned passes, including the option to use them in `mlir-opt`. A future patch will remove these passes entirely. The passes can be re-enabled in `mlir-opt` by adding the CMake flag: `-DMLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION=1`.
NOTE:
1. `Integration/Dialect/SparseTensor/GPU/CUDA/sparse-gemm-lib.mlir` failed. However, the test failed even without the switch to the new workflow; if someone else could test and verify it, that would be appreciated. All other tests succeeded, including `CUDA_SM80_LT_TESTS`.
2. The SM_90 integration tests still need to be ported to the new workflow, so this patch is dependent on that porting.