
[SYCL][HIP] sycl::atomic_ref::fetch_sub does not work with USM #7252


Closed
krasznaa opened this issue Nov 2, 2022 · 40 comments
Labels: bug (Something isn't working), confirmed, hip (Issues related to execution on HIP backend)

Comments

@krasznaa
Contributor

krasznaa commented Nov 2, 2022

Describe the bug

While testing our project (https://github.com/acts-project/vecmem) with the HIP backend of the compiler, I found a failure in the unit tests with atomic operations.

[bash][pcadp04]:build > ./bin/vecmem_test_sycl
Running main() from /afs/cern.ch/user/k/krasznaa/work/projects/vecmem/build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 19 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 5 tests from sycl_containers_test
[ RUN      ] sycl_containers_test.shared_memory
[       OK ] sycl_containers_test.shared_memory (8 ms)
[ RUN      ] sycl_containers_test.device_memory
[       OK ] sycl_containers_test.device_memory (19 ms)
[ RUN      ] sycl_containers_test.atomic_memory
/afs/cern.ch/user/k/krasznaa/work/projects/vecmem/vecmem/tests/sycl/test_sycl_containers.sycl:210: Failure
Expected equality of these values:
  value
    Which is: 800
  ITERATIONS * 4
    Which is: 400
...
/afs/cern.ch/user/k/krasznaa/work/projects/vecmem/vecmem/tests/sycl/test_sycl_containers.sycl:210: Failure
Expected equality of these values:
  value
    Which is: 800
  ITERATIONS * 4
    Which is: 400
[  FAILED  ] sycl_containers_test.atomic_memory (7 ms)
[ RUN      ] sycl_containers_test.extendable_memory
[       OK ] sycl_containers_test.extendable_memory (13 ms)
[ RUN      ] sycl_containers_test.array_memory
[       OK ] sycl_containers_test.array_memory (4 ms)
[----------] 5 tests from sycl_containers_test (54 ms total)

[----------] 5 tests from sycl_jagged_containers_test
[ RUN      ] sycl_jagged_containers_test.mutate_in_kernel
[       OK ] sycl_jagged_containers_test.mutate_in_kernel (9 ms)
[ RUN      ] sycl_jagged_containers_test.set_in_kernel
[       OK ] sycl_jagged_containers_test.set_in_kernel (14 ms)
[ RUN      ] sycl_jagged_containers_test.set_in_contiguous_kernel
[       OK ] sycl_jagged_containers_test.set_in_contiguous_kernel (13 ms)
[ RUN      ] sycl_jagged_containers_test.filter
[       OK ] sycl_jagged_containers_test.filter (15 ms)
[ RUN      ] sycl_jagged_containers_test.zero_capacity
[       OK ] sycl_jagged_containers_test.zero_capacity (20 ms)
[----------] 5 tests from sycl_jagged_containers_test (73 ms total)

[----------] 3 tests from sycl_memory_resource_tests/memory_resource_test_basic
[ RUN      ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/device_resource
[       OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/device_resource (0 ms)
[ RUN      ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/host_resource
[       OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/host_resource (5 ms)
[ RUN      ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/shared_resource
[       OK ] sycl_memory_resource_tests/memory_resource_test_basic.allocations/shared_resource (5 ms)
[----------] 3 tests from sycl_memory_resource_tests/memory_resource_test_basic (10 ms total)

[----------] 6 tests from sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible
[ RUN      ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/host_resource
[       OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/host_resource (0 ms)
[ RUN      ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/shared_resource
[       OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.int_value/shared_resource (0 ms)
[ RUN      ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/host_resource
[       OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/host_resource (0 ms)
[ RUN      ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/shared_resource
[       OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.double_value/shared_resource (0 ms)
[ RUN      ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/host_resource
[       OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/host_resource (0 ms)
[ RUN      ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/shared_resource
[       OK ] sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible.custom_value/shared_resource (0 ms)
[----------] 6 tests from sycl_host_accessible_memory_resource_tests/memory_resource_test_host_accessible (0 ms total)

[----------] Global test environment tear-down
[==========] 19 tests from 4 test suites ran. (139 ms total)
[  PASSED  ] 18 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] sycl_containers_test.atomic_memory

 1 FAILED TEST

The reason I wanted to show these results is that this seems to be the only remaining failure with this backend. 😄

To Reproduce

I made the following example to reproduce the issue in a standalone way:

// SYCL include(s).
#include <CL/sycl.hpp>

// System include(s).
#include <iostream>
#include <memory>

int main() {

   // Create a queue.
   sycl::queue queue;
   std::cout << "Using device: "
             << queue.get_device().get_info<sycl::info::device::name>()
             << std::endl;

   // Allocate a small shared memory buffer.
   static constexpr std::size_t BUFFER_SIZE=10;
   int* buffer = static_cast<int*>(sycl::malloc_shared(BUFFER_SIZE*sizeof(int), queue));

   // Set all elements of it to some well defined value.
   for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
      buffer[i] = 100;
   }

   // The type of atomic reference to use.
   using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                       sycl::memory_scope::device,
                                       sycl::access::address_space::global_space>;

   // Run a kernel that would decrement each value atomically.
   queue.submit([buffer](sycl::handler& h) {
                   h.parallel_for<class atomic_test>(BUFFER_SIZE,
                                                     [buffer](sycl::item<1> id) {
                                                        atomic_ref a(buffer[id.get_linear_id()]);
                                                        a.fetch_sub(5);
                                                     });
                }).wait_and_throw();

   // Check whether the decrement succeeded.
   for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
      if (buffer[i] != 95) {
         std::cerr << "buffer[" << i << "] = " << buffer[i] << std::endl;
      }
   }

   // Free the buffer.
   sycl::free(buffer, queue);

   // Return gracefully.
   return 0;
}

With this source, I see:

[bash][pcadp04]:sycl-atomics > clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx1031 sycl-atomics.cpp
warning: linking module '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc': Linking two modules of different target triples: '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-amdgcn--amdhsa.bc' is 'amdgcn-unknown-amdhsa' whereas 'sycl-atomics.cpp' is 'amdgcn-amd-amdhsa'
 [-Wlinker-warnings]
1 warning generated.
[bash][pcadp04]:sycl-atomics > ./a.out 
Using device: AMD Radeon RX 6700 XT
buffer[0] = 100
buffer[1] = 100
buffer[2] = 100
buffer[3] = 100
buffer[4] = 100
buffer[5] = 100
buffer[6] = 100
buffer[7] = 100
buffer[8] = 100
buffer[9] = 100
[bash][pcadp04]:sycl-atomics >

While other backends produce the correct results. Like:

[bash][pcadp04]:sycl-atomics > clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda sycl-atomics.cpp
clang-16: warning: CUDA version is newer than the latest supported version 11.5 [-Wunknown-cuda-version]
warning: linking module '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-nvptx64--nvidiacl.bc': Linking two modules of different target triples: '/mnt/hdd1/software/intel/clang-2022-09/x86_64-centos9-gcc11-opt/lib/clang/16.0.0/../../clc/remangled-l64-signed_char.libspirv-nvptx64--nvidiacl.bc' is 'nvptx64-unknown-nvidiacl' whereas 'sycl-atomics.cpp' is 'nvptx64-nvidia-cuda'
 [-Wlinker-warnings]
1 warning generated.
[bash][pcadp04]:sycl-atomics > ./a.out 
Using device: NVIDIA RTX A5000
[bash][pcadp04]:sycl-atomics >

A couple of things to note:

  • When using sycl::buffer for memory management, the issue did not show up. It seems to be limited to using USM.
    • I only tried it with "shared" USM, did not try what happens with "device" USM.
  • I didn't test every atomic operation possible, but for instance fetch_add(...) does work, even with USM. It's fetch_sub(...) that notably doesn't work.
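For comparison, here is a minimal sketch (untested against this exact setup) of what the sycl::buffer-based variant of the reproducer looks like; per the first bullet above, this memory-management path did not show the failure:

```cpp
// Sketch only: buffer/accessor variant of the shared-USM reproducer.
// The kernel name atomic_buf_test is illustrative.
#include <CL/sycl.hpp>

#include <iostream>
#include <vector>

int main() {
    sycl::queue queue;
    std::vector<int> data(10, 100);
    {
        sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(data.size()));
        queue.submit([&](sycl::handler& h) {
            auto acc = buf.get_access<sycl::access::mode::read_write>(h);
            h.parallel_for<class atomic_buf_test>(
                sycl::range<1>(data.size()), [=](sycl::item<1> id) {
                    sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                     sycl::memory_scope::device,
                                     sycl::access::address_space::global_space>
                        a(acc[id]);
                    a.fetch_sub(5);
                });
        });
    } // The buffer destructor copies the data back to the host vector.
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] != 95) {
            std::cerr << "data[" << i << "] = " << data[i] << std::endl;
        }
    }
    return 0;
}
```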

Environment (please complete the following information):

  • OS: CentOS Stream 9, although I've seen the same on Ubuntu 20.04 before as well. (I don't think the exact Linux version matters.)
  • Target device and vendor: AMD GPU
  • DPC++ version: 2022-09
  • Dependencies version: In this latest test I used ROCm-5.3.0, but previously I've seen the same behaviour with ROCm-4.2.0 as well.

Pinging @ivorobts.

@krasznaa krasznaa added the bug Something isn't working label Nov 2, 2022
@AlexeySachkov AlexeySachkov added the hip Issues related to execution on HIP backend. label Nov 2, 2022
@zjin-lcf
Contributor

zjin-lcf commented Nov 4, 2022

I could reproduce the error with the DPC++ version: 2022-06.

@krasznaa
Contributor Author

Any progress on this? I still see this (buggy) behaviour with sycl-nightly/20230117. 😦

@zjin-lcf
Contributor

The device USM works.

@hdelan
Contributor

hdelan commented Jan 27, 2023

So all that malloc_shared does is call the underlying shared-memory API, hipMallocManaged. It seems that (on my setup) atomics on managed memory are not supported even in native HIP. See this example:

#include <hip/hip_runtime.h>
#include <iostream>

__global__ void test_atomic(float *fp) {
  atomicAdd(fp, 1);
}

int main() {
  float *ptr;
  hipStream_t s;
  hipStreamCreate(&s);
  hipMallocManaged(&ptr, sizeof(float));
  ptr[0] = 100;
  hipLaunchKernelGGL(test_atomic, 1, 1, 0, s, ptr);
  hipStreamSynchronize(s);
  printf("After: %f\n", ptr[0]);
}

Compiling with:

$ hipcc test.cpp

Run with:

$ ./a.out
[1]    349648 bus error (core dumped)  ./a.out

I will make a change to the PI so that atomics cannot be used with a malloc_shared allocation.

Device USM (using sycl::malloc_device) should work with the SYCL atomics for HIP backend. Let me know if you find a case where it is not working and we will investigate.
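For completeness, a hedged sketch of the malloc_device variant described above (the kernel name atomic_dev_test is illustrative); this is the path that is expected to work:

```cpp
// Sketch only: the same fetch_sub test, but on a device USM allocation.
#include <CL/sycl.hpp>

#include <iostream>

int main() {
    sycl::queue queue;
    static constexpr std::size_t BUFFER_SIZE = 10;
    int* buffer = sycl::malloc_device<int>(BUFFER_SIZE, queue);

    // Device allocations are not host-accessible, so initialize on the device.
    queue.fill(buffer, 100, BUFFER_SIZE).wait();

    using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                        sycl::memory_scope::device,
                                        sycl::access::address_space::global_space>;
    queue.parallel_for<class atomic_dev_test>(
             sycl::range<1>(BUFFER_SIZE),
             [=](sycl::item<1> id) {
                 atomic_ref a(buffer[id.get_linear_id()]);
                 a.fetch_sub(5);
             })
        .wait();

    // Copy the results back to the host to verify them.
    int host[BUFFER_SIZE];
    queue.memcpy(host, buffer, BUFFER_SIZE * sizeof(int)).wait();
    for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
        if (host[i] != 95) {
            std::cerr << "host[" << i << "] = " << host[i] << std::endl;
        }
    }
    sycl::free(buffer, queue);
    return 0;
}
```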

OS: Ubuntu 20.04
Kernel: 5.11.0-46-generic
Rocm: 4.5.2
GPU: AMD Radeon VII 0.0 [HIP 40421.43]

@krasznaa
Contributor Author

😕 That is not what I see.

[bash][pcadp04]:hip > more test.cpp 
#include <hip/hip_runtime.h>
#include <iostream>

__global__ void test_atomic(float *fp) {
  atomicAdd(fp, 1);
}

int main() {
  float *ptr;
  hipStream_t s;
  hipStreamCreate(&s);
  hipMallocManaged(&ptr, sizeof(float));
  ptr[0] = 100;
  hipLaunchKernelGGL(test_atomic, 1, 1, 0, s, ptr);
  hipStreamSynchronize(s);
  printf("After: %f\n", ptr[0]);
}
[bash][pcadp04]:hip > hipcc test.cpp 
[bash][pcadp04]:hip > ./a.out 
After: 101.000000
[bash][pcadp04]:hip >

I didn't see any issue with atomic additions in our unit tests. Only with fetch_sub.

I use:

  • CentOS Stream 9
  • ROCm 5.4.2
  • AMD Radeon RX 6700 XT

@hdelan
Contributor

hdelan commented Jan 27, 2023

Aha, that is interesting. Can you replicate the problem in HIP, or is it only a problem in SYCL?

@krasznaa
Contributor Author

Moreover:

[bash][pcadp04]:hip > more test.cpp 
#include <hip/hip_runtime.h>
#include <iostream>

__global__ void test_atomic(float *fp) {
  atomicSub(fp, 1);
}

int main() {
  float *ptr;
  hipStream_t s;
  hipStreamCreate(&s);
  hipMallocManaged(&ptr, sizeof(float));
  ptr[0] = 100;
  hipLaunchKernelGGL(test_atomic, 1, 1, 0, s, ptr);
  hipStreamSynchronize(s);
  printf("After: %f\n", ptr[0]);
}
[bash][pcadp04]:hip > hipcc test.cpp 
[bash][pcadp04]:hip > ./a.out 
After: 99.000000
[bash][pcadp04]:hip >

So I maintain that the problem is somewhere in the Intel compiler. 😦

@krasznaa
Contributor Author

krasznaa commented Jan 27, 2023

As I wrote at the beginning, we see this in the unit tests of one of our projects.

But it's only fetch_sub that fails in all of those tests. Every other operation succeeds. (If I take out the fetch_sub call and adjust the expected values at the end of the test, it succeeds.)

@hdelan
Contributor

hdelan commented Jan 27, 2023

It turns out that my GPU does not support managed memory (although hipMallocManaged appears to work anyway). I will come back to this when I get my hands on a GPU that supports managed memory.
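As an aside, whether a device actually supports managed memory can be queried up front; a minimal HIP sketch using the standard hipDeviceProp_t fields:

```cpp
#include <hip/hip_runtime.h>

#include <cstdio>

int main() {
    hipDeviceProp_t props;
    if (hipGetDeviceProperties(&props, 0) != hipSuccess) {
        fprintf(stderr, "hipGetDeviceProperties failed\n");
        return 1;
    }
    // managedMemory: the device can allocate managed memory at all.
    // concurrentManagedAccess: the device can access managed memory
    // concurrently with the CPU, without explicit migration.
    printf("device: %s\n", props.name);
    printf("managedMemory: %d\n", props.managedMemory);
    printf("concurrentManagedAccess: %d\n", props.concurrentManagedAccess);
    return 0;
}
```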

@hdelan
Contributor

hdelan commented Jan 27, 2023

@krasznaa if you check out this patch (#8003), is the behaviour the same?

@hdelan
Contributor

hdelan commented Jan 27, 2023

And can you also confirm that this is only happening with sycl::malloc_shared, i.e. not with sycl::malloc_device?

@zjin-lcf
Contributor

@krasznaa In my experiment, fetch_xor fails with sycl::malloc_shared. However, you mentioned that only fetch_sub fails.

@krasznaa
Contributor Author

@krasznaa In my experiment, fetch_xor fails with sycl::malloc_shared. However, you mentioned that only fetch_sub fails.

I should have been more precise. As you can see in https://github.com/acts-project/vecmem/blob/main/tests/sycl/test_sycl_containers.sycl#L184-L216, we only actually test

  • fetch_add;
  • fetch_sub;
  • fetch_and;
  • fetch_or;

in our code at the moment. Out of these 4, only fetch_sub seems to be failing. But I can well imagine that fetch_xor could also be failing. (For our project's purposes, testing every single function explicitly in the unit tests seemed like overkill. Our main project, which builds on this base project, doesn't even actively use fetch_sub at the moment. 😛 So our actual applications can already run successfully on AMD cards. 😄)

@zjin-lcf
Contributor

I see. When printing the value in your kernel using sycl::stream, I observe that the value is not updated (i.e. still 100) for atomic_sub on the "shared" USM allocation, even though the compiler generates an atomic-sub instruction in the assembly. Comparing the HIP and SYCL versions of the programs in your project would help people understand the portability issue.
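A minimal sketch of how such an in-kernel print can be done, assuming the same shared-USM setup as the original reproducer (the kernel name atomic_debug and the stream sizes are illustrative):

```cpp
// Sketch only: shared-USM fetch_sub with a sycl::stream to print the value
// before and after the atomic operation.
#include <CL/sycl.hpp>

int main() {
    sycl::queue queue;
    static constexpr std::size_t BUFFER_SIZE = 10;
    int* buffer = static_cast<int*>(
        sycl::malloc_shared(BUFFER_SIZE * sizeof(int), queue));
    for (std::size_t i = 0; i < BUFFER_SIZE; ++i) {
        buffer[i] = 100;
    }

    using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                        sycl::memory_scope::device,
                                        sycl::access::address_space::global_space>;
    queue.submit([&](sycl::handler& h) {
        // 1024-byte stream buffer, 256-byte maximum statement size.
        sycl::stream out(1024, 256, h);
        h.parallel_for<class atomic_debug>(
            sycl::range<1>(BUFFER_SIZE), [=](sycl::item<1> id) {
                atomic_ref a(buffer[id.get_linear_id()]);
                const int before = a.load();
                a.fetch_sub(5);
                out << "item " << id.get_linear_id() << ": before=" << before
                    << " after=" << a.load() << sycl::endl;
            });
    }).wait_and_throw();

    sycl::free(buffer, queue);
    return 0;
}
```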

krasznaa added a commit to krasznaa/vecmem that referenced this issue Feb 2, 2023
…ory.

This is because of intel/llvm#7252, which
identified certain atomic operations not working with the HIP backend
of Intel's LLVM compiler. But only when using shared memory.
@krasznaa
Contributor Author

krasznaa commented Feb 2, 2023

It took me a bit to get around to testing it, but you're absolutely right. It is indeed only with shared USM that I see this issue with certain atomic operations. When using "plain" device memory, our unit tests succeed also on AMD hardware. 😉

@hdelan
Contributor

hdelan commented Feb 2, 2023

I now have access to a MI210 and can replicate this, so will investigate!

@zjin-lcf
Contributor

@hdelan
Have you found the cause?

@hdelan
Contributor

hdelan commented Feb 28, 2023

Hi @zjin-lcf I have been working on some higher priority stuff recently. I will come back to this as soon as possible!

@hdelan
Contributor

hdelan commented Feb 28, 2023

Hi @krasznaa, just had a look at this again. I think the problem is the address space you construct the atomic_ref with. malloc_shared allocations should use the generic_space, not the global_space, since they are not wholly resident in device global memory. This has fixed the issue for me. Please confirm that it works for you.
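For concreteness, a sketch of this suggested change against the reproducer at the top of this issue; only the address-space template parameter differs, the rest of the program stays the same:

```cpp
#include <CL/sycl.hpp>

// Sketch: for malloc_shared allocations, declare the atomic reference type
// with the generic address space instead of global_space.
using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                    sycl::memory_scope::device,
                                    sycl::access::address_space::generic_space>;
```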

@jinz2014
Contributor

Thank you for looking into it. I am confused about the address-space change, since people learn to use an atomic reference in the following way when atomics are performed on device global memory.

  using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                       sycl::memory_scope::device,
                                       sycl::access::address_space::global_space>;

@hdelan
Contributor

hdelan commented Feb 28, 2023

Yes, this is the correct way to do it if we are using device global memory. However, device global memory is allocated using malloc_device. malloc_shared uses a CUDA-driver/HIP-runtime abstraction, hipMallocManaged, that allows the allocation to be accessible both in device global memory and in host memory. Therefore global memory is used by a malloc_shared allocation, but it is not just a global memory allocation.

When we construct an atomic_ref with AS == global_space, we must be overriding the mechanism that makes changed sections of memory migrate back to the host before they are read. With usual reads and writes to managed memory, there is an additional write to a flag variable that lets host memory know the memory it holds has been invalidated, meaning a memcpy is needed before the allocation is read on another device or on the host. I hypothesize that the atomic_ref with the global address space does not know it has to write to this flag variable, which would invalidate the memory on the host. This may allow better performance for normal global atomics when managed memory is not used (managed memory has very poor performance because of this migration method). At the moment, the way to work around this is to use the generic address space, which makes no assumptions about the address space of pointers. The HIP API implicitly uses the generic address space as well, which is why we don't have to worry about this problem in HIP.

The best change would be simply to change the atomic_ref address space to generic_space, which does not interfere with the migration mechanisms of the hipMallocManaged API.

@jinz2014
Contributor

Thank you for your explanation!

@hdelan
Contributor

hdelan commented Mar 1, 2023

To clarify a bit more: hipcc automatically defaults to the generic AS in a lot of cases. See a simple atomic_sub call here:

; Function Attrs: convergent mustprogress norecurse nounwind
define protected amdgpu_kernel void @_Z11test_atomicPf(float addrspace(1)* %0) #5 {
  %2 = alloca float*, align 8, addrspace(5)
  %3 = alloca float*, align 8, addrspace(5)
  %4 = addrspacecast float* addrspace(5)* %2 to float**
  %5 = addrspacecast float* addrspace(5)* %3 to float**
  %6 = addrspacecast float addrspace(1)* %0 to float*
  store float* %6, float** %4, align 8
  %7 = load float*, float** %4, align 8, !tbaa !4
  store float* %7, float** %5, align 8, !tbaa !4
  %8 = load float*, float** %5, align 8, !tbaa !4
  %9 = call contract float @_Z9atomicSubPff(float* %8, float 1.000000e+00) #20
  ret void
}

We see the arg %0 which is in global AS (addrspace(1)) gets cast to generic AS (no addrspace annotation) here:

  %6 = addrspacecast float addrspace(1)* %0 to float*

In SYCL, because we explicitly use the global AS when constructing the atomic_ref, this cast does not happen (which may also be why global atomics are sometimes faster in SYCL than in HIP). We could make the global space default to the generic space for the HIP backend of DPC++, which would fix this edge-case bug, but it would also likely degrade the performance of other atomic ops on the global AS (ones not using a malloc_shared allocation). I am not sure it would be easy to change the spec to say that atomic_ref should use the generic_space template parameter when a malloc_shared allocation is used, but I think in this case that is what we have to do to avoid degrading performance for global atomics.

@ldrumm
Contributor

ldrumm commented Apr 12, 2023

I've just reproduced this on a W6800 (gfx1030) and will discuss with my colleague @hdelan to find the right fix. I'll update this ticket with my progress.

@ldrumm
Contributor

ldrumm commented Apr 14, 2023

Okay, this is going to be a lot of fun. The compiler is generating almost identical code for fetch_sub as for fetch_add, save for the instruction mnemonic, yet fetch_add works as expected. I'm going to read the RDNA2 ISA manual to see whether there's some weird dependency the compiler isn't aware of, to confirm that the instruction flags are right, and that the instruction itself is suitable for the data.

--- ./sycl-atomics-sycl-amdgcn-amd-amdhsa-gfx1030-1d7b42.s
+++ ./sycl-atomics-sycl-amdgcn-amd-amdhsa-gfx1030-795b4d.s
@@ -27,7 +27,7 @@
        v_mov_b32_e32 v2, 5
        v_add_co_u32 v0, vcc_lo, s6, v0
        v_add_co_ci_u32_e32 v1, vcc_lo, s7, v1, vcc_lo
-       global_atomic_add v[0:1], v2, off
+       global_atomic_sub v[0:1], v2, off
 .LBB0_2:                                ; %_ZNK4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlS4_E_EclES4_.exit
        s_endpgm
        .section        .rodata,#alloc
@@ -135,7 +135,7 @@
        v_mov_b32_e32 v2, 5
        v_add_co_u32 v0, vcc_lo, s6, v0
        v_add_co_ci_u32_e32 v1, vcc_lo, s7, v1, vcc_lo
-       global_atomic_add v[0:1], v2, off
+       global_atomic_sub v[0:1], v2, off
 .LBB1_2:                                ; %_ZNK4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlS4_E_EclES4_.exit
        s_endpgm
        .section        .rodata,#alloc
@@ -231,7 +231,7 @@
        v_lshlrev_b64 v[2:3], 2, v[2:3]
        v_add_co_u32 v2, vcc_lo, s4, v2
        v_add_co_ci_u32_e32 v3, vcc_lo, s5, v3, vcc_lo
-       global_atomic_add v[2:3], v0, off
+       global_atomic_sub v[2:3], v0, off
        s_endpgm
        .section        .rodata,#alloc
        .p2align        6, 0x0
@@ -331,7 +331,7 @@
        v_lshlrev_b64 v[0:1], 2, v[0:1]
        v_add_co_u32 v0, vcc_lo, s4, v0
        v_add_co_ci_u32_e32 v1, vcc_lo, s5, v1, vcc_lo
-       global_atomic_add v[0:1], v5, off
+       global_atomic_sub v[0:1], v5, off
        s_endpgm
        .section        .rodata,#alloc
        .p2align        6, 0x0

@zjin-lcf
Contributor

@ldrumm atomic_xor may be included in your investigation.

@jinz2014
Contributor

jinz2014 commented May 9, 2023

@ldrumm Any findings you could describe?

@ldrumm
Contributor

ldrumm commented May 10, 2023

Yes. This has been a bit of a rollercoaster!

My investigations initially took me down the ISA path, as hinted at above, but I've now ruled out hardware and compiler bugs (I was this close to sending mail to AMD proudly declaring an ISA bug at one point). Luckily for all of us, this looks like a synchronization problem with the SYCL runtime, and is therefore fixable.

It's still not clear to me why the global_atomic_add seems to synchronize fine, yet the global_atomic_sub does not, but the following example should hopefully reproduce it for you:

#include <iostream>
#include <memory>

#include <CL/sycl.hpp>
#define __HIP_PLATFORM_AMD__
#include <hip/hip_runtime_api.h>

using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                                    sycl::memory_scope::device,
                                    sycl::access::address_space::global_space>;

static constexpr std::size_t BUFFER_SIZE = 1;
static constexpr std::size_t N = BUFFER_SIZE * sizeof(int);

int main()
{
  sycl::queue queue;
  std::cerr << "Using device: "
            << queue.get_device().get_info<sycl::info::device::name>() << '\n';

  int* buffer = static_cast<int*>(sycl::malloc_shared(N, queue));

  for (size_t i = 0; i < BUFFER_SIZE; ++i)
    buffer[i] = 5;

  // FIXME Why does this work? Do we need to synchronize like this in the SYCL
  // runtime?
  hipMemPrefetchAsync(buffer, N, 0);
  queue
    .submit([buffer](sycl::handler& h) {
      h.single_task<class atomic_test>([buffer]() {
        atomic_ref a(buffer[0]);
        a.fetch_sub(1);
      });
    })
    .wait_and_throw();

  for (std::size_t i = 0; i < BUFFER_SIZE; ++i)
    std::cerr << "buffer[" << i << "] = " << buffer[i] << '\n';

  sycl::free(buffer, queue);
}

Note that commenting out the hipMemPrefetchAsync call makes it fail again.
I should have a fix soon.

@ldrumm
Contributor

ldrumm commented May 10, 2023

@ldrumm atomic_xor may be included in your investigation.

Reproduced. Also fixed by the mysterious prefetch.

@jinz2014
Contributor

I see. There are some bugs in the SYCL runtime. When the original SYCL example is converted to a HIP program, the HIP program executes successfully. Thank you for the example!

ldrumm added a commit to ldrumm/llvm that referenced this issue Jun 9, 2023
This change is necessary to workaround a delightful bug in either HIP
runtime, or the HIP spec.

It's discussed at length in github.com/intel/issues/7252 but for
the purposes of this patch, it suffices to say that a call to
`hipMemPrefetchAsync` is *required* for correctness in the face of
global atomic operations on (*at least*) shared USM allocations.

The architecture of this change is slightly strange on first sight in
that we reduntantly track allocation information in several places.
The context now keeps track of all USM mappings. We require a mapping of
pointers to the allocated size, but these allocations aren't pinned to
any particular queue or HIP stream.
The `hipMemPrefetchAsync`, however, requires the associated HIP
stream object, and the size of the allocation. The stream comes
hot-off-the-queue *only* just before a kernel is launched, so we need to
defer the prefetch until we have that information.

Finally, the kernel itself keeps track of pointer arguments in a more
accessible way so we can determine which of the kernel's pointer
arguments do, in-fact, point to USM allocations.
@ldrumm
Contributor

ldrumm commented Jun 9, 2023

If you permit me, I'd like to spin a yarn and document my learning experience
here. It's been a bit of a fun tour, and I've learned a fair bit. Long story
short: we need a prefetch in HIP, ROCm, or the SYCL runtime to be correct.

A HIP reproducer

Here is a HIP program that manifests the same problem as the SYCL kernel.

This is based on the rocm5.5.x branch running on a W6800 gfx1030
card on a Linux 6.1.0-1006-oem #6-Ubuntu SMP PREEMPT_DYNAMIC kernel.

Note that it uses __hip_atomic_fetch_sub (which I only recently sent to
upstream llvm). To recreate without having to cherry-pick that
patch, you can also use __hip_atomic_fetch_xor in the kernel body.

#include <cassert>
#include <cstdio>

#include "hip/hip_runtime.h"

__global__ void test_natural_sub(int* data, int x, int y)
{
  __hip_atomic_fetch_add(data, x, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_SYSTEM);
  __hip_atomic_fetch_sub(data, y, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_SYSTEM);
}

#define N (1) /* number of elements; assumed, as in the single-value SYCL reproducer */
#define X (5)
#define Y (11)
#define INIT (3)

#define log(s) fprintf(stderr, "file %s: line %d: %s\n", __FILE__, __LINE__, s)

int main(int argc, char** argv)
{
  hipDeviceProp_t props;
  hipGetDeviceProperties(&props, 0);
  assert(props.managedMemory);
  printf("> GPU device has %d Multi-Processors, SM %d.%d compute capabilities\n\n",
         props.multiProcessorCount, props.major, props.minor);
  // Allocate managed memory of `N` items, initialize it with `INIT`, and
  // execute our simple kernel
  int* data = nullptr;
  if (hipMallocManaged(&data, N * sizeof *data, hipMemAttachGlobal) !=
      hipSuccess)
    abort();
  fprintf(stderr, "ptr:%p\n", data);
  assert(data);
  hipStream_t s;
  hipStreamCreate(&s);
  for (size_t i = 0; i < N; ++i)
    data[i] = INIT;

  log("before prefetch");
  // XXX COMMENT ME OUT!
  hipMemPrefetchAsync(data, N * sizeof *data, 0);
  log("after prefetch");

  log("before launch");
  hipLaunchKernelGGL(test_natural_sub, 1, 1, 0, s, data, X, Y);
  log("after launch");

  log("before synchronize");
  hipDeviceSynchronize();
  log("after synchronize");

  putchar('{');
  for (size_t i = 0; i < N; ++i)
    printf("%s%d (%d)", i ? ", " : "", data[i], INIT + X - Y);
  puts("}");
  hipFree(data);
  hipDeviceReset();
}

Commenting out the hipMemPrefetchAsync before the kernel launch also manifests
the same problem we've seen in the SYCL program that spawned this ticket: the
addition happens, but the subtraction doesn't appear to do anything (again, xor
is equally broken).

Compiling with hipcc gives the following assembly for the GPU kernel (annotated
by me):

_Z16test_natural_subPiii:               ; @_Z16test_natural_subPiii
.Lfunc_begin0:
	s_load_dwordx4 s[0:3], s[4:5], 0x0 ; Load scalar arguments into s[0:3]
	v_mov_b32_e32 v0, 0 ; Set an offset register to zero
	s_waitcnt lgkmcnt(0) ; instruction ordering barrier - ensures argument loading has completed
	v_mov_b32_e32 v1, s2 ; move `x`, into `v1`
	v_mov_b32_e32 v2, s3 ; move `y`, into `v2`
	s_clause 0x1 ; ensure ordering of the following instruction happens as written
	global_atomic_add v0, v1, s[0:1] ; atomically add v1 to the value at address (s[0:1] + v0)
	global_atomic_sub v0, v2, s[0:1] ; atomically subtract v2 from the value at address (s[0:1] + v0)
	s_endpgm ; return

and here's the hsa kernel descriptor:

.amdhsa_kernel _Z16test_natural_subPiii
    .amdhsa_group_segment_fixed_size 0
    .amdhsa_private_segment_fixed_size 0
    .amdhsa_kernarg_size 16
    .amdhsa_user_sgpr_count 6
    .amdhsa_user_sgpr_private_segment_buffer 1
    .amdhsa_user_sgpr_dispatch_ptr 0
    .amdhsa_user_sgpr_queue_ptr 0
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_user_sgpr_dispatch_id 0
    .amdhsa_user_sgpr_flat_scratch_init 0
    .amdhsa_user_sgpr_private_segment_size 0
    .amdhsa_wavefront_size32 1
    .amdhsa_uses_dynamic_stack 0
    .amdhsa_system_sgpr_private_segment_wavefront_offset 0
    .amdhsa_system_sgpr_workgroup_id_x 1
    .amdhsa_system_sgpr_workgroup_id_y 0
    .amdhsa_system_sgpr_workgroup_id_z 0
    .amdhsa_system_sgpr_workgroup_info 0
    .amdhsa_system_vgpr_workitem_id 0
    .amdhsa_next_free_vgpr 3
    .amdhsa_next_free_sgpr 6
    .amdhsa_reserve_vcc 0
    .amdhsa_reserve_flat_scratch 0
    .amdhsa_float_round_mode_32 0
    .amdhsa_float_round_mode_16_64 0
    .amdhsa_float_denorm_mode_32 3
    .amdhsa_float_denorm_mode_16_64 3
    .amdhsa_dx10_clamp 1
    .amdhsa_ieee_mode 1
    .amdhsa_fp16_overflow 0
    .amdhsa_workgroup_processor_mode 1
    .amdhsa_memory_ordered 1
    .amdhsa_forward_progress 0
    .amdhsa_shared_vgpr_count 0
    .amdhsa_exception_fp_ieee_invalid_op 0
    .amdhsa_exception_fp_denorm_src 0
    .amdhsa_exception_fp_ieee_div_zero 0
    .amdhsa_exception_fp_ieee_overflow 0
    .amdhsa_exception_fp_ieee_underflow 0
    .amdhsa_exception_fp_ieee_inexact 0
    .amdhsa_exception_int_div_zero 0
.end_amdhsa_kernel

To me this all looks correct. The arguments are loaded into vector registers,
and the atomic instructions operate on them with forced ordering. The CP waits
until the s_endpgm instruction which then triggers an interrupt to the system.

The atomic instructions add/sub to the address in s[0:1] with offset v0 (which
holds zero).

So:

  • The schedule appears correct
  • The opcodes and operands appear correct
  • There are no undefined bits set to nonzero in the encoding (checked with a
    hex editor and in TableGen against the RDNA2 ISA manual)
  • There is an ordering clause guarding the atomic operations to enforce a
    correct ordering.
  • The kernel descriptor doesn't seem to do anything crazy
    (.amdhsa_user_sgpr_flat_scratch_init should be zero on gfx1030)

As I said before: I think we can rule out compiler bugs here as this looks fine.

GPU Runtime

If you dig into the strace logs, and the Linux driver, you'll see that the
prefetch isn't just an optimization, it's actually required for correctness.
The strace log shows that the prefetch causes a set of extra events to be queued
to the kernel driver:

[...]
4036 1863090 ioctl(3, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, 0x7fff5ada9030 <unfinished ...>
4037 1863091 ioctl(3, AMDKFD_IOC_WAIT_EVENTS, 0x7fabc913e760 <unfinished ...>
4038 1863090 <... ioctl resumed>)            = 0
4039 1863090 ioctl(3, AMDKFD_IOC_MAP_MEMORY_TO_GPU, 0x7fff5ada8fb0) = 0
4040 1863090 ioctl(3, AMDKFD_IOC_CREATE_EVENT, 0x7fff5ada9240) = 0
[...] AMDKFD_IOC_CREATE_EVENT repeats
4071 1863090 ioctl(3, AMDKFD_IOC_CREATE_EVENT, 0x7fff5ada9240) = 0
4072 1863090 ioctl(3, AMDKFD_IOC_SET_EVENT, 0x7fff5ada9300) = 0
4073 1863091 <... ioctl resumed>)            = 0
4074 1863091 ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x4b, 0x20, 0x20), 0x7fabc913e8b0) = 0
4075 1863091 ioctl(3, AMDKFD_IOC_SET_EVENT, 0x7fabc913e8b0 <unfinished ...>
4076 1863090 write(2, "file hipSimpleAtomicsTest.cpp: line 42: after prefetch\n", 55 <unfinished ...>
[...]

HIP uses the ROCm runtime, which communicates with the Linux kernel via libhsa
and libdrm. Intercepting the syscalls shows extra ioctls during kernel launch
for the version with the hipMemPrefetchAsync. The fact that these extra
syscalls only occur at the actual launch is slightly confounding: the prefetch
doesn't happen in the kernel driver until a compute kernel is queued to the GPU
via the DRM interface. That's clearly a good optimization to make in the
userspace queue to avoid unnecessary syscalls, but it made my debugging a
little harder!

Stepping up a little from the DRM layer you can see this difference with more
clarity by setting...

$ export AMD_LOG_LEVEL=4

...before launching the hip program and looking at what HSA is doing

HSA

Here is an excerpt from two such log runs (the full with-prefetch and
no-prefetch logs are attached below). The excerpt is from the above program
with the prefetch, which occurs just before kernel dispatch to the system
driver. I've edited the raw logs to remove timing information and replaced
pointer values with IDs so we can focus less on their values and more on the
semantics. Diffing them is instructive, but too verbose for this writeup, so
I'll just include the salient bit and leave the curious to look at the
attached files...

:4:rocvirtual.cpp           :1014: : [tid:ptr0] HWq=ptr26, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=ptr24
:4:command.cpp              :349 : : [tid:ptr0] Command (Marker) enqueued: ptr27
:3:rocvirtual.cpp           :458 : : [tid:ptr0] Set Handler: handle(ptr28), timestamp(ptr29)
:4:rocvirtual.cpp           :1014: : [tid:ptr0] HWq=ptr30, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[ptr24, 0x0, 0x0, 0x0, 0x0], completion_signal=ptr28

To understand what's going on here, we need to understand the HSA runtime model,
which is handled in HIP/ROCm in the VirtualGPU class.
The VirtualGPU device in ROCm has a queue, and a single barrier instance that
is reused. It is an hsa_barrier_and_packet_t, a simple structure that contains
a list of dependent signals, a header, and a completion signal member.

The barrier is set with a signal, and queued to the device:

    header = 0x1503 (
        type = HSA_PACKET_TYPE_BARRIER_AND,
        barrier = true,
        acquire_fence_scope = HSA_FENCE_SCOPE_SYSTEM,
        release_fence_scope = HSA_FENCE_SCOPE_SYSTEM
    )

The barrier, acquire, release, and type fields are specified in the
HSA spec in section 2.9:

  • barrier: If set, then processing of the packet will only begin when all
    preceding packets are complete.
  • acquire_fence_scope: Determines the scope and type of the acquire memory fence
    operation for a packet
  • release_fence_scope: Determines the scope and type of the memory fence
    operation applied after kernel completion but before the packet is completed

And the values:

HSA_FENCE_SCOPE_SYSTEM:

The fence is applied across both agent and system scope for the global segment

Thus, without this fence packet being queued to the driver before execution of
the kernel, the memory semantics are not configured correctly on the device:
atomic operations are not guaranteed to be defined at system scope, and the
kernel doesn't appear to execute correctly.

The queue here belongs to RoCM and the HSA runtime appears to be correctly
queueing events.

RoCM

Given the above definitions combined with the logs, I think we can say that
without the barrier on the kernel execution, we can't rely on the hardware
correctly configuring the memory subsystem to properly fence load/store
instructions. Thus, without the hipMemPrefetchAsync, the HSA/ROCm level won't
do the right thing.

So this appears to be either a HIP bug or a dpc++ runtime bug.

HIP

There are some high-level questions with the HIP runtime and how it interacts
with the ROC/HSA driver semantics:

What guarantees is hipMallocManaged supposed to have?

The existing AMD documentation states that the function

"Allocates memory that will be automatically managed by HIP".

That's not a lot to go on, so we can look at its CUDA namesake:

"Allocates memory that will be automatically managed by the Unified Memory
system."

Again - there's not a lot of clarity about what "automatically managed" means.

The HIP test suite has atomic tests (atomicAdd, atomicSub, et al.) which expand
to the builtins used in the reproducer above. The hipMallocManaged tests use
prefetch instructions everywhere, yet the atomic tests don't check with managed
allocations. Given that it's test code, my guess is that AMD do expect users to
prefetch, but the documentation is inadequate.

The hipMallocManaged implementation does some questionable things too.
Its third argument is checked but ignored, and the internal implementation
allocates an SVM buffer with CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_ALLOC_HOST_POINTER. CL_DEVICE_SVM_ATOMICS is notable by its absence.
I'm not sure whether this is simply not required in AMD's SVM (USM)
implementation, but adding that flag doesn't seem to change the observed
behaviour.

A blog post by NVIDIA
implies that the semantics should be that memory allocated by
cudaMallocManaged does not need explicit prefetching, and that prefetching is
just an optimization. Another post linked therein supports this theory.

All this to say: the intended semantics are unclear for both HIP and CUDA, and the implementation we can read doesn't make them clear either. To cloud things further, hipGetPointerAttributes on a pointer returned by hipMallocManaged reports type = hipMemoryTypeHost and isManaged = 1, while the type hipMemoryTypeUnified carries a comment saying // not used currently. So there appears to be some confusion in the runtime about how these flags should be set, and about what is and isn't supported.

Why does this seem to work with addition, but reliably and repeatably fails with atomic subtraction or exclusive-or?

I honestly have no good idea. My assumption is that something at the
hardware/microcode level latches the memory controller for addition, but
somehow this doesn't work for sub/xor.
I'm not a hardware guy, and have no visibility into this. I suspect I'll never
know unless someone at AMD has insight they can share.

SYCL/dpc++ / a workaround

There's a draft patch here which should go into the new Unified Runtime. It
fixes the missing prefetch before kernel launch. In the event that AMD clarify
the semantics of hipMallocManaged we may later be able to remove this, but for
now it's required for correctness. Please check out this branch and rerun your tests.

amd_log_no_prefetch.txt
amd_log_with_prefetch.txt

[log files]: The logs are attached as amd_log_with_prefetch.txt amd_log_no_prefetch.txt

ldrumm added a commit to ldrumm/llvm that referenced this issue Aug 7, 2023
This change is necessary to workaround a delightful bug in either HIP
runtime, or the HIP spec.

It's discussed at length in github.com/intel/issues/7252 but for
the purposes of this patch, it suffices to say that a call to
`hipMemPrefetchAsync` is *required* for correctness in the face of
global atomic operations on (*at least*) shared USM allocations.

The architecture of this change is slightly strange on first sight in
that we redundantly track allocation information in several places.
The context now keeps track of all USM mappings. We require a mapping of
pointers to the allocated size, but these allocations aren't pinned to
any particular queue or HIP stream.
The `hipMemPrefetchAsync`, however, requires the associated HIP stream
object, and the size of the allocation. The stream comes
hot-off-the-queue *only* just before a kernel is launched, so we need to
defer the prefetch until we have that information.

Finally, the kernel itself keeps track of pointer arguments in a more
accessible way so we can determine which of the kernel's pointer
arguments do, in-fact, point to USM allocations.
aelovikov-intel pushed a commit that referenced this issue Aug 24, 2023
@ldrumm
Copy link
Contributor

ldrumm commented Aug 24, 2023

Fixed in #10430

This should be shipped in the next release or is available at top of tree. @krasznaa please reopen this ticket if it doesn't work for you

@ldrumm ldrumm closed this as completed Aug 24, 2023
veselypeta pushed a commit to veselypeta/llvm that referenced this issue Sep 28, 2023
szadam pushed a commit to szadam/unified-runtime that referenced this issue Oct 13, 2023
omarahmed1111 pushed a commit to omarahmed1111/unified-runtime that referenced this issue Oct 23, 2023