[SYCL][HIP] sycl::atomic_ref::fetch_sub does not work with USM #7252
I could reproduce the error with DPC++ version 2022-06.

Any progress on this? I still see this (buggy) behaviour with sycl-nightly/20230117. 😦
The device USM works. So all that …

Compiling with:

Run with:
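For reference, a typical DPC++ HIP-backend invocation (the file name and `--offload-arch` value here are assumptions, not the reporter's original commands) is `clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a repro.cpp -o repro`, after which the program can be pointed at the HIP backend with `SYCL_DEVICE_FILTER=hip ./repro`.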
I will make a change to the PI so that atomics cannot be used with a Device USM (using …)

OS: Ubuntu 20.04
😕 That is not what I see.
I didn't see any issue with atomic additions in our unit tests, only with the subtraction. I use:

Aha, that is interesting. Can you replicate the problem in HIP, or is it only a problem in SYCL?
Moreover:

So I maintain that the problem is somewhere in the Intel compiler. 😦
As I wrote at the beginning, we see this in the unit tests of one of our projects. But it's only fetch_sub that misbehaves.
It turns out that my GPU does not support …

And can you also confirm that this is only happening with fetch_sub?

@krasznaa In my experiment, fetch_xor fails as well.
I should have been more precise. As you can see in https://github.com/acts-project/vecmem/blob/main/tests/sycl/test_sycl_containers.sycl#L184-L216, we only actually test 4 atomic operations in our code at the moment. Out of these 4, only fetch_sub misbehaves.
I see. When printing the value in your kernel using sycl::stream, I observe that the value is not updated (i.e. still 100) for the atomic subtraction in the "shared" USM, even though the compiler generates an atomic sub instruction in the assembly. Comparing the HIP and SYCL versions of your program would help people understand the portability issue.
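For reference, a minimal sketch of that kind of in-kernel check (hypothetical code, assuming shared USM initialized to 100 as in the report):

```cpp
#include <CL/sycl.hpp>

int main() {
  sycl::queue q;
  int* buffer = sycl::malloc_shared<int>(1, q);
  buffer[0] = 100;
  q.submit([&](sycl::handler& h) {
     sycl::stream out(1024, 256, h);  // total buffer size, max statement size
     h.single_task<class print_after_sub>([=]() {
       sycl::atomic_ref<int, sycl::memory_order::relaxed,
                        sycl::memory_scope::device,
                        sycl::access::address_space::global_space>
           a(buffer[0]);
       a.fetch_sub(1);
       // On the affected HIP setup this reportedly still prints 100.
       out << "buffer[0] = " << buffer[0] << sycl::endl;
     });
   }).wait();
  sycl::free(buffer, q);
}
```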
It took me a bit to get around to testing it, but you're absolutely right. It is indeed only with shared USM that I see this issue with certain atomic operations. When using "plain" device memory, our unit tests succeed also on AMD hardware. 😉

I now have access to an MI210 and can replicate this, so will investigate!
@hdelan

Hi @zjin-lcf, I have been working on some higher-priority stuff recently. I will come back to this as soon as possible!
Hi @krasznaa, just had a look at this again. I think the problem is in the address space that you construct the atomic_ref with. Malloc shared allocations should use the generic address space (sycl::access::address_space::generic_space).
Thank you for taking a look. I am confused about the address-space change, since people learn to use atomic_ref in the following way when atomics are performed on device global memory:
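The snippet being referenced presumably looks like this (a sketch; the allocation and kernel here are illustrative, not the original code):

```cpp
#include <CL/sycl.hpp>

int main() {
  sycl::queue q;
  // Device global memory, allocated with malloc_device.
  int* counter = sycl::malloc_device<int>(1, q);
  q.fill(counter, 0, 1).wait();
  q.parallel_for(sycl::range<1>{64}, [=](sycl::id<1>) {
     // The commonly taught pattern: atomic_ref pinned to global_space.
     sycl::atomic_ref<int, sycl::memory_order::relaxed,
                      sycl::memory_scope::device,
                      sycl::access::address_space::global_space>
         ref(counter[0]);
     ref.fetch_add(1);
   }).wait();
  sycl::free(counter, q);
}
```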
Yes, this is the correct way to do it if we are using device global memory. However, device global memory is allocated using malloc_device, whereas here the allocation comes from malloc_shared. When we construct an atomic_ref with AS == global_space, we assert that the referenced object lives in the global address space, which may not hold for a shared allocation. The best change would be just to change the address space template parameter to generic_space.
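In other words, a sketch of the suggested change (everything except the address space stays the same):

```cpp
// Same atomic_ref, but with generic_space, so the pointer's address space is
// resolved at run time instead of being asserted to be global.
using generic_atomic_ref =
    sycl::atomic_ref<int, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::generic_space>;
```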
Thank you for your explanation!
To clarify a bit more:

We see the argument handled differently in the HIP and SYCL versions. In SYCL, because we explicitly use the global AS when constructing the atomic_ref, the atomic operation is emitted as a global-address-space instruction.
I've just reproduced this on a W6800 (gfx1030) and will discuss with my colleague @hdelan to find the right fix. I'll update this ticket with my progress.
Okay, this is going to be a lot of fun. The compiler is generating almost identical code, save for the instruction mnemonic of the atomic operation (add vs. sub):

```diff
--- ./sycl-atomics-sycl-amdgcn-amd-amdhsa-gfx1030-1d7b42.s
+++ ./sycl-atomics-sycl-amdgcn-amd-amdhsa-gfx1030-795b4d.s
@@ -27,7 +27,7 @@
v_mov_b32_e32 v2, 5
v_add_co_u32 v0, vcc_lo, s6, v0
v_add_co_ci_u32_e32 v1, vcc_lo, s7, v1, vcc_lo
- global_atomic_add v[0:1], v2, off
+ global_atomic_sub v[0:1], v2, off
.LBB0_2: ; %_ZNK4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlS4_E_EclES4_.exit
s_endpgm
.section .rodata,#alloc
@@ -135,7 +135,7 @@
v_mov_b32_e32 v2, 5
v_add_co_u32 v0, vcc_lo, s6, v0
v_add_co_ci_u32_e32 v1, vcc_lo, s7, v1, vcc_lo
- global_atomic_add v[0:1], v2, off
+ global_atomic_sub v[0:1], v2, off
.LBB1_2: ; %_ZNK4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlS4_E_EclES4_.exit
s_endpgm
.section .rodata,#alloc
@@ -231,7 +231,7 @@
v_lshlrev_b64 v[2:3], 2, v[2:3]
v_add_co_u32 v2, vcc_lo, s4, v2
v_add_co_ci_u32_e32 v3, vcc_lo, s5, v3, vcc_lo
- global_atomic_add v[2:3], v0, off
+ global_atomic_sub v[2:3], v0, off
s_endpgm
.section .rodata,#alloc
.p2align 6, 0x0
@@ -331,7 +331,7 @@
v_lshlrev_b64 v[0:1], 2, v[0:1]
v_add_co_u32 v0, vcc_lo, s4, v0
v_add_co_ci_u32_e32 v1, vcc_lo, s5, v1, vcc_lo
- global_atomic_add v[0:1], v5, off
+ global_atomic_sub v[0:1], v5, off
s_endpgm
.section .rodata,#alloc
.p2align 6, 0x0
```
@ldrumm atomic_xor may be worth including in your investigation.
@ldrumm Any findings you could describe?
Yes. This has been a bit of a rollercoaster! My investigations initially took me down the ISA path, as hinted at above, but I've now ruled out hardware and compiler bugs (I was this close to sending mail to AMD proudly declaring an ISA bug at one point). Luckily for all of us, this looks like a synchronization problem with the SYCL runtime, and is therefore fixable. It's still not clear to me why the prefetch in the following reproducer makes the difference:

```cpp
#include <iostream>
#include <memory>
#include <CL/sycl.hpp>
#define __HIP_PLATFORM_AMD__
#include <hip/hip_runtime_api.h>
using atomic_ref = sycl::atomic_ref<int, sycl::memory_order::relaxed,
sycl::memory_scope::device,
sycl::access::address_space::global_space>;
static constexpr std::size_t BUFFER_SIZE = 1;
static constexpr std::size_t N = BUFFER_SIZE * sizeof(int);
int main()
{
sycl::queue queue;
std::cerr << "Using device: "
<< queue.get_device().get_info<sycl::info::device::name>() << '\n';
int* buffer = static_cast<int*>(sycl::malloc_shared(N, queue));
for (size_t i = 0; i < BUFFER_SIZE; ++i)
buffer[i] = 5;
// FIXME Why does this work? Do we need to synchronize like this in the SYCL
// runtime?
hipMemPrefetchAsync(buffer, N, 0);
queue
.submit([buffer](sycl::handler& h) {
h.single_task<class atomic_test>([buffer]() {
atomic_ref a(buffer[0]);
a.fetch_sub(1);
});
})
.wait_and_throw();
for (std::size_t i = 0; i < BUFFER_SIZE; ++i)
std::cerr << "buffer[" << i << "] = " << buffer[i] << '\n';
sycl::free(buffer, queue);
}
```

Note that commenting out the `hipMemPrefetchAsync` call makes the incorrect behaviour reappear.
Reproduced. Also fixed by the mysterious prefetch.

I see. There are some bugs in the SYCL runtime. When the original SYCL example is converted to a HIP program, the HIP program executes successfully. Thank you for the example!
This change is necessary to work around a delightful bug in either the HIP runtime or the HIP spec. It's discussed at length in intel/llvm#7252, but for the purposes of this patch it suffices to say that a call to `hipMemPrefetchAsync` is *required* for correctness in the face of global atomic operations on (*at least*) shared USM allocations. The architecture of this change is slightly strange at first sight, in that we redundantly track allocation information in several places. The context now keeps track of all USM mappings. We require a mapping of pointers to the allocated size, but these allocations aren't pinned to any particular queue or HIP stream. The `hipMemPrefetchAsync`, however, requires the associated HIP stream object and the size of the allocation. The stream comes hot-off-the-queue *only* just before a kernel is launched, so we need to defer the prefetch until we have that information. Finally, the kernel itself keeps track of pointer arguments in a more accessible way, so we can determine which of the kernel's pointer arguments do, in fact, point to USM allocations.
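A rough sketch of the bookkeeping this describes (all names here are hypothetical, not the actual PI plugin code):

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

#include <hip/hip_runtime_api.h>

struct context_t {
  // All live USM allocations: base pointer -> allocation size in bytes.
  // Not tied to any queue or stream at allocation time.
  std::unordered_map<const void*, std::size_t> usm_mappings;
};

struct kernel_t {
  // Pointer arguments, recorded when arguments are set on the kernel.
  std::vector<const void*> pointer_args;
};

// Deferred prefetch: only now, just before launch, is the HIP stream known.
void prefetch_usm_args(const context_t& ctx, const kernel_t& kernel,
                       hipStream_t stream, int device) {
  for (const void* arg : kernel.pointer_args) {
    auto it = ctx.usm_mappings.find(arg);
    if (it != ctx.usm_mappings.end())
      (void)hipMemPrefetchAsync(it->first, it->second, device, stream);
  }
  // ... then enqueue the kernel itself on `stream`.
}
```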
If you permit me, I'd like to spin a yarn and document my learning experience.

A HIP reproducer

Here is a HIP program that manifests the same problem as the SYCL kernel. This is based on the rocm5.5.x branch, running on a W6800 (gfx1030). Note that it uses managed memory and the `__hip_atomic_fetch_*` builtins:

```cpp
#include <cassert>
#include "hip/hip_runtime.h"
__global__ void test_natural_sub(int* data, int x, int y)
{
__hip_atomic_fetch_add(data, x, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_SYSTEM);
__hip_atomic_fetch_sub(data, y, __ATOMIC_RELAXED, __HIP_MEMORY_SCOPE_SYSTEM);
}
#define N (1) /* number of elements; assumed value — missing from the paste */
#define X (5)
#define Y (11)
#define INIT (3)
#define log(s) fprintf(stderr, "file %s: line %d: %s\n", __FILE__, __LINE__, s)
int main(int argc, char** argv)
{
hipDeviceProp_t props;
hipGetDeviceProperties(&props, 0);
assert(props.managedMemory);
printf(
    "> GPU device has %d Multi-Processors, SM %d.%d compute capabilities\n\n",
    props.multiProcessorCount, props.major, props.minor);
// Allocate managed memory of `N` items, initialize it with `INIT`, and
// execute our simple kernel
int* data = nullptr;
if (hipMallocManaged(&data, N * sizeof *data, hipMemAttachGlobal) !=
    hipSuccess)
abort();
fprintf(stderr, "ptr:%p\n", data);
assert(data);
hipStream_t s;
hipStreamCreate(&s);
for (size_t i = 0; i < N; ++i)
data[i] = INIT;
log("before prefetch");
// XXX COMMENT ME OUT!
hipMemPrefetchAsync(data, N * sizeof *data, 0);
log("after prefetch");
log("before launch");
hipLaunchKernelGGL(test_natural_sub, 1, 1, 0, s, data, X, Y);
log("after launch");
log("before synchronize");
hipDeviceSynchronize();
log("after synchronize");
putchar('{');
for (size_t i = 0; i < N; ++i)
printf("%s%d (%d)", i ? ", " : "", data[i], INIT + X - Y)
puts("}");
hipFree(data);
hipDeviceReset();
}
```

Commenting out the `hipMemPrefetchAsync` call reproduces the broken result. Compiling this and disassembling, the generated ISA for the kernel is:

```asm
_Z16test_natural_subPiii:               ; @_Z16test_natural_subPiii
.Lfunc_begin0:
s_load_dwordx4 s[0:3], s[4:5], 0x0 ; Load scalar arguments into s[0:3]
v_mov_b32_e32 v0, 0 ; Set an offset register to zero
s_waitcnt lgkmcnt(0) ; instruction ordering barrier - ensures argument loading has completed
v_mov_b32_e32 v1, s2 ; move `x`, into `v1`
v_mov_b32_e32 v2, s3 ; move `y`, into `v2`
s_clause 0x1 ; ensure ordering of the following instruction happens as written
global_atomic_add v0, v1, s[0:1] ; atomically add v1 to the value at address (s[0:1] + v0)
global_atomic_sub v0, v2, s[0:1] ; atomically subtract v2 from the value at address (s[0:1] + v0)
s_endpgm                                ; return
```

and here's the HSA kernel descriptor:
To me this all looks correct. The arguments are loaded into vector registers, and the atomic instructions add/sub to the address held in s[0:1].
As I said before: I think we can rule out compiler bugs here, as this looks fine.

GPU Runtime

If you dig into the layers underneath: HIP uses the ROCm runtime, which communicates with libdrm and thus the Linux kernel's amdgpu driver. Stepping up a little from the DRM layer, you can see this difference with more verbose logging enabled before launching the HIP program and looking at what HSA is doing.

HSA

Here is an excerpt from two such log runs (a with_prefetch log and a without_prefetch log). The excerpt below is the interesting part:
To understand what's going on here, we need to understand the HSA runtime model. The barrier is set with a signal, and queued to the device:

The packet definitions:

And the values:

Thus, without the given fence packet being queued to the driver before the kernel is dispatched, the device is not guaranteed a coherent view of the host-initialized memory. The queue here belongs to ROCm, and the HSA runtime appears to be correctly submitting what it is asked to.

ROCm

Given the above definitions combined with the logs, I think we can say that the layers below HIP are behaving as requested. So this appears to be a HIP bug, or a dpc++ runtime bug.

HIP

There are some high-level questions with the HIP runtime and how it interacts with managed memory. What guarantees is `hipMemPrefetchAsync` supposed to provide?
Describe the bug
While testing our project (https://github.com/acts-project/vecmem) with the HIP backend of the compiler, I found a failure in the unit tests involving atomic operations.
The reason I wanted to report these results is that this seems to be the only remaining failure with this backend. 😄
To Reproduce
I made the following example to reproduce the issue in a standalone way:
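The example itself did not come through; a minimal sketch consistent with the description, and with the reproducer @ldrumm posts later in this thread:

```cpp
#include <iostream>
#include <CL/sycl.hpp>

int main() {
  sycl::queue q;
  // Shared USM, initialized on the host.
  int* buffer = sycl::malloc_shared<int>(1, q);
  buffer[0] = 100;
  q.submit([buffer](sycl::handler& h) {
     h.single_task<class fetch_sub_test>([buffer]() {
       sycl::atomic_ref<int, sycl::memory_order::relaxed,
                        sycl::memory_scope::device,
                        sycl::access::address_space::global_space>
           a(buffer[0]);
       a.fetch_sub(1);
     });
   }).wait_and_throw();
  // Expected 99; on the affected HIP setup the value stays 100.
  std::cout << "buffer[0] = " << buffer[0] << " (expected 99)\n";
  sycl::free(buffer, q);
  return 0;
}
```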
With this source, I see:
While other backends produce the correct results, like:
A couple of things to note:

- When using sycl::buffer for memory management, the issue did not show up; it seems to be limited to using USM. (See the sketch after this list.)
- fetch_add(...) does work, even with USM. It's fetch_sub(...) that notably doesn't work.
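A hypothetical sketch of the buffer-based variant that behaves correctly (not the project's actual test code):

```cpp
#include <CL/sycl.hpp>

int main() {
  sycl::queue q;
  int value = 100;
  {
    sycl::buffer<int, 1> buf(&value, sycl::range<1>{1});
    q.submit([&](sycl::handler& h) {
      sycl::accessor acc(buf, h, sycl::read_write);
      h.single_task<class buffer_fetch_sub>([=]() {
        sycl::atomic_ref<int, sycl::memory_order::relaxed,
                         sycl::memory_scope::device,
                         sycl::access::address_space::global_space>
            ref(acc[0]);
        ref.fetch_sub(1);
      });
    });
  }  // buffer destructor copies the result back to `value`
  // Expected: 99 — and this variant reportedly works on the HIP backend.
}
```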
Environment (please complete the following information):

- DPC++ version: 2022-09
Pinging @ivorobts.