[OMPT] Overlapping `device_num` when using multiple offloading architectures #65104

Thyre · 2023-08-30T08:48:28Z

Description

This issue was initially found during discussion of #64487. However, this should be handled as a separate issue, especially since the other one is closed now. This issue doesn't affect me directly, but there are users which might run into this issue, especially in mixed GPU systems.

When a user utilizes multiple architectures to offload to in his program (for example -fopenmp-targets=x86_64,nvptx64, the OMPT interface will receive callbacks for ompt_callback_device_initialize for each device used. While this works fine, we may encounter a case where the transferred device number for two different architectures overlap, causing issues for tools to differentiate the devices.

Users can select their device by the device number, for example

int main( void )
{
// x86_64 (four devices)
#pragma omp target device(0)
{...}
// nvptx64 (other devices)
#pragma omp target device(4)
{...}
}

This would translate to the following in the OMPT interface

Callback Init: device_num=0 type=generic-64bit
[...]
Callback Init: device_num=0 type=sm_80

Reproducer

This small code should show the issue. I have tested it with with nvptx64 + x86_64 and nvptx64 + amdgcn-amd-amdhsa, but it should also affect other combinations

#include <assert.h>
#include <stdio.h>
#include <omp-tools.h>


void callback_ompt_device_initialize(int device_num,
                                     const char *type,
                                     ompt_device_t *device,
                                     ompt_function_lookup_t lookup,
                                     const char *documentation)
{
    printf("[%s] device_num = %d | type = %s\n", __FUNCTION__, device_num, type);
}

static int
initialize_tool( ompt_function_lookup_t lookup,
                 int                    initialDeviceNum,
                 ompt_data_t*           toolData )
{
    ompt_set_callback_t set_callback =
        ( ompt_set_callback_t )lookup( "ompt_set_callback" );
    assert( set_callback != 0 );

    ompt_set_result_t registration_result = set_callback(ompt_callback_device_initialize, (ompt_callback_t) &callback_ompt_device_initialize);
    assert(registration_result == ompt_set_always);

    return 1;
}

static void
finalize_tool( ompt_data_t* toolData )
{
}
	
ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version, /* == _OPENMP */
                 const char*  runtime_version )
{
    static ompt_start_tool_result_t tool = { &initialize_tool,
                                             &finalize_tool,
                                             ompt_data_none };
    return &tool;
}


int main(void) {
    #pragma omp target device(0)
    {}

    #pragma omp target device(4)
    {}
}

Running the tool on a system with a single NVIDIA MX550, we can see the following output:

$ clang --version
clang version 18.0.0 (https://github.com/llvm/llvm-project.git 4b383107fa7585bb5ecd7f03cab7800b33d1585a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/software/software/LLVM/git/bin
$ clang -fopenmp -fopenmp-targets=x86_64,nvptx64 reproducer.c
$ ./a.out
[callback_ompt_device_initialize] device_num = 0 | type = generic-64bit
[callback_ompt_device_initialize] device_num = 0 | type = sm_75

Previous discussion

Here are a few related comments by @mhalk and me regarding this issue:

#64487 (comment)

Let's say we would go with "added support": I added prints for omp_get_device_num and omp_get_num_devices. Also forced the target regions to use device number 3 (and 11, for the two-target-version).

Output for x86_64 only:
trunk/bin/clang -fopenmp -fopenmp-targets=x86_64 veccopy-ompt-target.c -o veccopy-ompt-target
./veccopy-ompt-target

omp_get_device_num=4
omp_get_num_devices=4

Callback Init: device_num=3 type=unknown device=0x1485000 lookup=0x7f0bd39c8a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x200368 device_addr:(nil) bytes:17112
[...]
Callback Fini: device_num=3
Output for x86_64 and amdgcn-amd-amdhsa (8 gfx90a GPUs present):
trunk/bin/clang -fopenmp -fopenmp-targets=x86_64,amdgcn-amd-amdhsa veccopy-ompt-target.c -o veccopy-ompt-target
./veccopy-ompt-target

omp_get_device_num=12
omp_get_num_devices=12

Callback Init: device_num=3 type=gfx90a device=0x12ae980 lookup=0x7f441a241a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x200378 device_addr:(nil) bytes:19208
[...]
Callback Init: device_num=3 type=unknown device=0x12b52a0 lookup=0x7f441a241a90 doc=(nil)
Callback Load: device_num:3 module_id:0 filename:(null) host_adddr:0x204f00 device_addr:(nil) bytes:17112
[...]
Callback Fini: device_num=3
Callback Fini: device_num=3
So, in the sources, I intentionally used devices 3 and 11 for target regions. Which both correlate to device number 3 in their respective RTL. That is, to illustrate that there might be ambiguous output w.r.t. Fini callbacks.

#64487 (comment)

Judging by your output for both x86_64 and a combination of x86_64 + amdgcn, I would expect that nvptx64 + amdgcn yields the same results shown in your second example. I think that I can test this. For tool developers, the ambiguous output can be hard to work with, since we can only use the device_num to identify the executing device during target callbacks. The runtime should probably dispatch the same device number users can use to define the executing device (here 3 for the AMD GPU and 11 for the host).

I checked how the device numbers are delivered on a system with an AMD + NVIDIA GPU. We can see the same behavior:
$ clang-18 -fopenmp -fopenmp-targets=nvptx64,amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a veccopy.c
$ ./a.out
Num devices = 3
Device 0
Callback Init: device_num=0 type=sm_80 device=0x556d3f652a90 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:0 filename:(null) host_adddr:0x556d3ed12778 device_addr:(nil) bytes:715888
[...]
Device 1
Callback Init: device_num=0 type=gfx90a device=0x556d402203d0 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:0 filename:(null) host_adddr:0x556d3edc1478 device_addr:(nil) bytes:25064
[...]
Device 2
Callback Init: device_num=1 type=gfx90a device=0x556d40229210 lookup=0x7efd6a8027b0 doc=(nil)
Callback Load: device_num:1 filename:(null) host_adddr:0x556d3edc1478 device_addr:(nil) bytes:25064
[...]
Success
Callback Fini: device_num=0
Callback Fini: device_num=0
Callback Fini: device_num=1

#64487 (comment)

The runtime should probably dispatch the same device number users can use to define the executing device [...]

tl;dr: Agreed. Like that idea, albeit I'd have to see if this is possible. Maybe I'm not aware of some information but ATM I'm not very confident this can be (reasonably) solved on my end.

When the callbacks are dispatched, we only have the information from the corresponding RTL. To adhere to our example, that is e.g. DeviceId=3 -- while in the OpenMP runtime this might in fact be device number 11. Now the callback is executed, that's it -- for the init callback we could (during execution of the corresponding callback!) find out the "actual" (whatever that means) device number since we have the kind (amdgcn, nvptx, ...) and might be able to deduce further info. But other callbacks do not have this information.

The text was updated successfully, but these errors were encountered:

llvmbot · 2023-08-30T14:06:17Z

@llvm/issue-subscribers-openmp

Fixes: llvm#65104 When a user assigns devices to target regions it may happen that different identifiers will map onto the same id within different plugins. This will lead to situations where callbacks will become much harder to read, as ambiguous identifiers are reported. We fix this by collecting the index-offset upon general RTL initialization. Which in turn, allows to calculate the unique, user-observable device id.

Fixes: #65104 When a user assigns devices to target regions it may happen that different identifiers will map onto the same id within different plugins. This will lead to situations where callbacks will become much harder to read, as ambiguous identifiers are reported. We fix this by collecting the index-offset upon general RTL initialization. Which in turn, allows to calculate the unique, user-observable device id.

…#65595) Fixes: llvm#65104 When a user assigns devices to target regions it may happen that different identifiers will map onto the same id within different plugins. This will lead to situations where callbacks will become much harder to read, as ambiguous identifiers are reported. We fix this by collecting the index-offset upon general RTL initialization. Which in turn, allows to calculate the unique, user-observable device id.

github-actions bot added the new issue label Aug 30, 2023

EugeneZelenko added openmp and removed new issue labels Aug 30, 2023

dhruvachak assigned mhalk Aug 30, 2023

mhalk mentioned this issue Sep 7, 2023

[OpenMP][OMPT] Fix device identifier collision during callbacks #65595

Merged

mhalk closed this as completed in #65595 Sep 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[OMPT] Overlapping `device_num` when using multiple offloading architectures #65104

[OMPT] Overlapping `device_num` when using multiple offloading architectures #65104

Thyre commented Aug 30, 2023

llvmbot commented Aug 30, 2023

[OMPT] Overlapping device_num when using multiple offloading architectures #65104

[OMPT] Overlapping device_num when using multiple offloading architectures #65104

Comments

Thyre commented Aug 30, 2023

Description

Reproducer

Previous discussion

llvmbot commented Aug 30, 2023

[OMPT] Overlapping `device_num` when using multiple offloading architectures #65104

[OMPT] Overlapping `device_num` when using multiple offloading architectures #65104