[CI][HIP] Tracker ticket for flaky HIP CI issues #17464

Open
npmiller opened this issue Mar 14, 2025 · 1 comment
Labels
bug (Something isn't working), hip (Issues related to execution on HIP backend)

Comments

@npmiller
Contributor

Describe the bug

This ticket tracks all of the other tickets and disabled tests related to the flaky CI issues on the AMD runner.

It focuses specifically on runs in which one or more tests hang while one or more tests fail with a memory access fault.

The issue has been worked around by limiting the AMD CI to run on a single thread, so it should no longer occur. This ticket exists to investigate the root cause, close the related tickets, and re-enable the disabled tests once the underlying problem is understood.

Tests with memory access fault

Experimental/launch_queries/max_work_group_size.cpp
FreeFunctionCommands/mem_advise.cpp
Regression/commandlist/gpu.cpp
WeakObject/weak_object_expired.cpp
Reduction/reduction_nd_ext_half.cpp
Reduction/reduction_big_data.cpp
Basic/built-ins/vec_relational.cpp
Basic/built-ins/vec_math.cpp
SubGroup/sub_group_as.cpp
HostInteropTask/host-task-dependency3.cpp
SharedLib/use_with_dlopen_verify_cache.cpp

Tests hanging

WorkGroupMemory/basic_usage.cpp
syclcompat/memory/memory_management_test2_usmnone.cpp
SubGroup/reduce_spirv13.cpp
Adapters/retain_events.cpp
SubGroup/scan.cpp

List of related tickets

PR disabling related tests

Workaround PR with -j1

To reproduce

Environment

Additional context

No response

npmiller added the bug and hip labels on Mar 14, 2025
@npmiller
Contributor Author

Quick update: the crashes can be reproduced just by running AtomicRef/and_local.cpp many times in parallel.

It needs a bit more investigation but I believe I've also managed to reproduce it with a HIP application, so I suspect this isn't actually a SYCL problem.
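
For anyone who wants to try the "many copies in parallel" reproduction locally, here is a minimal sketch of a stress runner. It is not the script used for the investigation above. The function names `run_once` and `stress`, the copy count, and the timeout are all assumptions made for illustration, and the path to the compiled test binary depends on your build. A timeout is treated as a hang, and any non-zero exit (which would include a process aborted after a memory access fault) is counted as a failure.

```python
import subprocess
import sys
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd, timeout_s):
    """Run one copy of `cmd` and classify how it ended."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child once the timeout expires,
        # so a stuck process is reported as a hang instead of blocking us.
        return "hung"
    # A non-zero or signal exit code is counted as a failure.
    return "ok" if proc.returncode == 0 else "failed"


def stress(cmd, copies, timeout_s):
    """Launch `copies` instances of `cmd` concurrently and tally the outcomes."""
    with ThreadPoolExecutor(max_workers=copies) as pool:
        outcomes = pool.map(lambda _: run_once(cmd, timeout_s), range(copies))
        # Consume the results while the pool is still alive.
        return Counter(outcomes)


# Hypothetical usage (the binary path is an assumption, adjust to your build):
#   print(stress(["./and_local.out"], copies=32, timeout_s=120))
```

Comparing the tally for a single copy against the tally for a large number of copies should show whether the failures only appear under parallel load, which is what the `-j1` workaround suggests.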
