
Invalid queue handle on AMD GPUs using Mesa 23.3.3 and Mesa 24 #2465

Closed · Scrumplex opened this issue Feb 10, 2024 · 16 comments

@Scrumplex commented Feb 10, 2024

Template


  • Version of vulkano: 94f50f1
  • OS: NixOS using nixos-unstable
  • GPU (the selected PhysicalDevice): AMD Radeon RX 6800 XT (RADV NAVI21)
  • GPU Driver: Mesa 23.3.3
  • Upload of a reasonably minimal complete main.rs file that demonstrates the issue: TODO

Issue

Other users and I are experiencing segfaults when running release builds of https://github.com/galister/wlx-overlay-s on AMD GPUs with Mesa 23.3.3 or Mesa 24.0.0.

Interestingly, it works fine with debug builds.

On Mesa 23.3.3, the segmentation fault occurs in vk_common_QueueSubmit (_queue=0x0, submitCount=1, pSubmits=0x7ffffafd5ce0, fence=0x0) at ../src/vulkan/runtime/vk_synchronization2.c:294 (Mesa source), which shows that the queue handle is apparently 0x0.

After adding a simple debug message to wlx-overlay-s, we can confirm this difference in behavior between debug and release builds:

Debug:

```
[2024-02-10T23:10:14Z INFO  wlx_overlay_s::graphics] build_and_execute_now queue: Some(Queue { handle: 0x5555577e4f60, device: 0x5555578356f0 (instance: 0x555557779f40), flags: empty(), queue_family_index: 0, id: 0, state: Mutex { data: QueueState } })
```

Release with debug info:

```
[2024-02-10T23:12:07Z INFO  wlx_overlay_s::graphics] build_and_execute_now queue: Some(Queue { handle: 0x0, device: 0x55555656b4a0 (instance: 0x5555564af100), flags: empty(), queue_family_index: 0, id: 0, state: Mutex { data: QueueState } })
```

Cargo.toml:

```toml
...

[profile.release]
debug = true
```

Backtrace for cargo build --release using stripped Mesa 23.3.3:

```
#0  0x00007ffff5dca11a in vk_common_QueueSubmit () from /nix/store/j0wrj87k3q9r8r6nc7imcfg5phk9gyz9-mesa-23.3.3-drivers/lib/libvulkan_radeon.so
#1  0x0000555555e1b16e in vulkano::device::queue::QueueGuard::submit_unchecked (self=<optimized out>, submit_infos=..., fence=...) at src/device/queue.rs:1154
#2  0x0000555555e1dcaf in vulkano::device::queue::QueueGuard::submit (self=0x7fffffff3eb0, submit_infos=&[vulkano::command_buffer::SubmitInfo](size=1) = {...}, fence=...) at src/device/queue.rs:739
#3  vulkano::sync::future::queue_submit::{closure#0} (queue_guard=...) at src/sync/future/mod.rs:749
#4  vulkano::device::queue::Queue::with<core::result::Result<(), vulkano::Validated<vulkano::VulkanError>>, vulkano::sync::future::queue_submit::{closure_env#0}> (self=0x7fffffff4a60, func=...) at src/device/queue.rs:107
#5  vulkano::sync::future::queue_submit (queue=0x7fffffff4a60, submit_info=..., fence=..., future=...) at src/sync/future/mod.rs:749
#6  0x0000555555799f75 in vulkano::command_buffer::traits::{impl#2}::flush<vulkano::sync::future::now::NowFuture> (self=0x7fffffff4a50) at /home/scrumplex/.cargo/git/checkouts/vulkano-cb672043253a6e8d/94f50f1/vulkano/src/command_buffer/traits.rs:215
#7  0x00005555558ed5ad in wlx_overlay_s::graphics::WlxCommandBuffer::build_and_execute_now (self=...) at src/graphics.rs:939
#8  0x0000555555785364 in wlx_overlay_s::state::AppState::from_graphics (graphics=Arc(strong=1, weak=0) = {...}) at src/state.rs:35
#9  0x000055555577c957 in wlx_overlay_s::backend::openxr::openxr_run (running=...) at src/backend/openxr/mod.rs:59
#10 0x0000555555890b3c in wlx_overlay_s::auto_run (running=Arc(strong=3, weak=0) = {...}) at src/main.rs:61
#11 wlx_overlay_s::main () at src/main.rs:42
```
@Rua (Contributor) commented Feb 11, 2024

Very strange. This would imply that when Vulkano calls vkGetDeviceQueue to construct a queue object, the Vulkan driver is returning zero as the handle. I wonder why.
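
For reference, the underlying call amounts to roughly the following, sketched with ash rather than Vulkano's internals (the helper and its names are illustrative, not Vulkano's actual code); a null return at this point would produce exactly the 0x0 handle seen above:

```rust
use ash::vk;

// Illustrative probe, not Vulkano's actual code: fetch the queue handle the
// same way vkGetDeviceQueue is reached through ash.
unsafe fn fetch_queue(device: &ash::Device, queue_family_index: u32) -> vk::Queue {
    let queue = device.get_device_queue(queue_family_index, 0);
    // A conforming driver never returns a null handle for a queue that was
    // actually requested in VkDeviceCreateInfo.
    assert_ne!(queue, vk::Queue::null(), "driver returned a null queue handle");
    queue
}
```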

@Scrumplex (Author) commented

It should be noted that we use OpenXR here, which also gives us a queue handle (https://github.com/galister/wlx-overlay-s/blob/81168644166c270e43ae559b18799fd36375216b/src/graphics.rs#L254).
I am just not sure whether vulkano is supposed to allow this and let the driver cause a segfault, though.

@Rua (Contributor) commented Feb 11, 2024

It is definitely not correct to create more than one Vulkano object from the same handle! Vulkano objects will always assume that they are the sole owner of their handles, and will not take into account things that happen outside of their control.

I did find a possible source of problems in Vulkano's current code, for which I just opened #2466. Can you try out the Vulkano version in that PR and see if it fixes your issue?

@galister commented Feb 11, 2024

> It should be noted that we use OpenXR here, which also gives us a queue handle

OpenXR is not giving us a queue handle; it gives us a VkDevice handle, which we pass into vulkano::device::Device::from_handle, which then creates the queue.

It's done the same way as in index_camera_passthrough, though I'm not sure it's correct at all.
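
Roughly, that wrapping step looks like this (a minimal sketch with hypothetical names, not the exact wlx-overlay-s code):

```rust
use std::sync::Arc;
use vulkano::device::physical::PhysicalDevice;
use vulkano::device::{Device, DeviceCreateInfo, Queue, QueueCreateInfo};

// Hypothetical helper: wrap a VkDevice created by the OpenXR runtime in
// vulkano, declaring one queue in the given family. Safety: the caller must
// ensure the create info mirrors how the device was actually created.
unsafe fn wrap_external_device(
    physical_device: Arc<PhysicalDevice>,
    raw_device: ash::vk::Device,
    queue_family_index: u32,
) -> (Arc<Device>, Arc<Queue>) {
    let (device, mut queues) = Device::from_handle(
        physical_device,
        raw_device,
        DeviceCreateInfo {
            queue_create_infos: vec![QueueCreateInfo {
                queue_family_index,
                ..Default::default()
            }],
            ..Default::default()
        },
    );
    (device, queues.next().expect("exactly one queue was declared"))
}
```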

One notable mention from me: not all users see a segfault, and those who do only see it in release builds.

@Scrumplex (Author) commented

Building wlx-overlay-s with more conservative optimizations seems to work around this issue.

Adding the following to Cargo.toml:

```toml
[profile.release]
opt-level = 1
```

@marc0246 (Contributor) commented

I see that vulkano's DeviceCreateInfo is missing the queues field, unlike your ash create info. That violates the safety contract of Device::from_handle, so that's the first place I would look, though I don't know why it would lead to such a strange outcome.
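
To illustrate what matching would mean here (a hypothetical sketch; the ash side uses 0.37-style builders, and all names are made up):

```rust
use ash::vk;
use vulkano::device::{DeviceCreateInfo, QueueCreateInfo};

// Illustrative contrast: whatever queue the raw VkDeviceCreateInfo requested
// through ash must be mirrored in the DeviceCreateInfo later passed to
// Device::from_handle, or its safety contract is violated.
fn mirrored_create_info(family: u32, priorities: &[f32]) -> DeviceCreateInfo {
    // ash side: the queue actually requested at device creation.
    let _ash_side = vk::DeviceQueueCreateInfo::builder()
        .queue_family_index(family)
        .queue_priorities(priorities);

    // vulkano side: declare the same family and priorities.
    DeviceCreateInfo {
        queue_create_infos: vec![QueueCreateInfo {
            queue_family_index: family,
            queues: priorities.to_vec(),
            ..Default::default()
        }],
        ..Default::default()
    }
}
```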

@Rua (Contributor) commented Feb 12, 2024

If queues is empty when calling Device::from_handle, then the returned iterator of queues is also supposed to be empty. So then where is the OP getting Vulkano Queue objects from?

@marc0246 (Contributor) commented

I see that the safety contract of Instance::from_handle is violated for the same reason: the create infos must match.

I also see that you load the Vulkan library using both ash (Entry::load) and vulkano (VulkanLibrary::new()). That's going to result in two copies of the library being loaded, with different function pointers. You must instead load the library on only one side and pass the vkGetInstanceProcAddr function pointer when creating it on the other, as sketched below.
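
A sketch of that single-loader setup (hypothetical glue code; the exact Loader trait signature may differ between vulkano versions):

```rust
use std::ffi::c_char;
use std::sync::Arc;
use vulkano::library::{Loader, VulkanLibrary};

// Hypothetical wrapper: forwards vkGetInstanceProcAddr from the Entry that
// ash loaded, so ash and vulkano resolve the same function pointers instead
// of each loading its own copy of the Vulkan library.
struct AshLoader(ash::Entry);

unsafe impl Loader for AshLoader {
    fn get_instance_proc_addr(
        &self,
        instance: ash::vk::Instance,
        name: *const c_char,
    ) -> ash::vk::PFN_vkVoidFunction {
        unsafe { (self.0.static_fn().get_instance_proc_addr)(instance, name) }
    }
}

fn shared_library(entry: ash::Entry) -> Arc<VulkanLibrary> {
    VulkanLibrary::with_loader(AshLoader(entry)).expect("failed to wrap ash's loader")
}
```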

@galister commented

Thanks so much for taking the time! I've fixed both the double-library issue and the DeviceCreateInfo, but we're still seeing the same segfault behavior with opt-level > 1.
I'm going to dig some more and will let you know if I find something.

@marc0246 (Contributor) commented

If optimizations play a role, that generally smells like (Rust) UB, most commonly a use-after-free (UAF).

@marc0246 (Contributor) commented

Did it work? It's a bit of a footgun that everything in ash is Copy, since these UAFs are so easy to introduce unwittingly.

@galister commented

Waiting for @Scrumplex to confirm.
For me it never segfaults, so testing has been a bit of a pain.

@galister commented

@marc0246 That seems to have done the trick. Thanks for the truckful of wisdom; I am eternally grateful.

@marc0246 (Contributor) commented

That's great to hear!

marc0246 closed this as not planned on Feb 12, 2024
@yshui (Contributor) commented Feb 28, 2024

This is the UAF, I think: galister/wlx-overlay-s@800e4dd/src/graphics.rs#L261

Errr, why is this a UAF?

Ooo, ash converts & into pointers internally!? Just throw away the lifetime. This is madness.
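
For context, the general shape of that footgun looks like this (an illustrative sketch using ash 0.37-era builders, not the exact wlx-overlay-s code):

```rust
use ash::vk;

// The returned struct is plain data (Copy), so nothing stops it from
// outliving the slice it points into.
fn dangling_queue_info() -> vk::DeviceQueueCreateInfo {
    let priorities = [1.0_f32];
    vk::DeviceQueueCreateInfo::builder()
        .queue_family_index(0)
        .queue_priorities(&priorities) // stores a raw pointer into `priorities`
        .build() // discards the builder's borrow
    // `priorities` is dropped when this function returns, so the struct's
    // p_queue_priorities pointer dangles. In release builds the optimizer is
    // free to reuse that stack memory, which is how pointed-to data ends up
    // as garbage or zero.
}
```

Later ash releases attach lifetimes to these generated structs, which turns this pattern into a compile error.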
