
Capture SP task dumps in support bundles #8177


Open
wfchandler wants to merge 5 commits into main from wc/bundle-sp-dumps

Conversation

wfchandler (Contributor) commented:

Update the support bundle collector to capture task dumps from the SPs.

@wfchandler wfchandler force-pushed the wc/bundle-sp-dumps branch from 7380a0e to a4b0acf Compare May 16, 2025 14:49
@wfchandler wfchandler force-pushed the wc/bundle-sp-dumps branch from a4b0acf to dbb6669 Compare May 18, 2025 23:05
@wfchandler (Contributor, Author) commented:

Tested on Dublin:

# Create a task dump
root@oxz_switch0:/tmp# /tmp/humility-0.12.0-probless -a /tmp/build-gimlet-image-e.zip --ip fe80::aa40:25ff:fe04:604%gimlet15 dump --task control_plane_agent
humility: connecting to fe80::aa40:25ff:fe04:604%51
humility: using UDP dump agent
humility: pulled 10.12KB in 0 seconds
humility: dumping to "hubris.core.control_plane_agent.0"
humility: dumped 661.38KB in 0 seconds

# Confirm the dump is present on the SP
root@oxz_switch0:/tmp# /tmp/humility-0.12.0-probless -a /tmp/build-gimlet-image-e.zip --ip fe80::aa40:25ff:fe04:604%gimlet15 dump -l
humility: connecting to fe80::aa40:25ff:fe04:604%51
humility: using UDP dump agent
AREA TASK                  TIME       SIZE
   0 control_plane_agent   12942213   10368

# Contents of original task dump
root@oxz_switch0:/tmp# /tmp/humility-0.12.0-probless -d hubris.core.control_plane_agent.0 tasks
humility: attached to dump
system time = 12942213
ID TASK                       GEN PRI STATE    
19 control_plane_agent          0   7 recv, notif: usart-irq(irq37) socket timer

# Create a support bundle
root@oxz_switch0:/tmp# omdb -w nexus sb create
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:103::4]:12221
created support bundle: 39bff2f6-15c2-4f6e-93ba-c832380816e6

# Download the dehydrated task dump from the bundle
root@oxz_switch0:/tmp# omdb nexus sb get-file 39bff2f6-15c2-4f6e-93ba-c832380816e6 sp_task_dumps/sled_15/dump-0.zip > /tmp/dump-0.zip

# Rehydrate the dump from the archive
root@oxz_switch0:/tmp# /tmp/humility-0.12.0-probless --archive /tmp/build-gimlet-image-e.zip hydrate dump-0.zip 
humility: read dehydrated crash dump
humility:   task index: 19
humility:   crash time: 12942213
humility:   archive id: [d5, 3a, 8f, a4, c9, b5, 6f, 91]
humility:   board:      gimlet-e
humility:   git commit: 6edf9b5e6aa5c928a5462bda1f7a4c6f3caa40ab
humility:   version:    1.0.37
humility:   2 memory regions:
humility:     0x24001108: 168 bytes
humility:     0x24030000: 32768 bytes
humility: dumping to "hubris.core.control_plane_agent.1"
humility: dumped 661.38KB in 0 seconds

# Confirm we have the same tasks
root@oxz_switch0:/tmp# /tmp/humility-0.12.0-probless -d hubris.core.control_plane_agent.1 tasks
humility: attached to dump
system time = 12942213
ID TASK                       GEN PRI STATE    
19 control_plane_agent          0   7 recv, notif: usart-irq(irq37) socket timer

# Rehydrated file matches the original
root@oxz_switch0:/tmp# shasum hubris.core.control_plane_agent.*
00a066ce8b911b09579648847650b096dd594831  hubris.core.control_plane_agent.0
00a066ce8b911b09579648847650b096dd594831  hubris.core.control_plane_agent.1

@@ -505,6 +505,7 @@ impl BackgroundTasksInitializer {
task_impl: Box::new(
support_bundle_collector::SupportBundleCollector::new(
datastore.clone(),
resolver.clone(),
@wfchandler (Contributor, Author) commented on May 20, 2025:

Maybe this isn't the best approach, given the conversation in the control plane huddle today. On the other hand, we're very unlikely to be collecting bundles fast enough to have a real impact.

Collaborator replied:

I'm fine to punt on this until we sort out the progenitor integration with qorb more holistically

@wfchandler wfchandler marked this pull request as ready for review May 20, 2025 20:55
@wfchandler wfchandler requested review from papertigers and smklein and removed request for papertigers May 20, 2025 20:55
tokio::fs::create_dir_all(&sp_dumps_dir).await.with_context(|| {
format!("failed to create SP task dump directory {sp_dumps_dir}")
})?;
let sp_dumps_fut =
Collaborator commented:

We should strongly consider modifying SupportBundleCollectionReport in nexus/types/src/internal_api/background.rs to indicate whether collection of the SP dumps succeeded.
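
For illustration, a minimal sketch of what that might look like; the derives, the elided existing fields, and the new field name are assumptions rather than the actual definition in nexus/types/src/internal_api/background.rs:

    use serde::{Deserialize, Serialize};

    #[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
    pub struct SupportBundleCollectionReport {
        // ... existing fields describing the bundle collection ...

        /// True if task dumps were collected from every SP that reported
        /// having any. (Hypothetical field; a per-SP error list could be
        /// used instead if a single boolean is too coarse.)
        pub collected_sp_task_dumps: bool,
    }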

Comment on lines 625 to 628
let sp_dumps_dir = dir.path().join("sp_task_dumps");
tokio::fs::create_dir_all(&sp_dumps_dir).await.with_context(|| {
format!("failed to create SP task dump directory {sp_dumps_dir}")
})?;
Collaborator commented:

This looks replicated from above? We should define this variable / create_dir_all in one spot, perhaps?
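
One way to deduplicate this, sketched with a hypothetical helper; the helper name and exact signature are assumptions, not code from the PR:

    use anyhow::Context;
    use camino::{Utf8Path, Utf8PathBuf};

    /// Create (if needed) and return the directory that SP task dumps are
    /// written into for this bundle.
    async fn sp_dumps_dir(bundle_dir: &Utf8Path) -> anyhow::Result<Utf8PathBuf> {
        let dir = bundle_dir.join("sp_task_dumps");
        tokio::fs::create_dir_all(&dir).await.with_context(|| {
            format!("failed to create SP task dump directory {dir}")
        })?;
        Ok(dir)
    }

Both call sites could then use the returned path instead of repeating the join and create_dir_all.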

.context("failed to get list of SPs from MGS")?
.into_inner();

let mut futures = futures::stream::iter(all_sps.into_iter())
Collaborator commented:

I recommend checking out parallel-task-set, which we added relatively recently, to perform this task-saving operation in parallel.

(It should be quite similar to using buffer_unordered, but it will actually spawn a tokio task for each SP, and we can still cap the maximum amount of parallelism.)
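
For reference, a minimal sketch of the buffer_unordered shape this is being compared against. SpIdentifier, save_sp_dumps, and the concurrency limit are stand-ins inferred from the surrounding diff, not the PR's actual signatures; the practical difference with parallel-task-set is that it spawns a real tokio task per SP instead of polling all of these futures on one task.

    use camino::Utf8Path;
    use futures::StreamExt;

    async fn save_all_sp_dumps(
        all_sps: Vec<SpIdentifier>,
        sp_dumps_dir: &Utf8Path,
    ) -> anyhow::Result<()> {
        // Assumed cap on how many SPs we talk to at once.
        const MAX_CONCURRENT_SPS: usize = 16;

        let results: Vec<anyhow::Result<()>> = futures::stream::iter(all_sps)
            .map(|sp| save_sp_dumps(sp, sp_dumps_dir))
            .buffer_unordered(MAX_CONCURRENT_SPS)
            .collect()
            .await;

        for result in results {
            result?;
        }
        Ok(())
    }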

sp_dumps_dir: &Utf8Path,
) -> anyhow::Result<()> {
let dump_count = mgs_client
.sp_task_dump_count(sp.type_, sp.slot)
Collaborator commented:

When / how are these task dumps deleted? Just trying to understand whether there's a TOCTTOU issue between "get the number of dumps" and "iterate over them".
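
To make the concern concrete, a sketch of the count-then-fetch pattern in question; sp_task_dump_get is a hypothetical name used only to illustrate the window, not a confirmed MGS endpoint:

    // If a dump is deleted or replaced between these two steps, the
    // indices fetched below may no longer line up with what was counted,
    // or a fetch may fail outright.
    let dump_count =
        mgs_client.sp_task_dump_count(sp.type_, sp.slot).await?.into_inner();

    for i in 0..dump_count {
        let dump = mgs_client.sp_task_dump_get(sp.type_, sp.slot, i).await?;
        // ... serialize `dump` into sp_dumps_dir/dump-{i}.zip ...
    }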

@wfchandler wfchandler requested a review from smklein May 21, 2025 23:33