-
Notifications
You must be signed in to change notification settings - Fork 43
Capture SP task dumps in support bundles #8177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
7380a0e
to
a4b0acf
Compare
Update the support bundle collector to capture task dumps from the SPs.
a4b0acf
to
dbb6669
Compare
Tested on Dublin:
|
@@ -505,6 +505,7 @@ impl BackgroundTasksInitializer { | |||
task_impl: Box::new( | |||
support_bundle_collector::SupportBundleCollector::new( | |||
datastore.clone(), | |||
resolver.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this isn't the best approach, given the conversation in the control plane huddle today. On the other hand, we're very unlikely to be collecting bundles fast enough to have a real impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine to punt on this until we sort out the progenitor integration with qorb more holistically
tokio::fs::create_dir_all(&sp_dumps_dir).await.with_context(|| { | ||
format!("failed to create SP task dump directory {sp_dumps_dir}") | ||
})?; | ||
let sp_dumps_fut = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should strongly consider modifying SupportBundleCollectionReport
in nexus/types/src/internal_api/background.rs
to identify whether or not collection of the SP dumps was successful or not.
let sp_dumps_dir = dir.path().join("sp_task_dumps"); | ||
tokio::fs::create_dir_all(&sp_dumps_dir).await.with_context(|| { | ||
format!("failed to create SP task dump directory {sp_dumps_dir}") | ||
})?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks replicated from above? We should define this variable / create_dir_all in one spot, perhaps?
@@ -505,6 +505,7 @@ impl BackgroundTasksInitializer { | |||
task_impl: Box::new( | |||
support_bundle_collector::SupportBundleCollector::new( | |||
datastore.clone(), | |||
resolver.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine to punt on this until we sort out the progenitor integration with qorb more holistically
.context("failed to get list of SPs from MGS")? | ||
.into_inner(); | ||
|
||
let mut futures = futures::stream::iter(all_sps.into_iter()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend checking out parallel-task-set
, which we added relatively recently, to perform this task-saving operation in parallel.
(It should be quite similar to using buffer_unordered
, but it'll actually spawn tokio tasks for each SP - but we can still set a limit on the maximum amount of parallelism)
sp_dumps_dir: &Utf8Path, | ||
) -> anyhow::Result<()> { | ||
let dump_count = mgs_client | ||
.sp_task_dump_count(sp.type_, sp.slot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When / how are these task dumps deleted? Just trying to understand if there's a TOCTTOU issue between "get # of tasks" vs "iterate over them"
Update the support bundle collector to capture task dumps from the SPs.