[sled-agent] Improve Error Messages! #966

smklein · 2022-04-24T16:23:19Z

smklein · 2022-04-24T23:22:44Z

So I realized something kinda interesting while sorting through these errors - "who is responsible for reporting context" has been fairly unclear.

To be more specific: If module A calls an error-returning function in module B with arguments args, and something goes wrong, is it the responsibility of B to report the context (including any info from args) in an error, or it is up to A to attach this context as things go up the error chain?

This seems like a classic "caller vs callee" saving convention. For my in-progress PR addressing errors, I am opting to put this responsibility on the callee. In doing so, errors can be more liberally propagated upwards, as their error context should be fairly "self-contained".

This choice was partially motivated out of frustration with errors like std::io::Error, where calls such as:

let f = File::open("foo.txt")?;

Can propagate information without including arguments, which makes debugging a pain. In this case, an error could be seen as "OS error: not found" without actually indicating what file was not found.

davepacheco · 2022-04-25T16:50:32Z

To be more specific: If module A calls an error-returning function in module B with arguments args, and something goes wrong, is it the responsibility of B to report the context (including any info from args) in an error, or it is up to A to attach this context as things go up the error chain?

This seems like a classic "caller vs callee" saving convention. For my in-progress PR addressing errors, I am opting to put this responsibility on the callee. In doing so, errors can be more liberally propagated upwards, as their error context should be fairly "self-contained".

For what it's worth, I like this pattern and I've been trying to do that as well. There's a DRY argument for it too: why have every caller construct that context (in different ways, most likely) when the callee has the information to do it?

I'm also a big fan of each layer of the call stack adding its own context. I too wish File::open included the filename in the message. Then if you're in a wrapper function that's loading a config file specifically, it could add context "loading config", knowing the full error will say loading config: open "foo.txt": no such file or directory. If you're doing this while initializing a server, you could add "initializing server", knowing the lower layers will say what they're doing and the higher layers will wrap with more information.

anyhow's context() is really useful for this, but obviously only works for the generic anyhow errors. I've wondered about having something similar to anyhow::Context for our public Error type that prepends the string to the internal message of whatever error you've got. So you could do something like:

datastore
    .vpc_delete(opctx, &authz_vpc)
    .internal_context("deleting VPC because we're deleting the Project")?

With the idea being that whatever kind of error vpc_delete() returns, the internal message is prepended with that string, similar to what anyhow::Context does. What do you think?

smklein · 2022-04-25T17:05:27Z

With the idea being that whatever kind of error vpc_delete() returns, the internal message is prepended with that string, similar to what anyhow::Context does. What do you think?

I think the big question to me would be: "What's the new type of the error?" - this seems like the most prominent difficulty to me. anyhow::Context is a sealed trait, which only supports conversion to the anyhow::Error concrete type. So, in a limited capacity: If we have commonly used errors (like in Nexus) this would be easy to add, but this seems more challenging to use in cases where we have fine-grained errors.

TL;DR: I really like that "extra context" API, but I also like using fine-grained error types, and providing a trait to add context that is workable across many different error types seems both handy + somewhat tricky to do.

david-crespo · 2022-04-25T18:26:16Z

This feels related to #347 — especially the idea of each caller adding its own context as you move up the stack. On the Nexus side, the fact that datastore is returning external API error structs feels wrong. It looks from #967 that sled agent is doing it better.

davepacheco · 2022-04-25T20:17:03Z

@smklein wrote:

With the idea being that whatever kind of error vpc_delete() returns, the internal message is prepended with that string, similar to what anyhow::Context does. What do you think?

I think the big question to me would be: "What's the new type of the error?" - this seems like the most prominent difficulty to me. anyhow::Context is a sealed trait, which only supports conversion to the anyhow::Error concrete type. So, in a limited capacity: If we have commonly used errors (like in Nexus) this would be easy to add, but this seems more challenging to use in cases where we have fine-grained errors.

Yeah, I think that's true. I was envisioning it would return the exact same type (and variant) of error. That's critical -- in order for people to be able to add this kind of context freely, doing so needs to preserve the semantics and type signature of their function. I think you're right that doing this is more challenging with more fine-grained error types and I think that's a cost to be considered in deciding whether to use more fine-grained error types.

@david-crespo wrote:

This feels related to #347 — especially the idea of each caller adding its own context as you move up the stack. On the Nexus side, the fact that datastore is returning external API error structs feels wrong. It looks from #967 that sled agent is doing it better.

(This is starting to get pretty far away from just producing better error messages -- sorry for starting this here.)

Yeah, there's some ugliness with using a big enum for a bunch of different interfaces, some of which are pretty unrelated to each other. I've heard the advice of creating specialized error types for different interfaces, which can lead to more robust code because the type actually expresses what can really happen. But I've always seen this presented in pretty simple examples, with a specific function or two with a couple of failure modes each. When we have that situation, I agree that we should use more fine-grained error types (e.g., authn::Error).

I'm less clear on how to apply this idea to a system like Nexus with lots of functions in different subsystems that can all produce many of the same errors that are essential to handle correctly (at least: authn/authz failures, the "internal server error" case (i.e., "my invariants were violated by something outside my control") and "service unavailable" (i.e., "I can't do this right now because something I depend on isn't working")). I'm a little worried that with fine-grained error types, people will tend to bucket these into a generic SomethingThatDoesntHaveToDoWithMe variant that doesn't handle all of these properly. If you have some ideas for how this could look, that'd be really helpful.

Moreover, the benefit of using fine-grained error types is that the type accurately describes what can happen, which is useful to a degree. But outside of the simple cases, there's almost always an explicit list of handled errors and everything else is propagated, retried, or otherwise handled in a generic way. What concrete problem would a more fine-grained type solve there?

There's also value in being able to talk about errors uniformly, in terms of semantics (e.g., "this was the caller's fault" vs. "this was the server's fault"), metrics (e.g., "how many 503-like errors did we see, regardless of whether they were from API requests"), the context item I mentioned above, etc. This doesn't necessitate using the exact same type but it's a convenient way to do it.

Fixes #966 General principles: - Errors should be "self-contained" as much as possible - if a function returns an error, the error alone should provide enough context to identify "where and why the error happened". - For externally-defined errors, this often means wrapping the error with a string message, name, or other identifying information - usage of `error(transparent)` and `thiserror`'s `#[from]` has been reduced significantly. These attributes *are* permitted on "high-level" aggregation services, but only when operating on errors that are sufficiently self-describing.

Crucible updates all Crucible connections should set TCP_NODELAY (#983) Use a fixed size for tag and nonce (#957) Log crucible opts on start, order crutest options (#974) Lock the Downstairs less (#966) Cache dirty flag locally, reducing SQLite operations (#970) Make stats mutex synchronous (#961) Optimize requeue during flow control conditions (#962) Update Rust crate base64 to 0.21.4 (#950) Do less in control (#949) Fix --flush-per-blocks (#959) Fast dependency checking (#916) Update actions/checkout action to v4 (#960) Use `cargo hakari` for better workspace deps (#956) Update actions/checkout digest to 8ade135 (#939) Cache block size in Guest (#947) Update Rust crate ringbuffer to 0.15.0 (#954) Update Rust crate toml to 0.8 (#955) Update Rust crate reedline to 0.24.0 (#953) Update Rust crate libc to 0.2.148 (#952) Update Rust crate indicatif to 0.17.7 (#951) Remove unused async (#943) Use a synchronous mutex for bw/iop_tokens (#946) Make flush ID non-locking (#945) Use `oneshot` channels instead of `mpsc` for notification (#918) Use a strong type for upstairs negotiation (#941) Add a "dynamometer" option to crucible-downstairs (#931) Get new work and active count in one lock (#938) A bunch of misc test cleanup stuff (#937) Wait for a snapshot to finish on all downstairs (#920) dsc and clippy cleanup. (#935) No need to sort ackable_work (#934) Use a strong type for repair ID (#928) Keep new jobs sorted (#929) Remove state_count function on Downstairs (#927) Small cleanup to IOStateCount (#932) let cmon and IOStateCount use ClientId (#930) Fast return for zero length IOs (#926) Use a strong type for client ID (#925) A few Crucible Agent fixes (#922) Use a newtype for `JobId` (#919) Don't pass MutexGuard into functions (#917) Crutest updates, rename tests, new options (#911) Propolis updates Update tungstenite crates to 0.20 Use `strum` crate for enum-related utilities Wire up bits for CPUID customization PHD: improve artifact store (#529) Revert abort-on-panic in 'dev' cargo profile

Crucible updates all Crucible connections should set TCP_NODELAY (#983) Use a fixed size for tag and nonce (#957) Log crucible opts on start, order crutest options (#974) Lock the Downstairs less (#966) Cache dirty flag locally, reducing SQLite operations (#970) Make stats mutex synchronous (#961) Optimize requeue during flow control conditions (#962) Update Rust crate base64 to 0.21.4 (#950) Do less in control (#949) Fix --flush-per-blocks (#959) Fast dependency checking (#916) Update actions/checkout action to v4 (#960) Use `cargo hakari` for better workspace deps (#956) Update actions/checkout digest to 8ade135 (#939) Cache block size in Guest (#947) Update Rust crate ringbuffer to 0.15.0 (#954) Update Rust crate toml to 0.8 (#955) Update Rust crate reedline to 0.24.0 (#953) Update Rust crate libc to 0.2.148 (#952) Update Rust crate indicatif to 0.17.7 (#951) Remove unused async (#943) Use a synchronous mutex for bw/iop_tokens (#946) Make flush ID non-locking (#945) Use `oneshot` channels instead of `mpsc` for notification (#918) Use a strong type for upstairs negotiation (#941) Add a "dynamometer" option to crucible-downstairs (#931) Get new work and active count in one lock (#938) A bunch of misc test cleanup stuff (#937) Wait for a snapshot to finish on all downstairs (#920) dsc and clippy cleanup. (#935) No need to sort ackable_work (#934) Use a strong type for repair ID (#928) Keep new jobs sorted (#929) Remove state_count function on Downstairs (#927) Small cleanup to IOStateCount (#932) let cmon and IOStateCount use ClientId (#930) Fast return for zero length IOs (#926) Use a strong type for client ID (#925) A few Crucible Agent fixes (#922) Use a newtype for `JobId` (#919) Don't pass MutexGuard into functions (#917) Crutest updates, rename tests, new options (#911) Propolis updates Update tungstenite crates to 0.20 Use `strum` crate for enum-related utilities Wire up bits for CPUID customization PHD: improve artifact store (#529) Revert abort-on-panic in 'dev' cargo profile --------- Co-authored-by: Alan Hanson <alan@oxide.computer>

smklein added the Sled Agent Related to the Per-Sled Configuration and Management label Apr 24, 2022

smklein self-assigned this Apr 24, 2022

smklein mentioned this issue Apr 25, 2022

[sled agent] Overhaul error handling to be more informative #967

Merged

smklein closed this as completed in #967 May 5, 2022

leftwo mentioned this issue Oct 4, 2023

Update Propolis and Crucible to latest #4195

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[sled-agent] Improve Error Messages! #966

[sled-agent] Improve Error Messages! #966

smklein commented Apr 24, 2022 •

edited

Loading

smklein commented Apr 24, 2022

davepacheco commented Apr 25, 2022

smklein commented Apr 25, 2022

david-crespo commented Apr 25, 2022

davepacheco commented Apr 25, 2022

[sled-agent] Improve Error Messages! #966

[sled-agent] Improve Error Messages! #966

Comments

smklein commented Apr 24, 2022 • edited Loading

smklein commented Apr 24, 2022

davepacheco commented Apr 25, 2022

smklein commented Apr 25, 2022

david-crespo commented Apr 25, 2022

davepacheco commented Apr 25, 2022

smklein commented Apr 24, 2022 •

edited

Loading