-
Notifications
You must be signed in to change notification settings - Fork 43
[sled-agent] Improve Error Messages! #966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
So I realized something kinda interesting while sorting through these errors - "who is responsible for reporting context" has been fairly unclear. To be more specific: If module This seems like a classic "caller vs callee" saving convention. For my in-progress PR addressing errors, I am opting to put this responsibility on the callee. In doing so, errors can be more liberally propagated upwards, as their error context should be fairly "self-contained". This choice was partially motivated out of frustration with errors like let f = File::open("foo.txt")?; Can propagate information without including arguments, which makes debugging a pain. In this case, an error could be seen as "OS error: not found" without actually indicating what file was not found. |
For what it's worth, I like this pattern and I've been trying to do that as well. There's a DRY argument for it too: why have every caller construct that context (in different ways, most likely) when the callee has the information to do it? I'm also a big fan of each layer of the call stack adding its own context. I too wish
With the idea being that whatever kind of error |
I think the big question to me would be: "What's the new type of the error?" - this seems like the most prominent difficulty to me. TL;DR: I really like that "extra context" API, but I also like using fine-grained error types, and providing a trait to add context that is workable across many different error types seems both handy + somewhat tricky to do. |
@smklein wrote:
Yeah, I think that's true. I was envisioning it would return the exact same type (and variant) of error. That's critical -- in order for people to be able to add this kind of context freely, doing so needs to preserve the semantics and type signature of their function. I think you're right that doing this is more challenging with more fine-grained error types and I think that's a cost to be considered in deciding whether to use more fine-grained error types. @david-crespo wrote:
(This is starting to get pretty far away from just producing better error messages -- sorry for starting this here.) Yeah, there's some ugliness with using a big enum for a bunch of different interfaces, some of which are pretty unrelated to each other. I've heard the advice of creating specialized error types for different interfaces, which can lead to more robust code because the type actually expresses what can really happen. But I've always seen this presented in pretty simple examples, with a specific function or two with a couple of failure modes each. When we have that situation, I agree that we should use more fine-grained error types (e.g., I'm less clear on how to apply this idea to a system like Nexus with lots of functions in different subsystems that can all produce many of the same errors that are essential to handle correctly (at least: authn/authz failures, the "internal server error" case (i.e., "my invariants were violated by something outside my control") and "service unavailable" (i.e., "I can't do this right now because something I depend on isn't working")). I'm a little worried that with fine-grained error types, people will tend to bucket these into a generic Moreover, the benefit of using fine-grained error types is that the type accurately describes what can happen, which is useful to a degree. But outside of the simple cases, there's almost always an explicit list of handled errors and everything else is propagated, retried, or otherwise handled in a generic way. What concrete problem would a more fine-grained type solve there? There's also value in being able to talk about errors uniformly, in terms of semantics (e.g., "this was the caller's fault" vs. "this was the server's fault"), metrics (e.g., "how many 503-like errors did we see, regardless of whether they were from API requests"), the context item I mentioned above, etc. This doesn't necessitate using the exact same type but it's a convenient way to do it. |
Fixes #966 General principles: - Errors should be "self-contained" as much as possible - if a function returns an error, the error alone should provide enough context to identify "where and why the error happened". - For externally-defined errors, this often means wrapping the error with a string message, name, or other identifying information - usage of `error(transparent)` and `thiserror`'s `#[from]` has been reduced significantly. These attributes *are* permitted on "high-level" aggregation services, but only when operating on errors that are sufficiently self-describing.
Crucible updates all Crucible connections should set TCP_NODELAY (#983) Use a fixed size for tag and nonce (#957) Log crucible opts on start, order crutest options (#974) Lock the Downstairs less (#966) Cache dirty flag locally, reducing SQLite operations (#970) Make stats mutex synchronous (#961) Optimize requeue during flow control conditions (#962) Update Rust crate base64 to 0.21.4 (#950) Do less in control (#949) Fix --flush-per-blocks (#959) Fast dependency checking (#916) Update actions/checkout action to v4 (#960) Use `cargo hakari` for better workspace deps (#956) Update actions/checkout digest to 8ade135 (#939) Cache block size in Guest (#947) Update Rust crate ringbuffer to 0.15.0 (#954) Update Rust crate toml to 0.8 (#955) Update Rust crate reedline to 0.24.0 (#953) Update Rust crate libc to 0.2.148 (#952) Update Rust crate indicatif to 0.17.7 (#951) Remove unused async (#943) Use a synchronous mutex for bw/iop_tokens (#946) Make flush ID non-locking (#945) Use `oneshot` channels instead of `mpsc` for notification (#918) Use a strong type for upstairs negotiation (#941) Add a "dynamometer" option to crucible-downstairs (#931) Get new work and active count in one lock (#938) A bunch of misc test cleanup stuff (#937) Wait for a snapshot to finish on all downstairs (#920) dsc and clippy cleanup. (#935) No need to sort ackable_work (#934) Use a strong type for repair ID (#928) Keep new jobs sorted (#929) Remove state_count function on Downstairs (#927) Small cleanup to IOStateCount (#932) let cmon and IOStateCount use ClientId (#930) Fast return for zero length IOs (#926) Use a strong type for client ID (#925) A few Crucible Agent fixes (#922) Use a newtype for `JobId` (#919) Don't pass MutexGuard into functions (#917) Crutest updates, rename tests, new options (#911) Propolis updates Update tungstenite crates to 0.20 Use `strum` crate for enum-related utilities Wire up bits for CPUID customization PHD: improve artifact store (#529) Revert abort-on-panic in 'dev' cargo profile
Crucible updates all Crucible connections should set TCP_NODELAY (#983) Use a fixed size for tag and nonce (#957) Log crucible opts on start, order crutest options (#974) Lock the Downstairs less (#966) Cache dirty flag locally, reducing SQLite operations (#970) Make stats mutex synchronous (#961) Optimize requeue during flow control conditions (#962) Update Rust crate base64 to 0.21.4 (#950) Do less in control (#949) Fix --flush-per-blocks (#959) Fast dependency checking (#916) Update actions/checkout action to v4 (#960) Use `cargo hakari` for better workspace deps (#956) Update actions/checkout digest to 8ade135 (#939) Cache block size in Guest (#947) Update Rust crate ringbuffer to 0.15.0 (#954) Update Rust crate toml to 0.8 (#955) Update Rust crate reedline to 0.24.0 (#953) Update Rust crate libc to 0.2.148 (#952) Update Rust crate indicatif to 0.17.7 (#951) Remove unused async (#943) Use a synchronous mutex for bw/iop_tokens (#946) Make flush ID non-locking (#945) Use `oneshot` channels instead of `mpsc` for notification (#918) Use a strong type for upstairs negotiation (#941) Add a "dynamometer" option to crucible-downstairs (#931) Get new work and active count in one lock (#938) A bunch of misc test cleanup stuff (#937) Wait for a snapshot to finish on all downstairs (#920) dsc and clippy cleanup. (#935) No need to sort ackable_work (#934) Use a strong type for repair ID (#928) Keep new jobs sorted (#929) Remove state_count function on Downstairs (#927) Small cleanup to IOStateCount (#932) let cmon and IOStateCount use ClientId (#930) Fast return for zero length IOs (#926) Use a strong type for client ID (#925) A few Crucible Agent fixes (#922) Use a newtype for `JobId` (#919) Don't pass MutexGuard into functions (#917) Crutest updates, rename tests, new options (#911) Propolis updates Update tungstenite crates to 0.20 Use `strum` crate for enum-related utilities Wire up bits for CPUID customization PHD: improve artifact store (#529) Revert abort-on-panic in 'dev' cargo profile --------- Co-authored-by: Alan Hanson <alan@oxide.computer>
@jgallagher brought up a good point last week, that our errors in Omicron (often making use of
thiserror
, to merely propagate errors) lose a fair bit of context.This issue tracks auditing error messages for the following criteria:
Often, merely attaching an auxiliary "message string" is sufficient to uncover this context.
To be audited (this is serving as a checklist for me as I work through):
The text was updated successfully, but these errors were encountered: