Skip to content

[sled-agent] Improve Error Messages! #966

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
19 of 20 tasks
smklein opened this issue Apr 24, 2022 · 5 comments · Fixed by #967
Closed
19 of 20 tasks

[sled-agent] Improve Error Messages! #966

smklein opened this issue Apr 24, 2022 · 5 comments · Fixed by #967
Assignees
Labels
Sled Agent Related to the Per-Sled Configuration and Management

Comments

@smklein
Copy link
Collaborator

smklein commented Apr 24, 2022

@jgallagher brought up a good point last week, that our errors in Omicron (often making use of thiserror, to merely propagate errors) lose a fair bit of context.

This issue tracks auditing error messages for the following criteria:

  • If an error is propagated, it it clear where it came from?
  • If an error is propagated, is it clear what caused it to fail?

Often, merely attaching an auxiliary "message string" is sufficient to uncover this context.

To be audited (this is serving as a checklist for me as I work through):

  • ./src/illumos/dladm.rs:pub enum Error {
  • ./src/illumos/zone.rs:pub enum Error {
  • ./src/illumos/zpool.rs:pub enum ParseError {
  • ./src/illumos/zpool.rs:pub enum Error {
  • ./src/illumos/addrobj.rs:pub enum Error {
  • ./src/illumos/running_zone.rs:pub enum Error {
  • ./src/illumos/mod.rs:pub enum ExecutionError {
  • ./src/illumos/zfs.rs:pub enum Error {
  • ./src/bootstrap/agent.rs:pub enum BootstrapError {
  • ./src/bootstrap/trust_quorum/error.rs:pub enum TrustQuorumError {
  • ./src/bootstrap/spdm/error.rs:pub enum SpdmError {
    • Ignoring intentionally; SPDM is planned to be deprecated
  • ./src/bootstrap/params.rs:pub enum SubnetError {
  • ./src/rack_setup/service.rs:pub enum SetupServiceError {
  • ./src/storage_manager.rs:pub enum Error {
  • ./src/instance.rs:pub enum Error {
  • ./src/updates.rs:pub enum Error {
  • ./src/services.rs:pub enum Error {
  • ./src/instance_manager.rs:pub enum Error {
  • ./src/config.rs:pub enum ConfigError {
  • ./src/sled_agent.rs:pub enum Error {
@smklein smklein added the Sled Agent Related to the Per-Sled Configuration and Management label Apr 24, 2022
@smklein smklein self-assigned this Apr 24, 2022
@smklein
Copy link
Collaborator Author

smklein commented Apr 24, 2022

So I realized something kinda interesting while sorting through these errors - "who is responsible for reporting context" has been fairly unclear.

To be more specific: If module A calls an error-returning function in module B with arguments args, and something goes wrong, is it the responsibility of B to report the context (including any info from args) in an error, or it is up to A to attach this context as things go up the error chain?

This seems like a classic "caller vs callee" saving convention. For my in-progress PR addressing errors, I am opting to put this responsibility on the callee. In doing so, errors can be more liberally propagated upwards, as their error context should be fairly "self-contained".

This choice was partially motivated out of frustration with errors like std::io::Error, where calls such as:

let f = File::open("foo.txt")?;

Can propagate information without including arguments, which makes debugging a pain. In this case, an error could be seen as "OS error: not found" without actually indicating what file was not found.

@davepacheco
Copy link
Collaborator

To be more specific: If module A calls an error-returning function in module B with arguments args, and something goes wrong, is it the responsibility of B to report the context (including any info from args) in an error, or it is up to A to attach this context as things go up the error chain?

This seems like a classic "caller vs callee" saving convention. For my in-progress PR addressing errors, I am opting to put this responsibility on the callee. In doing so, errors can be more liberally propagated upwards, as their error context should be fairly "self-contained".

For what it's worth, I like this pattern and I've been trying to do that as well. There's a DRY argument for it too: why have every caller construct that context (in different ways, most likely) when the callee has the information to do it?

I'm also a big fan of each layer of the call stack adding its own context. I too wish File::open included the filename in the message. Then if you're in a wrapper function that's loading a config file specifically, it could add context "loading config", knowing the full error will say loading config: open "foo.txt": no such file or directory. If you're doing this while initializing a server, you could add "initializing server", knowing the lower layers will say what they're doing and the higher layers will wrap with more information.

anyhow's context() is really useful for this, but obviously only works for the generic anyhow errors. I've wondered about having something similar to anyhow::Context for our public Error type that prepends the string to the internal message of whatever error you've got. So you could do something like:

datastore
    .vpc_delete(opctx, &authz_vpc)
    .internal_context("deleting VPC because we're deleting the Project")?

With the idea being that whatever kind of error vpc_delete() returns, the internal message is prepended with that string, similar to what anyhow::Context does. What do you think?

@smklein
Copy link
Collaborator Author

smklein commented Apr 25, 2022

With the idea being that whatever kind of error vpc_delete() returns, the internal message is prepended with that string, similar to what anyhow::Context does. What do you think?

I think the big question to me would be: "What's the new type of the error?" - this seems like the most prominent difficulty to me. anyhow::Context is a sealed trait, which only supports conversion to the anyhow::Error concrete type. So, in a limited capacity: If we have commonly used errors (like in Nexus) this would be easy to add, but this seems more challenging to use in cases where we have fine-grained errors.

TL;DR: I really like that "extra context" API, but I also like using fine-grained error types, and providing a trait to add context that is workable across many different error types seems both handy + somewhat tricky to do.

@david-crespo
Copy link
Contributor

This feels related to #347 — especially the idea of each caller adding its own context as you move up the stack. On the Nexus side, the fact that datastore is returning external API error structs feels wrong. It looks from #967 that sled agent is doing it better.

@davepacheco
Copy link
Collaborator

@smklein wrote:

With the idea being that whatever kind of error vpc_delete() returns, the internal message is prepended with that string, similar to what anyhow::Context does. What do you think?

I think the big question to me would be: "What's the new type of the error?" - this seems like the most prominent difficulty to me. anyhow::Context is a sealed trait, which only supports conversion to the anyhow::Error concrete type. So, in a limited capacity: If we have commonly used errors (like in Nexus) this would be easy to add, but this seems more challenging to use in cases where we have fine-grained errors.

Yeah, I think that's true. I was envisioning it would return the exact same type (and variant) of error. That's critical -- in order for people to be able to add this kind of context freely, doing so needs to preserve the semantics and type signature of their function. I think you're right that doing this is more challenging with more fine-grained error types and I think that's a cost to be considered in deciding whether to use more fine-grained error types.

@david-crespo wrote:

This feels related to #347 — especially the idea of each caller adding its own context as you move up the stack. On the Nexus side, the fact that datastore is returning external API error structs feels wrong. It looks from #967 that sled agent is doing it better.

(This is starting to get pretty far away from just producing better error messages -- sorry for starting this here.)

Yeah, there's some ugliness with using a big enum for a bunch of different interfaces, some of which are pretty unrelated to each other. I've heard the advice of creating specialized error types for different interfaces, which can lead to more robust code because the type actually expresses what can really happen. But I've always seen this presented in pretty simple examples, with a specific function or two with a couple of failure modes each. When we have that situation, I agree that we should use more fine-grained error types (e.g., authn::Error).

I'm less clear on how to apply this idea to a system like Nexus with lots of functions in different subsystems that can all produce many of the same errors that are essential to handle correctly (at least: authn/authz failures, the "internal server error" case (i.e., "my invariants were violated by something outside my control") and "service unavailable" (i.e., "I can't do this right now because something I depend on isn't working")). I'm a little worried that with fine-grained error types, people will tend to bucket these into a generic SomethingThatDoesntHaveToDoWithMe variant that doesn't handle all of these properly. If you have some ideas for how this could look, that'd be really helpful.

Moreover, the benefit of using fine-grained error types is that the type accurately describes what can happen, which is useful to a degree. But outside of the simple cases, there's almost always an explicit list of handled errors and everything else is propagated, retried, or otherwise handled in a generic way. What concrete problem would a more fine-grained type solve there?

There's also value in being able to talk about errors uniformly, in terms of semantics (e.g., "this was the caller's fault" vs. "this was the server's fault"), metrics (e.g., "how many 503-like errors did we see, regardless of whether they were from API requests"), the context item I mentioned above, etc. This doesn't necessitate using the exact same type but it's a convenient way to do it.

smklein added a commit that referenced this issue May 5, 2022
Fixes #966

General principles:
- Errors should be "self-contained" as much as possible - if a function returns an error, the error alone should provide enough context to identify "where and why the error happened".
- For externally-defined errors, this often means wrapping the error with a string message, name, or other identifying information
- usage of `error(transparent)` and `thiserror`'s `#[from]` has been reduced significantly. These attributes *are* permitted on "high-level" aggregation services, but only when operating on errors that are sufficiently self-describing.
leftwo pushed a commit that referenced this issue Oct 4, 2023
Crucible updates
    all Crucible connections should set TCP_NODELAY (#983)
    Use a fixed size for tag and nonce (#957)
    Log crucible opts on start, order crutest options (#974)
    Lock the Downstairs less (#966)
    Cache dirty flag locally, reducing SQLite operations (#970)
    Make stats mutex synchronous (#961)
    Optimize requeue during flow control conditions (#962)
    Update Rust crate base64 to 0.21.4 (#950)
    Do less in control (#949)
    Fix --flush-per-blocks (#959)
    Fast dependency checking (#916)
    Update actions/checkout action to v4 (#960)
    Use `cargo hakari` for better workspace deps (#956)
    Update actions/checkout digest to 8ade135 (#939)
    Cache block size in Guest (#947)
    Update Rust crate ringbuffer to 0.15.0 (#954)
    Update Rust crate toml to 0.8 (#955)
    Update Rust crate reedline to 0.24.0 (#953)
    Update Rust crate libc to 0.2.148 (#952)
    Update Rust crate indicatif to 0.17.7 (#951)
    Remove unused async (#943)
    Use a synchronous mutex for bw/iop_tokens (#946)
    Make flush ID non-locking (#945)
    Use `oneshot` channels instead of `mpsc` for notification (#918)
    Use a strong type for upstairs negotiation (#941)
    Add a "dynamometer" option to crucible-downstairs (#931)
    Get new work and active count in one lock (#938)
    A bunch of misc test cleanup stuff (#937)
    Wait for a snapshot to finish on all downstairs (#920)
    dsc and clippy cleanup. (#935)
    No need to sort ackable_work (#934)
    Use a strong type for repair ID (#928)
    Keep new jobs sorted (#929)
    Remove state_count function on Downstairs (#927)
    Small cleanup to IOStateCount (#932)
    let cmon and IOStateCount use ClientId (#930)
    Fast return for zero length IOs (#926)
    Use a strong type for client ID (#925)
    A few Crucible Agent fixes (#922)
    Use a newtype for `JobId` (#919)
    Don't pass MutexGuard into functions (#917)
    Crutest updates, rename tests, new options (#911)

Propolis updates
    Update tungstenite crates to 0.20
    Use `strum` crate for enum-related utilities
    Wire up bits for CPUID customization
    PHD: improve artifact store (#529)
    Revert abort-on-panic in 'dev' cargo profile
leftwo added a commit that referenced this issue Oct 5, 2023
Crucible updates
    all Crucible connections should set TCP_NODELAY (#983)
    Use a fixed size for tag and nonce (#957)
    Log crucible opts on start, order crutest options (#974)
    Lock the Downstairs less (#966)
    Cache dirty flag locally, reducing SQLite operations (#970)
    Make stats mutex synchronous (#961)
    Optimize requeue during flow control conditions (#962)
    Update Rust crate base64 to 0.21.4 (#950)
    Do less in control (#949)
    Fix --flush-per-blocks (#959)
    Fast dependency checking (#916)
    Update actions/checkout action to v4 (#960)
    Use `cargo hakari` for better workspace deps (#956)
    Update actions/checkout digest to 8ade135 (#939)
    Cache block size in Guest (#947)
    Update Rust crate ringbuffer to 0.15.0 (#954)
    Update Rust crate toml to 0.8 (#955)
    Update Rust crate reedline to 0.24.0 (#953)
    Update Rust crate libc to 0.2.148 (#952)
    Update Rust crate indicatif to 0.17.7 (#951)
    Remove unused async (#943)
    Use a synchronous mutex for bw/iop_tokens (#946)
    Make flush ID non-locking (#945)
    Use `oneshot` channels instead of `mpsc` for notification (#918)
    Use a strong type for upstairs negotiation (#941)
    Add a "dynamometer" option to crucible-downstairs (#931)
    Get new work and active count in one lock (#938)
    A bunch of misc test cleanup stuff (#937)
    Wait for a snapshot to finish on all downstairs (#920)
    dsc and clippy cleanup. (#935)
    No need to sort ackable_work (#934)
    Use a strong type for repair ID (#928)
    Keep new jobs sorted (#929)
    Remove state_count function on Downstairs (#927)
    Small cleanup to IOStateCount (#932)
    let cmon and IOStateCount use ClientId (#930)
    Fast return for zero length IOs (#926)
    Use a strong type for client ID (#925)
    A few Crucible Agent fixes (#922)
    Use a newtype for `JobId` (#919)
    Don't pass MutexGuard into functions (#917)
    Crutest updates, rename tests, new options (#911)

Propolis updates
    Update tungstenite crates to 0.20
    Use `strum` crate for enum-related utilities
    Wire up bits for CPUID customization
    PHD: improve artifact store (#529)
    Revert abort-on-panic in 'dev' cargo profile

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sled Agent Related to the Per-Sled Configuration and Management
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants