
Do not post summary table to PR if both categories are not relevant #1270


Merged: 1 commit into master from summary-relevance, Apr 4, 2022

Conversation

@Kobzol (Contributor) commented on Apr 2, 2022

Without this check, something like this could happen, which seems confusing to readers: nothing was relevant, yet the table was shown.

@Kobzol requested a review from rylev on April 2, 2022, 09:55
@nnethercote (Contributor)

What does "relevant" mean? That's a third categorization after "significant" and "dodgy"; I'm totally confused.

@Kobzol (Contributor, Author) commented on Apr 2, 2022

So this is my understanding:

  • A test comparison (basically two values, before and after, for a specific benchmark) is said to be significant if its relative change is equal to or greater than the significance threshold of that benchmark. The threshold is computed from the benchmark's historical data, using its variance and IQR (see the sketch below).
  • The magnitude of a comparison can be very small, small, medium, large or very large. Again, the magnitude is computed from the relative change and the significance threshold.
  • The summary used here (ComparisonSummary) only considers significant comparisons that have a magnitude of at least small. All other results are ignored. If no such comparisons were found, we say that "no relevant changes" were found and the analysis ends.
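
To make the first pass a bit more concrete, here is a minimal sketch of the significance check as I understand it. The names (TestComparison, relative_change, significance_threshold) are illustrative only, not the actual rustc-perf types:

struct TestComparison {
    before: f64,
    after: f64,
    // Derived from this benchmark's historical data (variance and IQR).
    significance_threshold: f64,
}

impl TestComparison {
    fn relative_change(&self) -> f64 {
        (self.after - self.before) / self.before
    }

    // Significant: the relative change meets or exceeds the benchmark's threshold.
    fn is_significant(&self) -> bool {
        self.relative_change().abs() >= self.significance_threshold
    }
}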

After this first pass is performed, an additional check is done.

  • The confidence of the whole benchmark category (previously of the whole benchmark run) is computed. The confidence can be one of MaybeRelevant, ProbablyRelevant or DefinitelyRelevant. It is computed by counting how many comparisons fall into each magnitude. For example, if there is at least a single large (or larger) change, the confidence will be DefinitelyRelevant.
  • Then we check whether the whole category is considered relevant (sketched after this list). If the perf. run was for a master commit, we consider it relevant only if it's DefinitelyRelevant. If the perf. run was for a PR, we consider it relevant if it's DefinitelyRelevant or ProbablyRelevant.
  • If we found that the whole category is not relevant, we say "no relevant changes found. XYZ results were statistically significant but too small to be relevant".
  • If we found that the whole category is relevant, we show e.g. "🎉 relevant improvements found".
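
For the second pass, here is a rough sketch of the relevance decision, again with illustrative names rather than the exact rustc-perf identifiers:

enum Confidence {
    MaybeRelevant,
    ProbablyRelevant,
    DefinitelyRelevant,
}

fn category_is_relevant(confidence: &Confidence, is_master_commit: bool) -> bool {
    match confidence {
        Confidence::DefinitelyRelevant => true,
        // PR (try) runs are more sensitive: ProbablyRelevant also counts.
        Confidence::ProbablyRelevant => !is_master_commit,
        Confidence::MaybeRelevant => false,
    }
}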

Before, the summary table was shown even if neither category was found to be relevant. After this PR, the summary table will be shown only if at least one of the categories was found to be relevant.

@nnethercote (Contributor)

Thanks for the detailed description. IIUC:

  • Significance and magnitude apply to a single benchmark run (i.e. a pair of before/after measurements).
  • Confidence applies to all the runs within a category, i.e. it's an aggregate result. Though, a single result that is significant and has a magnitude >= small could be considered relevant? So maybe it has both single and aggregate meanings?
  • Dodginess is ???

Some observations and questions:

  • If confidence is one of MaybeRelevant, ProbablyRelevant or DefinitelyRelevant, perhaps it should be called relevance instead?
  • Why are relevances treated differently for PRs vs master commits? Seems to me like they should be treated the same.
  • "the magnitude is computed from the relative change and the significance threshold." Why is the significance threshold involved here? It seems to me like it should not be involved.

@nnethercote (Contributor)

I also wonder if "interesting" is a better word than "relevant". Its meaning seems more obvious, more directly indicating "worth paying attention to".

@Kobzol (Contributor, Author) commented on Apr 4, 2022

Well, those are some good questions :D But I'm probably not the one to answer them.

Confidence applies to all the runs within a category, i.e. it's an aggregate result. Though, a single result that is significant and has a magnitude >= small could be considered relevant? So maybe it has both single and aggregate meanings?

I would say that confidence only has an aggregate meaning. It's just that if a group of results contains some results with a very large magnitude, the whole group will be considered relevant (yeah, the names should be better aligned here).

Dodginess is ???

/// Whether we can trust this benchmark or not
fn is_dodgy(&self) -> bool {
    // If changes are judged significant only exceeding 0.2%, then the
    // benchmark as a whole is dodgy.
    self.significance_threshold() * 100.0 > 0.2
}

If confidence is one of MaybeRelevant, ProbablyRelevant or DefinitelyRelevant, perhaps it should be called relevance instead?

I agree.

Why are relevances treated differently for PRs vs master commits? Seems to me like they should be treated the same.

CC @rylev, who introduced the change AFAIK.

"the magnitude is computed from the relative change and the significance threshold." Why is the significance threshold involved here? It seems to me like it should not be involved.

Well, you need some historical data to judge whether a change is of small or large magnitude, I suppose. Using hardcoded thresholds probably wouldn't work for all benchmarks. This historical data is provided by the significance threshold, which is computed from the historical variance (rough illustration below).
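
As a rough illustration of that idea - the exact bucket boundaries below are made up, and only the principle of scaling by the significance threshold (i.e. by the benchmark's historical noise) follows the description above:

enum Magnitude {
    VerySmall,
    Small,
    Medium,
    Large,
    VeryLarge,
}

fn magnitude(relative_change: f64, significance_threshold: f64) -> Magnitude {
    // How many times larger than the noise threshold is the change?
    let over_threshold = relative_change.abs() / significance_threshold;
    match over_threshold {
        x if x < 1.5 => Magnitude::VerySmall,
        x if x < 3.0 => Magnitude::Small,
        x if x < 6.0 => Magnitude::Medium,
        x if x < 12.0 => Magnitude::Large,
        _ => Magnitude::VeryLarge,
    }
}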

@rylev (Member) commented on Apr 4, 2022

First, I'd like to note that we have a glossary for these terms (#1266 makes some changes to this glossary). I'm always in favor of simplifying, but we're likely always going to have a need for a definition of terms, so I'd like to keep the glossary up to date if possible.

the names should be better aligned here

100% agreement here 😊

Why are relevances treated differently for PRs vs master commits? Seems to me like they should be treated the same.

We've come a long way in our confidence in our automatic classification of perf runs as "interesting" (using the newly suggested term instead of "relevant"). It used to be a lot harder to tell whether a run was "interesting" or not. At the time it was decided that for try runs we should be more sensitive, since there was an explicit request asking whether a particular change introduces performance changes. Runs on master commits are instead held to a higher bar - we wanted to inform people of a performance change only if we were definitely sure there were real performance changes.

Well, you need some historical data to judge whether a change is of small or large magnitude I suppose

This is correct.

@rylev rylev merged commit 8ba8456 into master Apr 4, 2022
@rylev rylev deleted the summary-relevance branch April 4, 2022 07:51