
Do not post summary table to PR if both categories are not relevant #1270


Merged: 1 commit into master from summary-relevance, Apr 4, 2022

Conversation

@Kobzol (Contributor) commented on Apr 2, 2022

Without this check, something like this could happen, which seems confusing to readers: nothing was relevant, yet the table was shown.

@Kobzol requested a review from rylev on April 2, 2022, 09:55
@nnethercote (Contributor)

What does "relevant" mean? That's a third categorization after "significant" and "dodgy"; I'm totally confused.

@Kobzol (Contributor, Author) commented on Apr 2, 2022

So this is my understanding:

  • A test comparison (basically two values, before and after, for a specific benchmark) is said to be significant if its relative change is equal to or greater than the significance threshold of that benchmark. The threshold is computed from the benchmark's historical data, using its variance and IQR (see the sketch below).
  • The magnitude of a comparison can be very small, small, medium, large or very large. Again, the magnitude is computed from the relative change and the significance threshold.
  • The summary used here (ComparisonSummary) only considers significant comparisons that have a magnitude of at least small. All other results are ignored. If no such comparisons were found, we say that "no relevant changes" were found and the analysis ends.
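
To make the first pass a bit more concrete, here is a minimal sketch of the significance check as I understand it. The names (TestComparison, relative_change, significance_threshold) are illustrative only, not the actual rustc-perf types:

struct TestComparison {
    before: f64,
    after: f64,
    // Derived from this benchmark's historical data (variance and IQR).
    significance_threshold: f64,
}

impl TestComparison {
    fn relative_change(&self) -> f64 {
        (self.after - self.before) / self.before
    }

    // Significant: the relative change meets or exceeds the benchmark's threshold.
    fn is_significant(&self) -> bool {
        self.relative_change().abs() >= self.significance_threshold
    }
}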

After this first pass is performed, an additional check is done.

  • The confidence of the whole benchmark category (previously of the whole benchmark run) is computed. The confidence can be one of MaybeRelevant, ProbablyRelevant or DefinitelyRelevant. It is computed by counting how many comparisons fall into each magnitude. For example, if there is at least a single large (or larger) change, the confidence will be DefinitelyRelevant.
  • Then we check whether the whole category is considered relevant (sketched after this list). If the perf. run was for a master commit, we consider it relevant only if it's DefinitelyRelevant. If the perf. run was for a PR, we consider it relevant if it's DefinitelyRelevant or ProbablyRelevant.
  • If we found that the whole category is not relevant, we say "no relevant changes found. XYZ results were statistically significant but too small to be relevant".
  • If we found that the whole category is relevant, we show e.g. "🎉 relevant improvements found".
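
For the second pass, here is a rough sketch of the relevance decision, again with illustrative names rather than the exact rustc-perf identifiers:

enum Confidence {
    MaybeRelevant,
    ProbablyRelevant,
    DefinitelyRelevant,
}

fn category_is_relevant(confidence: &Confidence, is_master_commit: bool) -> bool {
    match confidence {
        Confidence::DefinitelyRelevant => true,
        // PR (try) runs are more sensitive: ProbablyRelevant also counts.
        Confidence::ProbablyRelevant => !is_master_commit,
        Confidence::MaybeRelevant => false,
    }
}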

Before, the summary table was shown even if neither category was found to be relevant. After this PR, the summary table will be shown only if at least one of the categories was found to be relevant.

@nnethercote (Contributor)

Thanks for the detailed description. IIUC:

  • Significance and magnitude apply to a single benchmark run (i.e. a pair of before/after measurements).
  • Confidence applies to all the runs within a category, i.e. it's an aggregate result. Though, a single result that is significant and has a magnitude >= small could be considered relevant? So maybe it has both single and aggregate meanings?
  • Dodginess is ???

Some observations and questions:

  • If confidence is one of MaybeRelevant, ProbablyRelevant or DefinitelyRelevant, perhaps it should be called relevance instead?
  • Why are relevances treated differently for PRs vs master commits? Seems to me like they should be treated the same.
  • "the magnitude is computed from the relative change and the significance threshold." Why is the significance threshold involved here? It seems to me like it should not be involved.

@nnethercote (Contributor)

I also wonder if "interesting" is a better word than "relevant". Its meaning seems more obvious, more directly indicating "worth paying attention to".

@Kobzol (Contributor, Author) commented on Apr 4, 2022

Well, those are some good questions :D But I'm probably not the one to answer them.

Confidence applies to all the runs within a category, i.e. it's an aggregate result. Though, a single result that is significant and has a magnitude >= small could be considered relevant? So maybe it has both single and aggregate meanings?

I would say that confidence only has an aggregate meaning. It's just that if a group of results contains some results with a very large magnitude, the whole group will be considered relevant (yeah, the names should be better aligned here).

Dodginess is ???

/// Whether we can trust this benchmark or not
fn is_dodgy(&self) -> bool {
    // If changes are judged significant only exceeding 0.2%, then the
    // benchmark as a whole is dodgy.
    self.significance_threshold() * 100.0 > 0.2
}

If confidence is one of MaybeRelevant, ProbablyRelevant or DefinitelyRelevant, perhaps it should be called relevance instead?

I agree.

Why are relevances treated differently for PRs vs master commits? Seems to me like they should be treated the same.

CC @rylev, who introduced the change AFAIK.

"the magnitude is computed from the relative change and the significance threshold." Why is the significance threshold involved here? It seems to me like it should not be involved.

Well, you need some historical data to judge whether a change is of small or large magnitude, I suppose. Using hardcoded thresholds probably wouldn't work for all benchmarks. This historical data is provided by the significance threshold, which is computed from the historical variance (rough illustration below).
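
As a rough illustration of that idea - the exact bucket boundaries below are made up, and only the principle of scaling by the significance threshold (i.e. by the benchmark's historical noise) follows the description above:

enum Magnitude {
    VerySmall,
    Small,
    Medium,
    Large,
    VeryLarge,
}

fn magnitude(relative_change: f64, significance_threshold: f64) -> Magnitude {
    // How many times larger than the noise threshold is the change?
    let over_threshold = relative_change.abs() / significance_threshold;
    match over_threshold {
        x if x < 1.5 => Magnitude::VerySmall,
        x if x < 3.0 => Magnitude::Small,
        x if x < 6.0 => Magnitude::Medium,
        x if x < 12.0 => Magnitude::Large,
        _ => Magnitude::VeryLarge,
    }
}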

@rylev (Member) commented on Apr 4, 2022

First, I'd like to note that we have a glossary for these terms (#1266 makes some changes to this glossary). I'm always in favor of simplifying, but we're likely always going to have a need for a definition of terms, so I'd like to keep the glossary up to date if possible.

the names should be better aligned here

100% agreement here 😊

Why are relevances treated differently for PRs vs master commits? Seems to me like they should be treated the same.

We've come a long way in our confidence in our automatic classification of perf runs as "interesting" (using the newly suggested term instead of "relevant"). It used to be a lot harder to tell whether a run was "interesting" or not. At the time it was decided that for try runs we should be more sensitive, since there was an explicit request asking whether a particular change introduces performance changes. Runs on master commits are instead held to a higher bar - we wanted to inform people of a performance change only if we were definitely sure there were real performance changes.

Well, you need some historical data to judge whether a change is of small or large magnitude I suppose

This is correct.

@rylev rylev merged commit 8ba8456 into master Apr 4, 2022
@rylev rylev deleted the summary-relevance branch April 4, 2022 07:51