-
Notifications
You must be signed in to change notification settings - Fork 152
Do not post summary table to PR if both categories are not relevant #1270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
What does "relevant" mean? That's a third categorization after "significant" and "dodgy", I'm totally confused. |
So this is my understanding:
After this first pass is performed, an additional check is done.
Before, the summary table was shown even if neither category was found to be relevant. After this PR, the summary table will be shown only if at least one of the categories was found to be relevant. |
Thanks for the detailed description. IIUC:
Some observations and questions:
|
I also wonder if "interesting" is a better word than "relevant". It's meaning seems more obvious, more directly indicating "worth paying attention to". |
Well, those are some good questions :D But I'm probably not the one to answer them.
I would say that confidence only has an aggregate meaning. It's just that if a group of results contains some results with a very large magnitude, the whole group will be considered to be relevant (yeah the names should be better aligned here).
/// Whether we can trust this benchmark or not
fn is_dodgy(&self) -> bool {
// If changes are judged significant only exceeding 0.2%, then the
// benchmark as a whole is dodgy.
self.significance_threshold() * 100.0 > 0.2
}
I agree.
CC @rylev, who introduced the change AFAIK.
Well, you need some historical data to judge whether a change is of small or large magnitude I suppose. Using hardcoded constraints probably wouldn't work for all benchmarks. This historical data is provided by the significance threshold, which is computed from historical variance. |
First, I'd like to note that we have a glossary for these terms (#1266 makes some changes to this glossary). I'm always in favor of simplifying, but we're likely always going to have a need for a definition of terms, so I'd like to keep the glossary up to date if possible.
100% agreement here 😊
We've come a long way in our confidence in our automatic classification of perf runs as "interesting" (using the newly suggested term instead of "relevant"). It used to be a lot harder to tell whether a run was "interesting" or not. At the time it was decided that for try runs, we should be extra picky since there was an explicit request asking whether a particular change introduces performance changes. Runs on master commits instead are less picky - we wanted to be 100% sure that we would inform people of a performance change only if we were definitely sure there were real performance changes.
This is correct. |
Without this check, something like this could happen, which seems confusing to readers. Nothing was relevant, yet the table was shown.