grant/kol 7429 programmatic retrieval of significance #705
Conversation
""" | ||
|
||
|
||
def z_score_to_probability(z: float) -> float: |
Copied and rewritten from kolena FE's implementation of the same functionality
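The body of `z_score_to_probability` isn't shown in this thread. For context, converting a z-score to a probability is typically done with the standard normal CDF, which the stdlib can express via `math.erf`. A minimal sketch, not the actual kolena implementation (which may, e.g., return a two-sided p-value instead):

```python
import math

def z_score_to_probability(z: float) -> float:
    # Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    # Sketch only; the FE implementation referenced above may differ.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```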
from kolena.errors import IncorrectUsageError

class PerformanceDelta(str, enum.Enum):
Do these serialize well to CSV? Will users ever be expected to know about `PerformanceDelta`? In general, for anything that a user deals with, I'd prefer to use something like `PerformanceDelta = Literal["improved", "regressed", "similar", "unknown"]` so that the user doesn't need to know about some random enum type, and can instead just work with things as strings.
I can confirm export to CSV works fine, but I will still change to Literal so the user doesn't have to see the enum.
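For illustration, a sketch of the `Literal`-based alias being suggested; `deltas_to_csv_row` is a hypothetical helper, included only to show that plain strings serialize to CSV with no `.value` unwrapping:

```python
import csv
import io
from typing import List, Literal

PerformanceDelta = Literal["improved", "regressed", "similar", "unknown"]

def deltas_to_csv_row(deltas: List[PerformanceDelta]) -> str:
    # Hypothetical helper: strings go straight into the CSV writer,
    # unlike enum members, which would need explicit unwrapping.
    buffer = io.StringIO()
    csv.writer(buffer).writerow(deltas)
    return buffer.getvalue()

print(deltas_to_csv_row(["improved", "similar"]))  # improved,similar
```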
Can you share an example of how to run this? I'm trying to figure out how to best set up a dataset / QS for this -- is there an example that you already have set up that I can play with?
Yes, I have played around with this dataset on my tenant
:param reference_eval_config: The evaluation configuration to use in conjunction with the reference model,
    if unspecified the first model in the models list will be used
    as a reference for the Margin of Error, for None eval config, pass in 'null' instead of None. If this is None,
    the first eval config will be used.
Suggested change:

:param reference_eval_config: The evaluation configuration to use in conjunction with the `reference_model`.
    If this is `None`, the first eval config will be used. If you wish to use an evaluation configuration
    of `None`, please instead specify the configuration as `'null'`.
Also, seems a bit odd to make the user specify `null` instead of using `None` -- would it make sense to instead define it as `reference_eval_config: Optional[Union[Dict[str, Any], Literal["first"]]] = "first"`?
Yep, makes sense, I was just trying to distinguish between null and None.
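A sketch of the sentinel approach under discussion; `resolve_reference_eval_config` and `available` are hypothetical names, shown only to illustrate how a `"first"` default distinguishes "unspecified" from an explicit `None` eval config:

```python
from typing import Any, Dict, List, Literal, Optional, Union

EvalConfig = Optional[Dict[str, Any]]

def resolve_reference_eval_config(
    available: List[EvalConfig],
    requested: Union[EvalConfig, Literal["first"]] = "first",
) -> EvalConfig:
    # "first" is the default sentinel meaning "unspecified": fall back to the
    # first eval config of the reference model, which may itself be None.
    if requested == "first":
        return available[0]
    # A dict, or an explicit None, was deliberately passed by the caller.
    return requested
```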
if reference_model == model and not reference_eval_config:
    reference_eval_config = eval_config
Does this mean that the first `eval_config` will be skipped if it is `None`?
# initially reference_model is None and reference_eval_config is None
# First pass, model is 'foo' and eval_config is None
reference_model = 'foo'
reference_eval_config = None
# Second pass, model is 'foo' and eval_config is {'bar': 'baz'}
reference_eval_config = {'bar': 'baz'}
Also, does this get handled safely if the only eval config for the reference model is `None`?
Originally I was expecting the user to pass in "null" so it wouldn't be ignored; I will have to handle this case now. This should work as expected when the only eval config is None, as the dataset I used to test is like this.
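One common way to handle that case is a dedicated sentinel object, so an eval config of `None` is never confused with "not yet assigned". A sketch under assumed names (`model_results` mirrors the hypothetical walkthrough above):

```python
_UNSET = object()  # sentinel distinct from None

# Hypothetical (model, eval_config) pairs, as in the walkthrough above.
model_results = [("foo", None), ("foo", {"bar": "baz"})]

reference_model = None
reference_eval_config = _UNSET
for model, eval_config in model_results:
    if reference_model is None:
        reference_model = model
    if reference_model == model and reference_eval_config is _UNSET:
        # The first eval config of the reference model wins, even when it is None.
        reference_eval_config = eval_config

assert reference_eval_config is None  # the None config is no longer skipped
```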
I tried playing around with this, and it looks like the colors aren't matching with the webapp (link). One example (from when I dumped this to CSV): the endpoint seems to indicate regressed and improved, but there's no coloration on the webapp (both at a confidence level of 0.95).
Weird, the algorithm is exactly the same; I will calculate it again to confirm. Could it be that the color on the webapp is too light to see?
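For reference, the thread doesn't show kolena's exact significance formula; this is a generic sketch of how a margin of error is typically derived from a confidence level under a normal approximation for a proportion metric (the function name and formula are assumptions):

```python
import math
from statistics import NormalDist

def margin_of_error(p: float, n: int, confidence: float = 0.95) -> float:
    # Two-sided critical z value for the given confidence level;
    # inv_cdf(0.975) is about 1.96 at 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Normal-approximation margin of error for a proportion metric.
    return z * math.sqrt(p * (1.0 - p) / n)
```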
Force-pushed the …ality_standard_result branch from 9a105d4 to b5f4e6b (commits co-authored by Andrew Shi <andrew@kolena.io>).
This matches the UI now. I'm not sure what type of test would be meaningful for these mathematical calculations; it is hard to identify edge cases here. We already have a test case that checks whether the produced performance delta matches the expected result.
What was the issue, the check for …?
Approving to unblock, pending a test to track the latest (now fixed) issue.
I missed the part where the max value for the performance delta is the max across the stratification, except for when the max value is 0 or less.
Created this ticket to track.