
grant/kol 7429 programmatic retrieval of significance #705

Merged

Conversation

grant-Kolena
Contributor

Linked issue(s)

What change does this PR introduce and why?

  • Add functionality to display the performance delta when using download_quality_standard_result
  • Added a test

Please check if the PR fulfills these requirements

  • Include reference to internal ticket and/or GitHub issue "Fixes #NNNN" (if applicable)
  • Relevant tests for the changes have been added
  • Relevant docs have been added / updated

@grant-Kolena grant-Kolena requested a review from a team as a code owner October 8, 2024 22:10
"""


def z_score_to_probability(z: float) -> float:
Contributor Author

Copied and rewritten from kolena FE's implementation of the same functionality
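
For context, a minimal sketch of what such a z-score-to-confidence conversion typically looks like, assuming a two-sided interpretation against the standard normal distribution (an assumption for illustration, not necessarily the exact logic merged here):

import math

def z_score_to_probability(z: float) -> float:
    # Two-sided confidence level for a z-score under the standard normal
    # distribution, e.g. z = 1.96 -> ~0.95. Illustrative sketch only.
    return math.erf(abs(z) / math.sqrt(2.0))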

@munkyshi munkyshi changed the title grant/kol 7429 rad ai programmatic retrieval of significance grant/kol 7429 programmatic retrieval of significance Oct 9, 2024
from kolena.errors import IncorrectUsageError


class PerformanceDelta(str, enum.Enum):
Contributor

Do these serialize well to CSV? Will users ever be expected to know about PerformanceDelta? In general, for anything that a user deals with, I'd prefer to use something like

PerformanceDelta = Literal["improved", "regressed", "similar", "unknown"]

So that the user doesn't need to know about some random enum type, and can instead just work with things as strings.

Contributor Author

I can confirm that export to CSV works fine, but I will still change it to Literal so the user doesn't have to see the enum.
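
To illustrate the difference being discussed, here is a side-by-side sketch; the enum member names are assumed from the literal values above, not copied from the PR:

import enum
from typing import Literal

# Enum variant: subclassing str means members still serialize as plain
# strings in CSV/JSON output (member names assumed for illustration).
class PerformanceDeltaEnum(str, enum.Enum):
    IMPROVED = "improved"
    REGRESSED = "regressed"
    SIMILAR = "similar"
    UNKNOWN = "unknown"

# Literal alias suggested in review: callers just work with strings.
PerformanceDelta = Literal["improved", "regressed", "similar", "unknown"]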

@munkyshi
Contributor

munkyshi commented Oct 9, 2024

Can you share an example of how to run this? I'm trying to figure out how to best set up a dataset / QS for this -- is there an example that you already have set up that I can play with?

@grant-Kolena
Contributor Author

Can you share an example of how to run this? I'm trying to figure out how to best set up a dataset / QS for this -- is there an example that you already have set up that I can play with?

Yes, I have played around with this dataset on my tenant

from kolena._experimental import download_quality_standard_result
qs_result = download_quality_standard_result("attack_on_llms_small", ["databricks_dbrx-instruct", "gpt-4-turbo"], confidence_level = 0.95, reference_model = "gpt-4-turbo")
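
Assuming the result comes back as a pandas DataFrame whose column MultiIndex mirrors the header rows in the CSV dump later in this thread (model / eval_config / metric_group / metric / type), the significance columns could be pulled out like this; a sketch based on that assumption, not documented API behavior:

# qs_result is the DataFrame from the call above; the last column level is
# assumed to distinguish metric_value from performance_delta.
delta_columns = [col for col in qs_result.columns if col[-1] == "performance_delta"]
print(qs_result[delta_columns])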

Comment on lines 303 to 306
:param reference_eval_config: The evaluation configuration to use in conjunction with the reference model,
if unspecified the first model in the models list will be used
as a reference for the Margin of Error, for None eval config, pass in 'null' instead of None. If this is None,
the first eval config will be used.
Contributor

Suggested change
:param reference_eval_config: The evaluation configuration to use in conjunction with the reference model,
if unspecified the first model in the models list will be used
as a reference for the Margin of Error, for None eval config, pass in 'null' instead of None. If this is None,
the first eval config will be used.
:param reference_eval_config: The evaluation configuration to use in conjunction with the `reference_model`.
If this is `None`, the first eval config will be used. If you wish to use an evaluation configuration of `None`, please instead specify the configuration as `'null'`.

Also, it seems a bit odd to make the user specify null instead of using None; would it make sense to instead define it as:

reference_eval_config: Optional[Union[Dict[str, Any], Literal["first"]]] = "first",

Contributor Author

Yep, that makes sense; I was just trying to distinguish between null and None.
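
A rough sketch of how that "first" default could be resolved, with names chosen for illustration rather than taken from the PR, so that an explicit None remains a legitimate eval config:

from typing import Any, Dict, List, Literal, Optional, Union

def resolve_reference_eval_config(
    reference_eval_config: Optional[Union[Dict[str, Any], Literal["first"]]],
    eval_configs: List[Optional[Dict[str, Any]]],
) -> Optional[Dict[str, Any]]:
    # "first" is the sentinel default, so None keeps its meaning as a real
    # eval config value instead of meaning "unspecified".
    if reference_eval_config == "first":
        return eval_configs[0] if eval_configs else None
    return reference_eval_config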

Comment on lines 203 to 206
if reference_model == model and not reference_eval_config:
reference_eval_config = eval_config
Contributor

Does this mean that the first eval_config will be skipped if it is None?

# Initially, reference_model is None and reference_eval_config is None
# First pass, model is 'foo' and eval_config is None
reference_model = 'foo'
reference_eval_config = None
# Second pass, model is 'foo' and eval_config is {'bar': 'baz'}
reference_eval_config = {'bar': 'baz'}

Also, does this get handled safely if the only eval config for the reference model is None?

Contributor Author

Originally I was expecting the user to pass in "null" so it wouldn't be ignored; I will have to handle this case now. This should work as expected when the only eval config is None, since the dataset I used for testing is like that.
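
One way to handle this, sketched here with illustrative data mirroring the foo/bar/baz example above (not the actual code in the PR), is a dedicated sentinel so an eval config of None is not mistaken for "not yet set":

_UNSET = object()

reference_model = "foo"
model_eval_pairs = [("foo", None), ("foo", {"bar": "baz"})]  # illustrative data

reference_eval_config = _UNSET
for model, eval_config in model_eval_pairs:
    if model == reference_model and reference_eval_config is _UNSET:
        reference_eval_config = eval_config  # stays None after the first pass

print(reference_eval_config)  # prints None, not {'bar': 'baz'}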

@munkyshi
Contributor

munkyshi commented Oct 9, 2024

from kolena._experimental import download_quality_standard_result
qs_result = download_quality_standard_result("attack_on_llms_small", ["databricks_dbrx-instruct", "gpt-4-turbo"], confidence_level = 0.95, reference_model = "gpt-4-turbo")

I tried playing around with this, and it looks like the colors aren't matching with the webapp (link)

One example (from when I dumped this to CSV):

model,,databricks_dbrx-instruct,databricks_dbrx-instruct,gpt-4-turbo,gpt-4-turbo,databricks_dbrx-instruct,databricks_dbrx-instruct,gpt-4-turbo,gpt-4-turbo
eval_config,,null,null,null,null,null,null,null,null
metric_group,,Metric Group,Metric Group,Metric Group,Metric Group,Metric Group,Metric Group,Metric Group,Metric Group
metric,,rate rejected,Mean Metadata.count,rate rejected,Mean Metadata.count,rate rejected,Mean Metadata.count,rate rejected,Mean Metadata.count
type,,metric_value,metric_value,metric_value,metric_value,performance_delta,performance_delta,performance_delta,performance_delta
stratification,test_case,,,,,,,,
...
type,Raw,0.03333333333333333,,0.1,0.0,regressed,unknown,similar,similar
type,Role Playing,0.15555555555555556,0.6666666666666666,0.12222222222222222,,improved,unknown,similar,unknown

vs webapp (screenshot "Screenshot 2024-10-09 at 17 34 46" omitted)

The endpoint seems to indicate regressed and improved, but there's no coloration on the webapp (both at confidence level of 0.95)

@grant-Kolena
Contributor Author


Weird, the algorithm is exactly the same; I will calculate it again to confirm. Could it be that the color on the webapp is too light to see?

@grant-Kolena grant-Kolena force-pushed the grant/kol-7429-rad-ai-programmatic-retrieval-of-significance branch from 9a105d4 to b5f4e6b Compare October 10, 2024 15:52
@grant-Kolena
Contributor Author


This matches the UI now. I'm not sure what type of test would be meaningful for these mathematical calculations; it is hard to identify edge cases here. We already have a test case checking that the produced performance delta matches the expected result.

@grant-Kolena grant-Kolena requested a review from munkyshi October 10, 2024 21:06
@munkyshi
Contributor


What was the issue, the check for `and max_value > 0:`? Is there anything we can do to test around that case specifically, since it was something that was missed? The change looks good otherwise.

Contributor

@munkyshi munkyshi left a comment


Approving to unblock, pending test to track the latest (now fixed) issue.

@grant-Kolena
Contributor Author


I missed the part where the max value for the performance delta is the max across the stratification, except when the max value is 0 or less.
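
For what it's worth, a self-contained sketch of the kind of regression test that could pin down that branch; the helper is a stand-in written for illustration, since the real normalization logic lives in the PR:

def stratification_max(values):
    # Stand-in for the rule described above: use the max across the
    # stratification, except when that max is 0 or less.
    max_value = max(v for v in values if v is not None)
    return max_value if max_value > 0 else None

def test_stratification_max_handles_non_positive_values() -> None:
    assert stratification_max([0.2, 0.5, None]) == 0.5
    assert stratification_max([0.0, -0.5]) is None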

@grant-Kolena grant-Kolena merged commit 42ec3d2 into trunk Oct 11, 2024
57 checks passed
@grant-Kolena grant-Kolena deleted the grant/kol-7429-rad-ai-programmatic-retrieval-of-significance branch October 11, 2024 13:54
@grant-Kolena
Contributor Author

Approving to unblock, pending test to track the latest (now fixed) issue.

Created this ticket to track it.
