
grant/kol 7429 programmatic retrieval of significance #705

Merged

Conversation

grant-Kolena
Contributor

Linked issue(s)

What change does this PR introduce and why?

  • Add functionality to display the performance delta when using download_quality_standard_result
  • Added a test

Please check if the PR fulfills these requirements

  • Include reference to internal ticket and/or GitHub issue "Fixes #NNNN" (if applicable)
  • Relevant tests for the changes have been added
  • Relevant docs have been added / updated

@grant-Kolena grant-Kolena requested a review from a team as a code owner October 8, 2024 22:10
"""


def z_score_to_probability(z: float) -> float:
Contributor Author

Copied and rewritten from kolena FE's implementation of the same functionality
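
For context, a minimal sketch of what such a z-score-to-confidence conversion typically looks like, assuming a two-sided interpretation against the standard normal distribution (an assumption for illustration, not necessarily the exact logic merged here):

import math

def z_score_to_probability(z: float) -> float:
    # Two-sided confidence level for a z-score under the standard normal
    # distribution, e.g. z = 1.96 -> ~0.95. Illustrative sketch only.
    return math.erf(abs(z) / math.sqrt(2.0))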

@munkyshi munkyshi changed the title grant/kol 7429 rad ai programmatic retrieval of significance grant/kol 7429 programmatic retrieval of significance Oct 9, 2024
from kolena.errors import IncorrectUsageError


class PerformanceDelta(str, enum.Enum):
Contributor

Do these serialize well to CSV? Will users ever be expected to know about PerformanceDelta? In general, for anything that a user deals with, I'd prefer to use something like

PerformanceDelta = Literal["improved", "regressed", "similar", "unknown"]

So that the user doesn't need to know about some random enum type, and can instead just work with things as strings.

Contributor Author

I can confirm that export to CSV works fine, but I will still change it to Literal so the user doesn't have to see the enum.
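
To illustrate the difference being discussed, here is a side-by-side sketch; the enum member names are assumed from the literal values above, not copied from the PR:

import enum
from typing import Literal

# Enum variant: subclassing str means members still serialize as plain
# strings in CSV/JSON output (member names assumed for illustration).
class PerformanceDeltaEnum(str, enum.Enum):
    IMPROVED = "improved"
    REGRESSED = "regressed"
    SIMILAR = "similar"
    UNKNOWN = "unknown"

# Literal alias suggested in review: callers just work with strings.
PerformanceDelta = Literal["improved", "regressed", "similar", "unknown"]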

@munkyshi
Contributor

munkyshi commented Oct 9, 2024

Can you share an example of how to run this? I'm trying to figure out how to best set up a dataset / QS for this -- is there an example that you already have set up that I can play with?

@grant-Kolena
Contributor Author

Can you share an example of how to run this? I'm trying to figure out how to best set up a dataset / QS for this -- is there an example that you already have set up that I can play with?

Yes, I have played around with this dataset on my tenant

from kolena._experimental import download_quality_standard_result
qs_result = download_quality_standard_result("attack_on_llms_small", ["databricks_dbrx-instruct", "gpt-4-turbo"], confidence_level = 0.95, reference_model = "gpt-4-turbo")
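
Assuming the result comes back as a pandas DataFrame whose column MultiIndex mirrors the header rows in the CSV dump later in this thread (model / eval_config / metric_group / metric / type), the significance columns could be pulled out like this; a sketch based on that assumption, not documented API behavior:

# qs_result is the DataFrame from the call above; the last column level is
# assumed to distinguish metric_value from performance_delta.
delta_columns = [col for col in qs_result.columns if col[-1] == "performance_delta"]
print(qs_result[delta_columns])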

Comment on lines 303 to 306
:param reference_eval_config: The evaluation configuration to use in conjunction with the reference model,
if unspecified the first model in the models list will be used
as a reference for the Margin of Error, for None eval config, pass in 'null' instead of None. If this is None,
the first eval config will be used.
Contributor

Suggested change
:param reference_eval_config: The evaluation configuration to use in conjunction with the reference model,
if unspecified the first model in the models list will be used
as a reference for the Margin of Error, for None eval config, pass in 'null' instead of None. If this is None,
the first eval config will be used.
:param reference_eval_config: The evaluation configuration to use in conjunction with the `reference_model`.
If this is `None`, the first eval config will be used. If you wish to use an evaluation configuration of `None`, please instead specify the configuration as `'null'`.

Also, it seems a bit odd to make the user specify null instead of using None; would it make sense to instead define it as:

reference_eval_config: Optional[Union[Dict[str, Any], Literal["first"]]] = "first",

Contributor Author

Yep, that makes sense; I was just trying to distinguish between null and None.
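
A rough sketch of how that "first" default could be resolved, with names chosen for illustration rather than taken from the PR, so that an explicit None remains a legitimate eval config:

from typing import Any, Dict, List, Literal, Optional, Union

def resolve_reference_eval_config(
    reference_eval_config: Optional[Union[Dict[str, Any], Literal["first"]]],
    eval_configs: List[Optional[Dict[str, Any]]],
) -> Optional[Dict[str, Any]]:
    # "first" is the sentinel default, so None keeps its meaning as a real
    # eval config value instead of meaning "unspecified".
    if reference_eval_config == "first":
        return eval_configs[0] if eval_configs else None
    return reference_eval_config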

Comment on lines 203 to 206
if reference_model == model and not reference_eval_config:
reference_eval_config = eval_config
Contributor

Does this mean that the first eval_config will be skipped if it is None?

# Initially, reference_model is None and reference_eval_config is None
# First pass, model is 'foo' and eval_config is None
reference_model = 'foo'
reference_eval_config = None
# Second pass, model is 'foo' and eval_config is {'bar': 'baz'}
reference_eval_config = {'bar': 'baz'}

Also, does this get handled safely if the only eval config for the reference model is None?

Contributor Author

Originally I was expecting the user to pass in "null" so it wouldn't be ignored; I will have to handle this case now. This should work as expected when the only eval config is None, since the dataset I used for testing is like that.
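
One way to handle this, sketched here with illustrative data mirroring the foo/bar/baz example above (not the actual code in the PR), is a dedicated sentinel so an eval config of None is not mistaken for "not yet set":

_UNSET = object()

reference_model = "foo"
model_eval_pairs = [("foo", None), ("foo", {"bar": "baz"})]  # illustrative data

reference_eval_config = _UNSET
for model, eval_config in model_eval_pairs:
    if model == reference_model and reference_eval_config is _UNSET:
        reference_eval_config = eval_config  # stays None after the first pass

print(reference_eval_config)  # prints None, not {'bar': 'baz'}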

@munkyshi
Contributor

munkyshi commented Oct 9, 2024

from kolena._experimental import download_quality_standard_result
qs_result = download_quality_standard_result("attack_on_llms_small", ["databricks_dbrx-instruct", "gpt-4-turbo"], confidence_level = 0.95, reference_model = "gpt-4-turbo")

I tried playing around with this, and it looks like the colors aren't matching with the webapp (link)

One example (from when I dumped this to CSV):

model,,databricks_dbrx-instruct,databricks_dbrx-instruct,gpt-4-turbo,gpt-4-turbo,databricks_dbrx-instruct,databricks_dbrx-instruct,gpt-4-turbo,gpt-4-turbo
eval_config,,null,null,null,null,null,null,null,null
metric_group,,Metric Group,Metric Group,Metric Group,Metric Group,Metric Group,Metric Group,Metric Group,Metric Group
metric,,rate rejected,Mean Metadata.count,rate rejected,Mean Metadata.count,rate rejected,Mean Metadata.count,rate rejected,Mean Metadata.count
type,,metric_value,metric_value,metric_value,metric_value,performance_delta,performance_delta,performance_delta,performance_delta
stratification,test_case,,,,,,,,
...
type,Raw,0.03333333333333333,,0.1,0.0,regressed,unknown,similar,similar
type,Role Playing,0.15555555555555556,0.6666666666666666,0.12222222222222222,,improved,unknown,similar,unknown

vs webapp (screenshot "Screenshot 2024-10-09 at 17 34 46" omitted)

The endpoint seems to indicate regressed and improved, but there's no coloration on the webapp (both at confidence level of 0.95)

@grant-Kolena
Contributor Author


Weird, the algorithm is exactly the same; I will calculate it again to confirm. Could it be that the color on the webapp is too light to see?

@grant-Kolena grant-Kolena force-pushed the grant/kol-7429-rad-ai-programmatic-retrieval-of-significance branch from 9a105d4 to b5f4e6b Compare October 10, 2024 15:52
@grant-Kolena
Contributor Author


This matches the UI now. I'm not sure what type of test would be meaningful for these mathematical calculations; it is hard to identify edge cases here. We already have a test case checking that the produced performance delta matches the expected result.

@grant-Kolena grant-Kolena requested a review from munkyshi October 10, 2024 21:06
@munkyshi
Contributor


What was the issue, the check for `and max_value > 0:`? Is there anything we can do to test around that case specifically, since it was something that was missed? The change looks good otherwise.

Contributor

@munkyshi munkyshi left a comment


Approving to unblock, pending test to track the latest (now fixed) issue.

@grant-Kolena
Contributor Author


I missed the part where the max value for the performance delta is the max across the stratification, except when the max value is 0 or less.
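
For what it's worth, a self-contained sketch of the kind of regression test that could pin down that branch; the helper is a stand-in written for illustration, since the real normalization logic lives in the PR:

def stratification_max(values):
    # Stand-in for the rule described above: use the max across the
    # stratification, except when that max is 0 or less.
    max_value = max(v for v in values if v is not None)
    return max_value if max_value > 0 else None

def test_stratification_max_handles_non_positive_values() -> None:
    assert stratification_max([0.2, 0.5, None]) == 0.5
    assert stratification_max([0.0, -0.5]) is None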

@grant-Kolena grant-Kolena merged commit 42ec3d2 into trunk Oct 11, 2024
57 checks passed
@grant-Kolena grant-Kolena deleted the grant/kol-7429-rad-ai-programmatic-retrieval-of-significance branch October 11, 2024 13:54
@grant-Kolena
Contributor Author

Approving to unblock, pending test to track the latest (now fixed) issue.

Created this ticket to track it.
