Discrepancy in calculation on Table 2 in OffsetBias Paper #4

Open
steve2972 opened this issue Feb 5, 2025 · 1 comment

@steve2972

Hi, I was wondering how you guys calculated the averages seen in Table 2 of your paper.

| Model Name | LLM Bar | HHH | MT-Bench | Simple Average | Weighted Average (by dataset size) | Reported Average | Difference (%) |
|---|---|---|---|---|---|---|---|
| GPT-4o-0513 | 89.5 | 93.7 | 84.5 | 89.23333 | 85.53334 | 85.9 | 0.4268% |
| GPT-3.5-0613 | 64.7 | 86.4 | 62.3 | 71.13333 | 63.88491 | 64.4 | 0.7998% |
| Llama-8B-Instruct | 66.1 | 87.8 | 62.5 | 72.13333 | 64.27715 | 64.9 | 0.9597% |
| Base data | 82.6 | 83.7 | 78.6 | 81.63333 | 79.30165 | 79.5 | 0.2495% |
| OffsetBias | 91.9 | 90 | 88.4 | 90.1 | 88.85559 | 89.0 | 0.1623% |

The weighted average is calculated using

# Weight by LLM Bar, HHH, MT-Bench
# Total number of data = 3995
weighted_average = lambda a,b,c: 419/3995 * a + 221/3995 * b + 3355/3995 * c

If this discrepancy were caused by a rounding error, I would expect the difference (%) to be the same across all models.

Can you clarify how you calculated the metrics in your paper?

@parkjunsoo91
Collaborator

The number of examples for MT-Bench is actually 2,354 instead of 3,355. This should fix the calculation.
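
For reference, here is a quick sketch (not the script used for the paper) that recomputes the weighted averages with the corrected count, using the scores from the table in the original post:

# Corrected weights: 419 (LLM Bar) + 221 (HHH) + 2354 (MT-Bench) = 2994 examples
def weighted_average(llm_bar, hhh, mt_bench):
    return (419 * llm_bar + 221 * hhh + 2354 * mt_bench) / 2994

# Each result rounds to the average reported in Table 2
print(round(weighted_average(89.5, 93.7, 84.5), 1))  # 85.9 (GPT-4o-0513)
print(round(weighted_average(64.7, 86.4, 62.3), 1))  # 64.4 (GPT-3.5-0613)
print(round(weighted_average(66.1, 87.8, 62.5), 1))  # 64.9 (Llama-8B-Instruct)
print(round(weighted_average(82.6, 83.7, 78.6), 1))  # 79.5 (Base data)
print(round(weighted_average(91.9, 90.0, 88.4), 1))  # 89.0 (OffsetBias)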

To clarify the MT-Bench data processing:

  • We filtered out all dialogues with more than 2 turns or with "tie" labels, and we included both the "human" and "gpt4_pair" subsets. This gives counts of 1,284 and 1,070, respectively, which sum to 2,354 (see the sketch after this list). These counts appear in Table 1 as "MT-Bench Human n=2,568 / GPT4-Pair n=2,140", i.e., the same numbers multiplied by 2 to count both response swaps.
  • The section of the paper describing MT-Bench does not mention this and is rather misleading; we apologize for that.
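
For illustration, a minimal sketch of this filtering, assuming the public lmsys/mt_bench_human_judgments dataset with "human" and "gpt4_pair" splits and its "winner"/"conversation_a" fields (the split and field names here are assumptions; run_bench.py is the authoritative reference):

from datasets import load_dataset

def count_kept(split_name):
    data = load_dataset("lmsys/mt_bench_human_judgments", split=split_name)
    kept = [
        ex for ex in data
        # keep only single-exchange dialogues (assumed reading of "over 2 turns")
        if len(ex["conversation_a"]) <= 2
        # drop "tie" judgements; "tie (inconsistent)" was not dropped at paper time
        and ex["winner"] != "tie"
    ]
    return len(kept)

human = count_kept("human")          # 1,284 per the counts above
gpt4_pair = count_kept("gpt4_pair")  # 1,070 per the counts above
print(human + gpt4_pair)             # 2,354 in total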

Further clarification on the confusion:

  • The data processing code in run_bench.py reflects this filtering, although only partially. At the time of writing the paper, we excluded all "tie" labels as described above, but did not exclude "tie (inconsistent)" labels, which exist only in the "gpt4_pair" subset. This has some consequences:
    • The test data in the "gpt4_pair" subset had 188 examples with wrong labels: they all had the label "B" instead of a tie label. This distorts every model's "accuracy" score in the "gpt4_pair" column. "Pair agreement" scores are less affected because they do not use labels.
    • The current code fixes this issue and excludes "tie (inconsistent)" labels as well. As a result, running the code now gives a different data count for the "gpt4_pair" subset than the one reported in the paper (see the sketch below).
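
To make the label handling concrete, here is a hypothetical filter predicate (not the actual run_bench.py code) matching the current behaviour:

# The current code excludes both tie variants; at paper time only plain "tie"
# was excluded, so "tie (inconsistent)" examples slipped through.
def keep_example(winner):
    return winner not in ("tie", "tie (inconsistent)")

print(keep_example("tie"))                 # False
print(keep_example("tie (inconsistent)"))  # False now; was True at paper time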
