The number of examples for MT-Bench is actually 2,354 instead of 3,355. This should fix the calculation.
To clarify the MT-Bench data processing:
We filtered out all dialogues with more than 2 turns or with "tie" labels, and included both the "human" and "gpt4_pair" subsets. This gives counts of 1,284 and 1,070, respectively, which sum to 2,354. These counts appear in Table 1 as "MT-Bench Human n=2,568 / GPT4-Pair n=2,140", i.e. the same numbers multiplied by 2 to count both answer-order swaps.
The section of the paper describing MT-Bench does not state this and is somewhat misleading. We apologize for that.
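For reference, here is a minimal sketch of the filtering described above, assuming the data comes from the `lmsys/mt_bench_human_judgments` dataset on Hugging Face with `turn` and `winner` fields; the actual logic lives in run_bench.py and may differ in detail:

```python
from datasets import load_dataset

# Sketch of the paper-time filtering described above (assumes the
# lmsys/mt_bench_human_judgments dataset with "human" and "gpt4_pair"
# splits); the real implementation is in run_bench.py.
ds = load_dataset("lmsys/mt_bench_human_judgments")

def keep(example):
    # Drop dialogues with more than 2 turns and exact "tie" labels.
    return example["turn"] <= 2 and example["winner"] != "tie"

human = ds["human"].filter(keep)
gpt4_pair = ds["gpt4_pair"].filter(keep)
# Per the description above, this should leave 1284 and 1070 examples.
print(len(human), len(gpt4_pair))
```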
To further clarify the confusion:
The data processing code in run_bench.py reflects this filtering, but only partially. At the time of writing the paper, we excluded all "tie" labels as described above, but did not exclude "tie (inconsistent)" labels, which only exist in the "gpt4_pair" subset. This has the following consequences:
The test data in the "gpt4_pair" subset contained 188 examples with wrong labels: they all carried the label "B" instead of a tie label. This distorts every model's "accuracy" score in the "gpt4_pair" column. The "pair agreement" scores are less affected because they do not use labels.
The current code fixes this issue and excludes "tie (inconsistent)" labels as well. As a result, running the code now yields a different data count for the "gpt4_pair" subset than the one reported in the paper.
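In code terms, the change amounts to broadening the label filter, roughly as follows (a sketch, not the literal diff in run_bench.py):

```python
# Paper-time filter: only exact "tie" labels were dropped, so
# "tie (inconsistent)" examples in the "gpt4_pair" subset slipped through.
def keep_old(example):
    return example["winner"] != "tie"

# Current filter: drop every tie-like label, including "tie (inconsistent)".
def keep_new(example):
    return not example["winner"].startswith("tie")
```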
Hi, I was wondering how you guys calculated the averages seen in Table 2 of your paper.
The weighted average is calculated using each subset's example count as its weight.
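As a rough sketch of that calculation (the scores below are placeholders, not numbers from the paper):

```python
# Weighted average over benchmark subsets, using each subset's example
# count as its weight. Scores are placeholders for illustration only.
counts = {"mt_bench_human": 2568, "mt_bench_gpt4_pair": 2140}
scores = {"mt_bench_human": 0.70, "mt_bench_gpt4_pair": 0.65}

weighted_avg = sum(counts[k] * scores[k] for k in counts) / sum(counts.values())
print(f"weighted average = {weighted_avg:.4f}")
```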
If this discrepancy were caused by a rounding error, I would expect the percentage difference to be the same across all models.
Can you clarify how you calculated the metrics in your paper?