The number of examples for MT-Bench is actually 2,354 instead of 3,355. This should fix the calculation.
To clarify the MT-Bench data processing:
We filtered out all dialogues with more than 2 turns or with "tie" labels, and included both the "human" and "gpt4_pair" subsets. This gives counts of 1,284 and 1,070, respectively, which sum to 2,354. These counts appear in Table 1 as "MT-Bench Human n=2,568 / GPT4-Pair n=2,140", i.e. the same numbers multiplied by 2 to count both answer-order swaps.
The section of the paper describing MT-Bench does not state this and is somewhat misleading. We apologize for that.
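For reference, here is a minimal sketch of the filtering described above, assuming the data comes from the `lmsys/mt_bench_human_judgments` dataset on Hugging Face with `turn` and `winner` fields; the actual logic lives in run_bench.py and may differ in detail:

```python
from datasets import load_dataset

# Sketch of the paper-time filtering described above (assumes the
# lmsys/mt_bench_human_judgments dataset with "human" and "gpt4_pair"
# splits); the real implementation is in run_bench.py.
ds = load_dataset("lmsys/mt_bench_human_judgments")

def keep(example):
    # Drop dialogues with more than 2 turns and exact "tie" labels.
    return example["turn"] <= 2 and example["winner"] != "tie"

human = ds["human"].filter(keep)
gpt4_pair = ds["gpt4_pair"].filter(keep)
# Per the description above, this should leave 1284 and 1070 examples.
print(len(human), len(gpt4_pair))
```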
To further clarify the confusion:
The data processing code in run_bench.py reflects this filtering, but only partially. At the time of writing the paper, we excluded all "tie" labels as described above, but did not exclude "tie (inconsistent)" labels, which only exist in the "gpt4_pair" subset. This has the following consequences:
The test data in the "gpt4_pair" subset contained 188 examples with wrong labels: they all carried the label "B" instead of a tie label. This distorts every model's "accuracy" score in the "gpt4_pair" column. The "pair agreement" scores are less affected because they do not use labels.
The current code fixes this issue and excludes "tie (inconsistent)" labels as well. As a result, running the code now yields a different data count for the "gpt4_pair" subset than the one reported in the paper.
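In code terms, the change amounts to broadening the label filter, roughly as follows (a sketch, not the literal diff in run_bench.py):

```python
# Paper-time filter: only exact "tie" labels were dropped, so
# "tie (inconsistent)" examples in the "gpt4_pair" subset slipped through.
def keep_old(example):
    return example["winner"] != "tie"

# Current filter: drop every tie-like label, including "tie (inconsistent)".
def keep_new(example):
    return not example["winner"].startswith("tie")
```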
Hi, I was wondering how you guys calculated the averages seen in Table 2 of your paper.
The weighted average is calculated using each subset's example count as its weight.
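As a rough sketch of that calculation (the scores below are placeholders, not numbers from the paper):

```python
# Weighted average over benchmark subsets, using each subset's example
# count as its weight. Scores are placeholders for illustration only.
counts = {"mt_bench_human": 2568, "mt_bench_gpt4_pair": 2140}
scores = {"mt_bench_human": 0.70, "mt_bench_gpt4_pair": 0.65}

weighted_avg = sum(counts[k] * scores[k] for k in counts) / sum(counts.values())
print(f"weighted average = {weighted_avg:.4f}")
```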
If this discrepancy were caused by a rounding error, I would expect the percentage difference to be the same across all models.
Can you clarify how you calculated the metrics in your paper?