Questions about evaluation results #46
Thanks for your work!

I'm curious about the evaluation results you reported for Qwen-2.5-Math-7B-Instruct and hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero. I used your script to evaluate both models and got the following results:

[results table not preserved]

The gaps are large enough that they would diminish the reported improvement of Qwen2.5-7B-SimpleRL-Zero when compared against Qwen-2.5-Math-7B-Instruct.

Comments

I got similar results.

Could you please provide your detailed evaluation parameters so we can see where the problem lies?

Thanks for your response. I just used the evaluation script as described in the README and got results that differ from those reported in the Notion page. #Qwen2.5-Math-Instruct Series #Qwen2.5-Math-7B-Instruct

I'm using the same script! @Zeng-WH

I got similar results. @Zeng-WH
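Much of the back-and-forth above comes down to which evaluation parameters were used. As a point of reference, below is a minimal sketch of the settings worth reporting alongside scores: the prompt template and the decoding parameters. It assumes a vLLM-based setup; the model path, template, and parameter values are illustrative assumptions, not the repository's actual evaluation script.

```python
# Minimal illustrative sketch (NOT the repository's eval script): when comparing
# numbers across runs, record every knob that can shift benchmark scores --
# prompt template, decoding parameters, and max generation length.
from vllm import LLM, SamplingParams

# Hypothetical model path; swap in a local SimpleRL-Zero checkpoint to compare.
MODEL_PATH = "Qwen/Qwen2.5-Math-7B-Instruct"

# Decoding parameters: greedy vs. sampled decoding alone can move
# MATH/GSM8K accuracy by several points, so report them explicitly.
sampling_params = SamplingParams(
    temperature=0.0,   # greedy decoding
    top_p=1.0,
    max_tokens=3000,   # long enough for full chain-of-thought solutions
)

# Prompt template also matters: chat-style vs. plain completion prompts
# often explain gaps between reported and reproduced scores.
PROMPT_TEMPLATE = (
    "<|im_start|>system\nPlease reason step by step, and put your final "
    "answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>user\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def generate_solutions(questions):
    """Generate one solution per question with the parameters above."""
    llm = LLM(model=MODEL_PATH)
    prompts = [PROMPT_TEMPLATE.format(question=q) for q in questions]
    outputs = llm.generate(prompts, sampling_params)
    return [o.outputs[0].text for o in outputs]

if __name__ == "__main__":
    print(generate_solutions(["What is 12 * 13?"]))
```

When reproduced numbers differ from reported ones, pinning down the decoding mode (greedy vs. sampling), the max generation length, and the exact chat template is usually the first step in locating the discrepancy.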