Reproducing the flan_v2 results of T5-xl #80
Comments
@danczs Thanks for the question. A couple thoughts:
Sorry this could not be more helpful. It's hard to translate the internal code (which I no longer have access to) to external implementations. I would also note that my co-authors did A LOT of tuning and runs with the internal configuration to get the 52% number. Max performance can vary by 1-2% between runs on the same data, and between checkpoints within the same run you might see another 1-2% of variability even after it's converged. (Just something to keep in mind.) Best,
@shayne-longpre Thanks very much for your reply.
Thanks for your explanations, they help a lot.
@danczs Hmm, I'm not sure why it was so low. I noticed that a few recent papers seem to have gotten strong results with a 100k sample of the training data (e.g. https://arxiv.org/pdf/2306.04751.pdf), and their training code is public. Also, maybe Hyung Won's recent comments provide some insights here?
@danczs (original issue):
First, thanks for this excellent work. However, I ran into some problems when reproducing the results of T5-xl.
My setting is:
Pretrained model and optimizer:
I used the T5-v1_1-xl pretrained model and followed the training settings in "Scaling Instruction-Finetuned Language Models": batch size 64, dropout 0.05, learning rate 5e-4, 38K steps, Adafactor optimizer.
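For concreteness, a minimal sketch of that configuration using Hugging Face Transformers might look like the following. The original runs used internal code (see the reply above), so the checkpoint name, optimizer flags, and everything beyond the stated hyperparameters are assumptions here, not the authors' exact setup:

```python
# Sketch only: hyperparameters come from the message above; the model name and
# the remaining optimizer settings are assumptions.
from transformers import T5ForConditionalGeneration, AutoTokenizer
from transformers.optimization import Adafactor

MODEL_NAME = "google/t5-v1_1-xl"  # pretrained T5 v1.1 XL checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, dropout_rate=0.05)

# Constant learning rate of 5e-4 with Adafactor; scale_parameter/relative_step
# are disabled so the externally supplied LR is actually used.
optimizer = Adafactor(
    model.parameters(),
    lr=5e-4,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)

TOTAL_STEPS = 38_000  # ~38K fine-tuning steps
BATCH_SIZE = 64       # sequences per batch (before any example packing)
```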
Data:
For the data, I first used the training data provided by SirNeural and evaluated the model on MMLU. When I sampled the five datasets (i.e. cot, flanv2, t0, dialog, niv2) equally, I got 45% 5-shot accuracy on MMLU, which is similar to the w/o-mixture-balancing result in the paper. However, after I mixed the data with the suggested rates here, the accuracy did not improve (44%).
Afterwards, I tried the data provided by Enrico Shippole and mixed it following the suggested rates, but the accuracy became worse (42% on MMLU). I also tried a larger batch size (128, to account for batch packing) and deduplicating the data, neither of which helped noticeably.
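As a point of comparison, one way to implement mixture-rate sampling with the Hugging Face `datasets` library is sketched below. The file paths and rates are placeholders (substitute the actual sub-mixture files and the rates suggested for flan_v2); this is not the repo's own tooling:

```python
# Sketch of proportional mixing across the five sub-mixtures. All file names
# and rates below are placeholders, not the repo's suggested values.
from datasets import load_dataset, interleave_datasets

SUBMIX_FILES = {
    "cot":    "cot_data.jsonl",
    "flan":   "flan_data.jsonl",
    "t0":     "t0_data.jsonl",
    "dialog": "dialog_data.jsonl",
    "niv2":   "niv2_data.jsonl",
}

# Placeholder mixture rates; they must sum to 1.0.
MIX_RATES = {"cot": 0.05, "flan": 0.40, "t0": 0.30, "dialog": 0.05, "niv2": 0.20}

parts = [
    load_dataset("json", data_files=path, split="train", streaming=True)
    for path in SUBMIX_FILES.values()
]

mixed = interleave_datasets(
    parts,
    probabilities=[MIX_RATES[name] for name in SUBMIX_FILES],
    seed=42,
    stopping_strategy="all_exhausted",  # oversample smaller sub-mixtures instead of stopping early
)
```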
Are there any suggestions for reproducing the MMLU results of the released Flan-T5-xl model (49%), or even the results in the paper (52%)? Thanks a lot.
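Since the evaluation harness itself can shift MMLU scores by a few points, here is a rough sketch of one common way to score 5-shot MMLU with a seq2seq checkpoint, ranking the answer letters by model log-likelihood. The prompt format and scoring below are assumptions, not the official Flan evaluation setup:

```python
# Illustrative only: rank-classification scoring of MMLU options with a
# seq2seq model; not the official Flan evaluation pipeline.
import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

MODEL_NAME = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

def option_logprob(prompt: str, option: str) -> float:
    """Total log-probability the model assigns to `option` given `prompt`."""
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(option, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**enc, labels=labels)
    # out.loss is the mean per-token negative log-likelihood over the labels.
    return -out.loss.item() * labels.shape[1]

def predict(few_shot_prefix: str, question: str, choices: list[str]) -> str:
    """Pick the answer letter with the highest log-probability."""
    letters = ["A", "B", "C", "D"]
    prompt = (few_shot_prefix + question + "\n"
              + "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
              + "\nAnswer:")
    scores = [option_logprob(prompt, l) for l in letters]
    return letters[scores.index(max(scores))]
```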