LLaMAMoE fixes #2014
base: main
Conversation
btw, can we add a test for this fix?
Hi, can I clarify what the test will look like?
Something that would "justify" this change, i.e. a case that would be failing before and passing now, so we won't accidentally revert this change.
On second thought, the current implementation passes all tests and the logits are close enough to the HF model's. There is a slight deviation, but ultimately it does not actually affect the model, so maybe we can close this PR.
Sure. My thinking is: this is a cool fix, so what prevents us from accidentally reverting it, since all tests would be passing?
If you do the math, it seems that the new method is just a particularly clumsy way to compute the same value, which is why you would not see much of a difference.
Unless it is causing real problems in real implementations, we should keep the way we currently have with topk first and then softmax.
Edit: If it is causing real problems, we can change it (e.g. we did adapt to some Llama version needing specific casting for the RoPE cache at some point), but then we should add commentary.
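For reference, here is a short sketch of the algebra behind the comment above (notation is mine, not from either codebase). Let $z_1, \dots, z_n$ be the router logits and $S$ the index set of the top-$k$ logits. The topk-then-softmax ordering computes

$$w_i = \frac{e^{z_i}}{\sum_{j \in S} e^{z_j}},$$

while the softmax-then-topk-then-renormalize ordering computes $p_i = e^{z_i} / \sum_{j=1}^{n} e^{z_j}$ over all experts and then

$$w_i = \frac{p_i}{\sum_{j \in S} p_j} = \frac{e^{z_i}}{\sum_{j \in S} e^{z_j}}.$$

The full-softmax denominator cancels, and since softmax is monotonic, both orderings select the same top-$k$ set, so the routing weights agree up to floating-point error.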
I see now. Thanks for educating me!
Addresses #2013
Currently, there is a mismatch in the MoE block implementation between litgpt and HF for Mixtral models.
In HF, the softmax is applied to the router logits before the top-k experts are selected, and the selected routing weights are then L1-normalized. In litgpt, the softmax is performed after selecting the top-k experts, and the L1 normalization from the HF implementation is missing. While the selected experts are the same, the probability values differ and affect downstream calculations.
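For illustration, here is a minimal sketch of the two orderings being compared. The tensor shapes and variable names are assumptions for this example and are not taken from either codebase:

```python
import torch
import torch.nn.functional as F

# Hypothetical example: router logits for 4 tokens over 8 experts, selecting top-2.
router_logits = torch.randn(4, 8)
top_k = 2

# litgpt-style ordering: select the top-k logits first, then softmax over just those.
topk_logits, topk_idx = torch.topk(router_logits, top_k, dim=-1)
litgpt_weights = F.softmax(topk_logits, dim=-1)

# HF-style ordering: softmax over all logits, select top-k, then L1-normalize the selection.
probs = F.softmax(router_logits, dim=-1)
hf_weights, hf_idx = torch.topk(probs, top_k, dim=-1)
hf_weights = hf_weights / hf_weights.sum(dim=-1, keepdim=True)

# Softmax is monotonic, so both orderings pick the same experts, and the
# renormalization cancels the full-softmax denominator, so the weights
# match up to floating-point error.
assert torch.equal(topk_idx, hf_idx)
torch.testing.assert_close(litgpt_weights, hf_weights)
```

As the reviewer notes in the conversation above, the two orderings produce the same routing weights, so any difference observed against the HF model comes down to numerical noise rather than a behavioral change.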