LLaMAMoE fixes #2014
base: main
Conversation
btw, can we add a test for this fix?
Hi, can I clarify what the test will look like?
Something that would "justify" this change, i.e. a case that would be failing before and passing now, so we won't accidentally revert this change.
On second thought, the current implementation passes all tests and the logits are close enough to the HF model's. There is a slight deviation, but ultimately it does not actually affect the model, so maybe we can close this PR.
Sure. My thinking is: this is a cool fix, so what prevents us from accidentally reverting it, since all tests would be passing?
If you do the math, it seems that the new method is just a particularly clumsy way to compute the same value, which is why you would not see much of a difference.
Unless it is causing real problems in real implementations, we should keep the way we currently have with topk first and then softmax.
Edit: If it is causing real problems, we can change it (e.g. we did adapt to some Llama version needing specific casting for the RoPE cache at some point), but then we should add commentary.
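For reference, here is a short sketch of the algebra behind the comment above (notation is mine, not from either codebase). Let $z_1, \dots, z_n$ be the router logits and $S$ the index set of the top-$k$ logits. The topk-then-softmax ordering computes

$$w_i = \frac{e^{z_i}}{\sum_{j \in S} e^{z_j}},$$

while the softmax-then-topk-then-renormalize ordering computes $p_i = e^{z_i} / \sum_{j=1}^{n} e^{z_j}$ over all experts and then

$$w_i = \frac{p_i}{\sum_{j \in S} p_j} = \frac{e^{z_i}}{\sum_{j \in S} e^{z_j}}.$$

The full-softmax denominator cancels, and since softmax is monotonic, both orderings select the same top-$k$ set, so the routing weights agree up to floating-point error.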
I see now. Thanks for educating me!
Addresses #2013
Currently, there is a mismatch in the MoE block implementation between litgpt and HF for Mixtral models.
In HF, the softmax is applied to the router logits before the top-k experts are selected, and the selected routing weights are then L1-normalized. In litgpt, the softmax is performed after selecting the top-k experts, and the L1 normalization from the HF implementation is missing. While the selected experts are the same, the probability values differ and affect downstream calculations.
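For illustration, here is a minimal sketch of the two orderings being compared. The tensor shapes and variable names are assumptions for this example and are not taken from either codebase:

```python
import torch
import torch.nn.functional as F

# Hypothetical example: router logits for 4 tokens over 8 experts, selecting top-2.
router_logits = torch.randn(4, 8)
top_k = 2

# litgpt-style ordering: select the top-k logits first, then softmax over just those.
topk_logits, topk_idx = torch.topk(router_logits, top_k, dim=-1)
litgpt_weights = F.softmax(topk_logits, dim=-1)

# HF-style ordering: softmax over all logits, select top-k, then L1-normalize the selection.
probs = F.softmax(router_logits, dim=-1)
hf_weights, hf_idx = torch.topk(probs, top_k, dim=-1)
hf_weights = hf_weights / hf_weights.sum(dim=-1, keepdim=True)

# Softmax is monotonic, so both orderings pick the same experts, and the
# renormalization cancels the full-softmax denominator, so the weights
# match up to floating-point error.
assert torch.equal(topk_idx, hf_idx)
torch.testing.assert_close(litgpt_weights, hf_weights)
```

As the reviewer notes in the conversation above, the two orderings produce the same routing weights, so any difference observed against the HF model comes down to numerical noise rather than a behavioral change.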