Inquiry about the RoPE difference between FAESM and ESM2 #13
Comments
Thanks for posting it. Flash attention actually unpads the sequences. Say x has shape [B, L] normally for ESM2, but it contains some padding tokens, so in flash attention we get rid of them and get x_unpad of shape [N], where N is the total number of valid tokens across the B samples. If you print the shapes from flash attention and ESM2, they're different, so make sure you compare the right elements. Also, you want to compute the error, i.e. the diff, instead of staring at the numbers.
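Something like this minimal sketch of the unpadding idea (plain boolean masking here, with toy shapes and illustrative names; not flash-attn's actual kernel code):

```python
import torch

B, L, D = 2, 5, 8
x = torch.randn(B, L, D)                      # padded batch, shape [B, L, D]
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]], dtype=torch.bool)  # 1 = real token, 0 = pad

x_unpad = x[mask]                             # shape [N, D]: only valid tokens survive
print(x.shape, x_unpad.shape)                 # torch.Size([2, 5, 8]) vs torch.Size([8, 8])
```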
Thanks for your response! Actually, I compared the exact elements, because I indexed the input embeddings from the unpadded q and k generated by the FAESM scripts. The numerical errors I focused on indicate the theoretical compatibility of using ESM2's parameters, which were trained under a different RoPE method. For my task of fine-tuning ESM2's parameters, I prefer to use the original RoPE method from ESM2 to stay faithful to the original parameters, even though this approach might slow down the embedding process. Nevertheless, I will compare the two embedding methods and look forward to sharing the results with you. In any case, FAESM is an excellent work that contributes to protein understanding and the AI4Bio community. Thanks!
Yes, you want to report the max abs diff to see the error. BTW, I trained all my models with FAESM and it works pretty well.
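For example, a tiny helper along these lines (the commented-out tensor names are placeholders for the two implementations' outputs):

```python
import torch

def max_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    """Largest element-wise deviation between two tensors."""
    return (a - b).abs().max().item()

# Compare the same valid tokens from both implementations, e.g.:
# print(f"max abs diff: {max_abs_diff(q_faesm, q_esm2):.3e}")
```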
Thanks for your excellent work!
When applying the RotaryEmbedding before the attention score calculation, I noticed that the resulting q and k differed from those generated by the vanilla ESM2 implementation with its original RoPE embedding method.
Although both embeddings seem to be implemented correctly, the differences may not be acceptable when using pretrained parameters from ESM2.
Here are the implementation details:
At line 161 in esm.py, during a protein embedding inference task (batch_size=2), I generated the q and k with FAESM's RoPE following FAESM's scripts, and printed partial outputs.
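For reference, the rotary application on the unpadded tokens looks roughly like this minimal sketch, assuming an interleaved (GPT-J-style) convention for the flash-attention path; the convention, names, and shapes are all illustrative assumptions, not FAESM's exact code:

```python
import torch

# Interleaved (GPT-J-style) rotary on unpadded tokens -- one convention
# used by flash-attention kernels. Shapes and names are toy/illustrative.
def apply_rotary_interleaved(x, cos, sin):
    # x: [N, n_heads, head_dim]; cos/sin: [N, head_dim // 2]
    x1, x2 = x[..., 0::2], x[..., 1::2]           # even/odd feature pairs
    cos, sin = cos[:, None, :], sin[:, None, :]   # broadcast over heads
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

N, n_heads, head_dim = 8, 2, 16                   # N = total unpadded tokens
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = torch.outer(torch.arange(N).float(), inv_freq)   # [N, head_dim // 2]
q = torch.randn(N, n_heads, head_dim)
q_rot = apply_rotary_interleaved(q, freqs.cos(), freqs.sin())
```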
Subsequently, with the same input, I generated the embeddings using ESM2's RoPE method.
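ESM2's original method uses the rotate_half (GPT-NeoX-style) form instead; here is a minimal sketch patterned after esm/rotary_embedding.py, with toy shapes rather than the exact inference script:

```python
import torch

# ESM2's original RoPE (rotate_half / GPT-NeoX style): the head dim is split
# into two contiguous halves rather than interleaved even/odd pairs.
def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_esm(x, cos, sin):
    # x: [N, n_heads, head_dim]; cos/sin: [N, head_dim] (freqs duplicated)
    return x * cos[:, None, :] + rotate_half(x) * sin[:, None, :]

N, n_heads, head_dim = 8, 2, 16
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = torch.outer(torch.arange(N).float(), inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)            # [N, head_dim]
q = torch.randn(N, n_heads, head_dim)
q_rot = apply_rotary_esm(q, emb.cos(), emb.sin())
```

Given the same q, the two sketches agree only at position 0, where every rotation angle is zero.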
Unfortunately, I observed differences between the two outputs.
The embedded features differ significantly except for the first row: at position 0 every rotation angle is zero (cos = 1, sin = 0), so RoPE is the identity under either convention, and the cosine and sine table differences between the two methods cannot show up there.
I'm wondering whether using the speed-optimized FAESM poses a risk due to its deviation from the original ESM implementation. Thanks again for your work.