
AssertionError in position embedding (potentially due to missing clear_cache between batches of data) #1340

Open
ameyagodbole opened this issue Feb 25, 2025 · 0 comments
Labels: bug (Something isn't working)

Describe the bug
I have been trying to use lm-evaluation-harness with gpt-neox/eval.py. As far as I can tell, the request types other than generate_until work fine. With generate_until, I hit the following assertion (in the position embedding module) after a couple of examples have been processed:

```python
assert seq_len <= self.max_seq_len
```

In my testing, the model is about to generate (say) token 48. I have verified that the token_index_to_generate in gpt-neox/megatron/text_generation_utils.py is in fact 48. But somehow RotaryEmbedding is trying to create an embedding for position 1025 (beyond the model_max_length).
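
For intuition, here is a hypothetical, simplified sketch (not the actual gpt-neox RotaryEmbedding; the class name and offset attribute are invented) of how a cached position offset that survives across examples could trip this assertion:

```python
# Hypothetical sketch, NOT the real gpt-neox RotaryEmbedding: it models only
# a cached position offset that persists between generate_until examples.
class CachedRotarySketch:
    def __init__(self, max_seq_len):
        self.max_seq_len = max_seq_len
        self.offset = 0  # becomes stale if not cleared between examples

    def forward(self, num_new_positions):
        seq_len = self.offset + num_new_positions
        assert seq_len <= self.max_seq_len  # the failing check
        self.offset = seq_len  # the cache advances with every call

    def clear_cache(self):
        self.offset = 0  # what the proposed fix would restore


rope = CachedRotarySketch(max_seq_len=1024)
rope.forward(977)  # example 1 ends at position 977: fine
# Without clear_cache(), example 2 resumes from offset 977, so generating
# its token 48 requests position 977 + 48 = 1025 -> AssertionError.
rope.forward(48)
```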

To Reproduce
I will fill in reproducible configs. Currently, I'm using a model with a custom config (but trained in neox) and evaluating on a QA dataset (where eval-harness uses generate_until).

Proposed solution
I suspect the issue is caused by a missing clear_cache() between batches of data. Adding model.module.clear_cache() at the start of stream_tokens in gpt-neox/megatron/text_generation_utils.py seems to fix it on my side; see the sketch below.
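
A minimal sketch of the proposed change, with the function signature abbreviated (the real stream_tokens takes more arguments than shown; only the added first line matters):

```python
# Sketch of the fix in gpt-neox/megatron/text_generation_utils.py.
def stream_tokens(neox_args, model, context_tokens, maximum_tokens):
    # Proposed fix: drop cached inference state left over from the previous
    # batch, mirroring the clear_cache() call that
    # generate_samples_interactive already makes before generating.
    model.module.clear_cache()
    # ... existing generation loop continues unchanged ...
```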

I am unsure whether this is correct, or whether it is a complete fix: the same clear_cache call appears to be invoked in generate_samples_interactive but not in generate_samples_from_prompt.

Environment (please complete the following information):

ameyagodbole added the bug (Something isn't working) label on Feb 25, 2025
ameyagodbole changed the title from "Missing clear_cache in stream_tokens leads to reuse of cache from previous examples" to "AssertionError in position embedding (potentially due to missing clear_cache between batches of data)" on Feb 25, 2025