
AssertionError in position embedding (potentially due to missing clear_cache between batches of data) #1340

Open
ameyagodbole opened this issue Feb 25, 2025 · 0 comments
Labels: bug (Something isn't working)

Describe the bug
I have been trying to use lm-evaluation-harness with gpt-neox/eval.py. As far as I can tell, the request types other than generate_until work fine. With generate_until, I hit the following assertion (in the position embedding module) after a couple of examples have been processed:

```python
assert seq_len <= self.max_seq_len
```

In my testing, the model is about to generate (say) token 48. I have verified that the token_index_to_generate in gpt-neox/megatron/text_generation_utils.py is in fact 48. But somehow RotaryEmbedding is trying to create an embedding for position 1025 (beyond the model_max_length).
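
For intuition, here is a hypothetical, simplified sketch (not the actual gpt-neox RotaryEmbedding; the class name and offset attribute are invented) of how a cached position offset that survives across examples could trip this assertion:

```python
# Hypothetical sketch, NOT the real gpt-neox RotaryEmbedding: it models only
# a cached position offset that persists between generate_until examples.
class CachedRotarySketch:
    def __init__(self, max_seq_len):
        self.max_seq_len = max_seq_len
        self.offset = 0  # becomes stale if not cleared between examples

    def forward(self, num_new_positions):
        seq_len = self.offset + num_new_positions
        assert seq_len <= self.max_seq_len  # the failing check
        self.offset = seq_len  # the cache advances with every call

    def clear_cache(self):
        self.offset = 0  # what the proposed fix would restore


rope = CachedRotarySketch(max_seq_len=1024)
rope.forward(977)  # example 1 ends at position 977: fine
# Without clear_cache(), example 2 resumes from offset 977, so generating
# its token 48 requests position 977 + 48 = 1025 -> AssertionError.
rope.forward(48)
```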

To Reproduce
I will fill in reproducible configs. Currently, I'm using a model with a custom config (but trained in neox) and evaluating on a QA dataset (where eval-harness uses generate_until).

Proposed solution
I suspect the issue is caused by a missing clear_cache() between batches of data. Adding model.module.clear_cache() at the start of stream_tokens in gpt-neox/megatron/text_generation_utils.py seems to fix it on my side; see the sketch below.
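
A minimal sketch of the proposed change, with the function signature abbreviated (the real stream_tokens takes more arguments than shown; only the added first line matters):

```python
# Sketch of the fix in gpt-neox/megatron/text_generation_utils.py.
def stream_tokens(neox_args, model, context_tokens, maximum_tokens):
    # Proposed fix: drop cached inference state left over from the previous
    # batch, mirroring the clear_cache() call that
    # generate_samples_interactive already makes before generating.
    model.module.clear_cache()
    # ... existing generation loop continues unchanged ...
```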

I am unsure whether this is correct, or whether it is a complete fix: the same clear_cache call appears to be invoked in generate_samples_interactive but not in generate_samples_from_prompt.

Environment (please complete the following information):

ameyagodbole added the bug (Something isn't working) label on Feb 25, 2025
ameyagodbole changed the title from "Missing clear_cache in stream_tokens leads to reuse of cache from previous examples" to "AssertionError in position embedding (potentially due to missing clear_cache between batches of data)" on Feb 25, 2025