I keep getting "index out of range in self" during forward pass #33

manaswimancha opened this issue Jan 27, 2024 · 1 comment

I am fine-tuning a Longformer Encoder Decoder model for multi-document summarization. When I run the forward pass, it raises "index out of range in self". The input shape looks correct, but the debugger points to something going wrong inside torch's Embedding layer. How do I fix this?
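
For context on the message itself: torch.nn.Embedding raises "index out of range in self" whenever an index it is asked to look up is at least as large as its table size, whether that index is a token id or a position id. A minimal standalone sketch (not tied to this repo) that reproduces the message:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # valid indices are 0..9
print(emb(torch.tensor([3, 7])).shape)  # fine: torch.Size([2, 4])
emb(torch.tensor([12]))                 # IndexError: index out of range in self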

import torch
from torch.optim import Adam
from tqdm import tqdm
from transformers import get_scheduler

num_epochs = 8
num_training_steps = num_epochs * len(train_dataloader)
optimizer = Adam(MODEL.parameters(), lr=3e-5)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=1, num_training_steps=num_training_steps  # CHANGE LATER!!!!!!!
)
progress_bar = tqdm(range(num_training_steps))

# Training mode
MODEL.train()

for epoch in range(num_epochs):
  for batch_idx, batch in enumerate(train_dataloader):

    # Encode data
    input_ids_all = []
    for cluster in batch["document"]:
      articles = cluster.split("|||||")[:-1]
      for i, article in enumerate(articles):
        article = article.replace("\n", " ")
        article = " ".join(article.split())
        articles[i] = article
      # Concatenate the tokenized articles, separated by DOCSEP, without their per-article BOS/EOS
      input_ids = []
      for article in articles:
        input_ids.extend(TOKENIZER.encode(article, truncation=True, max_length=4096 // len(articles))[1:-1])
        input_ids.append(DOCSEP_TOKEN_ID)
      input_ids = [TOKENIZER.bos_token_id] + input_ids + [TOKENIZER.eos_token_id]
      input_ids_all.append(torch.tensor(input_ids))

    # Pad once per batch, after every cluster has been encoded (not inside the cluster loop)
    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID)

    # Forward pass
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1

    # Labels are needed so that outputs.loss is populated (decoder max_length of 1024 is a guess)
    labels = TOKENIZER(batch["summary"], truncation=True, max_length=1024,
                       padding=True, return_tensors="pt").input_ids
    labels[labels == PAD_TOKEN_ID] = -100  # ignore padding positions in the loss

    print(input_ids.shape)
    # outputs = MODEL.forward(input_ids)  # <-- causing a bug
    outputs = MODEL(input_ids=input_ids, global_attention_mask=global_attention_mask, labels=labels)

    # Backprop
    loss = outputs.loss
    loss.backward()

    # Gradient descent step
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

    # Decode output (generate summaries for this batch, then decode them to strings)
    generated_ids = MODEL.generate(input_ids=input_ids, global_attention_mask=global_attention_mask)
    generated_str = TOKENIZER.batch_decode(generated_ids.tolist(), skip_special_tokens=True)
    metric.add_batch(predictions=generated_str, references=batch["summary"])

    # Calculate metrics
    print(f"Epoch: {epoch+1}, Batch: {batch_idx+1}:")
    print(metric.compute())
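
One quick way to narrow this down is to compare the padded batch against the encoder's position limit and vocabulary size right after pad_sequence and before the forward call; a small diagnostic sketch (the attribute names assume a Hugging Face LED-style config and may differ for other checkpoints):

# Run this right after pad_sequence, before calling MODEL(...)
max_pos = getattr(MODEL.config, "max_encoder_position_embeddings",
                  getattr(MODEL.config, "max_position_embeddings", None))
print("sequence length:", input_ids.shape[1], "| encoder position limit:", max_pos)
print("max token id:", int(input_ids.max()), "| vocab size:", MODEL.config.vocab_size)
# If the sequence length exceeds the position limit, the position embedding lookup
# is what raises "index out of range in self".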

hj940709 commented Oct 28, 2024

I had the same problem and spent a long time debugging it. It turns out the BOS and EOS tokens are not accounted for when the per-article budget is derived from the 4096-token max_input_len, so when the input documents are long enough, the concatenated token sequence can end up longer than the position embedding table, which is what triggers "index out of range in self". For me, the bug went away once I set max_length=4094 // len(articles) in the tokenizer call.
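
To make the length arithmetic concrete, here is a small sketch of the worst case implied by the snippet above (my own illustration, not code from this repo):

def worst_case_len(n_articles, budget=4096):
    # encode(..., max_length=budget // n) keeps at most budget // n ids per article;
    # the [1:-1] slice drops that article's BOS/EOS (-2), one DOCSEP is appended
    # per article (+1), and a final BOS/EOS pair wraps the whole sequence (+2).
    per_article = budget // n_articles - 2 + 1
    return n_articles * per_article + 2

print(worst_case_len(1))        # 4097 -> one past a 4096-entry position table
print(worst_case_len(2))        # 4096
print(worst_case_len(1, 4094))  # 4095 -> fits with the 4094 // len(articles) budget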
