
Llama 3.1 model acts differently between AutoModelForCausalLM and LlamaForCausalLM #1330

Open
TangMohan opened this issue Mar 8, 2025 · 0 comments



Describe the bug

I have one set of weights, one tokenizer, the same prompt, and identical generation parameters. Yet somehow, when I load the model using AutoModelForCausalLM, I get one output, and when I construct it manually with LlamaForCausalLM plus the same config and state_dict, I get another output entirely.

The code below reproduces the difference on both an A6000 and an A100.

Minimal reproducible example

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    LlamaForCausalLM,
    LlamaConfig
)

# 1) Adjust these as needed
model_name = "meta-llama/Llama-3.1-8B"
prompt = "Hello from Llama 3.1! Tell me something interesting."
dtype = torch.float16  # or torch.float32 if needed

# 2) Get the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Prepare input
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

############################################
# A) Load with AutoModelForCausalLM
############################################

print("=== Loading with AutoModelForCausalLM ===")

model_auto = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",  # matches your usage
    torch_dtype=dtype
).cuda()
model_auto.eval()  # turn off dropout
config = model_auto.config
with torch.no_grad():
    out_auto = model_auto(**inputs)
logits_auto = out_auto.logits  # shape: [batch_size, seq_len, vocab_size]

del model_auto
torch.cuda.empty_cache()

############################################
# B) Load with LlamaForCausalLM + config
############################################

print("=== Loading with LlamaForCausalLM + config ===")

# Reuse the config captured from model_auto above and build the Llama model directly
model_llama = LlamaForCausalLM(config).cuda()
model_llama.eval()

# Load the same weights that AutoModelForCausalLM used
model_auto_temp = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype)
model_llama.load_state_dict(model_auto_temp.state_dict())
del model_auto_temp
torch.cuda.empty_cache()

with torch.no_grad():
    out_llama = model_llama(**inputs)
logits_llama = out_llama.logits

############################################
# C) Compare the Logits
############################################

# Compute maximum absolute difference
max_diff = (logits_auto - logits_llama).abs().max()
print(f"\nMax absolute difference between logits: {max_diff.item()}")

if max_diff < 1e-7:
    print("→ The logits are effectively identical (within floating-point precision).")
else:
    print("→ There is a non-trivial difference in logits!")

Output

```
Max absolute difference between logits: 0.11245954036712646
→ There is a non-trivial difference in logits!
```

Runtime Environment

  • Model: meta-llama/Llama-3.1-8B
  • Using via huggingface?: yes
  • OS: Linux
  • GPU VRAM: 40GB
  • Number of GPUs: 1
  • GPU Make: Nvidia

Additional context
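
One thing I have not ruled out yet is whether the two code paths actually end up with the same parameter dtype and attention backend: in path B the model is constructed under torch's default dtype (float32) before load_state_dict, and load_state_dict casts the incoming fp16 weights to the destination dtype, so the manually built model may effectively run in float32. The snippet below is only a diagnostic sketch (not part of the repro above) to print what each path ends up with; config._attn_implementation is an internal attribute in recent transformers versions, so it is used here purely for inspection.

```python
import torch
from transformers import AutoModelForCausalLM, LlamaForCausalLM

model_name = "meta-llama/Llama-3.1-8B"

# Path A: from_pretrained with an explicit torch_dtype and eager attention
model_auto = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",
    torch_dtype=torch.float16,
)

# Path B: construct from the same config, then copy the weights over.
# The parameters are created in torch's default dtype here (float32),
# and load_state_dict casts the fp16 checkpoint tensors to that dtype.
# Note: this keeps both copies in CPU memory, so it needs a lot of RAM.
model_llama = LlamaForCausalLM(model_auto.config)
model_llama.load_state_dict(model_auto.state_dict())

# Print the effective dtype and attention implementation of each instance
print("auto :", next(model_auto.parameters()).dtype,
      model_auto.config._attn_implementation)
print("llama:", next(model_llama.parameters()).dtype,
      model_llama.config._attn_implementation)
```

If the dtypes turn out to differ, that alone might account for a logit difference of this size, but I have not confirmed it.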
