Reassessing Layer Pruning in LLMs: New Insights and Methods

Introduction

Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale requires substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, a simple yet effective compression method, directly removes layers from a model to reduce computational overhead. But what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Adaptation) family, widely regarded as a leading method for fine-tuning pruned models, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gain insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the `lm_head` and the last three remaining layers, yields remarkably strong performance. Following this guide, we prune Llama-3.1-8B-It and obtain a model that outperforms many popular LLMs of similar size, such as ChatGLM2-6B, Vicuna-7B-v1.5, Qwen1.5-7B, and Baichuan2-7B. We release the optimal model weights on Hugging Face, and the code is available on GitHub.

Supported LLMs:

Our Pruned Models

Step-by-step Instructions

1. Download HellaSwag from the Hugging Face Hub:

python hf_download.py --dataset Rowan/hellaswag --save_dir saved_path
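
hf_download.py is the repository's download helper; as a rough, illustrative equivalent (not the script itself), the dataset step can be done with the datasets library, with the save path chosen to match your --save_dir:

```python
# Minimal sketch of the dataset download step (illustrative, not the repo's hf_download.py).
from datasets import load_dataset

hellaswag = load_dataset("Rowan/hellaswag")           # fetch all splits from the Hugging Face Hub
hellaswag.save_to_disk("saved_path/Rowan_hellaswag")  # cache locally for offline (TRANSFORMERS_OFFLINE=1) runs
```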

2. Download Vicuna-7B-v1.5 from the Hugging Face Hub:

python hf_download.py --model lmsys/vicuna-7b-v1.5 --save_dir saved_path
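
Model weights can be mirrored locally in the same spirit; a minimal sketch with huggingface_hub (the local directory layout is illustrative):

```python
# Minimal sketch of the model download step (illustrative, not the repo's hf_download.py).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmsys/vicuna-7b-v1.5",
    local_dir="saved_path/vicuna-7b-v1.5",  # illustrative path; match your --save_dir
)
```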

3. Prune Llama-3.1-8B-Instruct by removing 8 layers in reverse order (tail pruning):

CUDA_VISIBLE_DEVICES=0,1 TRANSFORMERS_OFFLINE=1 python prune_llm.py --base_model Llama-3.1-8B-Instruct --save_model --pr_method tail --remove_layer 8
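
Conceptually, reverse-order (tail) pruning just drops the last decoder layers. A hedged sketch with transformers, not the repo's prune_llm.py (the model path and output directory are placeholders):

```python
# Sketch of reverse-order ("tail") pruning: drop the last 8 decoder layers of a Llama-style model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("Llama-3.1-8B-Instruct")

remove_layer = 8
model.model.layers = model.model.layers[: len(model.model.layers) - remove_layer]
model.config.num_hidden_layers = len(model.model.layers)  # keep the config consistent with the new depth

model.save_pretrained("llama3.1-8b-it-tail8")              # placeholder output directory
tokenizer.save_pretrained("llama3.1-8b-it-tail8")
```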

4. Fine-tune the pruned Llama-3.1-8B-Instruct with LoRA:

CUDA_VISIBLE_DEVICES=0,1 TRANSFORMERS_OFFLINE=1 python finetune_pruned.py --base_model Llama-3.1-8B-Instruct --save_model --pr_method tail --remove_layer 8 --prune_model_path your_path
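
The LoRA route attaches low-rank adapters to the pruned model and trains only those. A hedged sketch of that setup with peft (hyperparameters are illustrative, not the script's defaults):

```python
# Sketch of LoRA fine-tuning setup for the pruned model (illustrative, not finetune_pruned.py).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("llama3.1-8b-it-tail8")  # pruned checkpoint from step 3
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common Llama attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
# ...train as usual, then model.save_pretrained("lora_path") for use with peft=lora_path in step 6
```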

5. Fine-tune the pruned Llama-3.1-8B-Instruct with partial-layer fine-tuning:

CUDA_VISIBLE_DEVICES=0,1 TRANSFORMERS_OFFLINE=1 python partial_fine-tuning.py --base_model Llama-3.1-8B-Instruct --save_model --prune_model_path your_path --partial_layer_name last3
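
Partial-layer fine-tuning is the recipe highlighted in the paper: freeze everything except `lm_head` and the last three remaining layers. A hedged sketch of that freezing logic (illustrative, not partial_fine-tuning.py):

```python
# Sketch of the "lm_head + last3" partial-layer fine-tuning setup.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("llama3.1-8b-it-tail8")  # pruned checkpoint from step 3

for param in model.parameters():          # freeze the whole model first
    param.requires_grad = False
for param in model.lm_head.parameters():  # then unfreeze the output head
    param.requires_grad = True
for layer in model.model.layers[-3:]:     # ...and the last three remaining decoder layers
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```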

6. Evaluate the pruned Llama-3.1-8B-Instruct (LoRA fine-tuned) with lm-evaluation-harness:

CUDA_VISIBLE_DEVICES=0,1 TRANSFORMERS_OFFLINE=1 lm_eval --model hf --model_args pretrained=model_path,trust_remote_code=True,peft=lora_path,parallelize=True --tasks mmlu,cmmlu,piqa,openbookqa,winogrande,hellaswag,arc_easy,arc_challenge --device cuda:0 --batch_size auto --num_fewshot 0

7. Evaluate the pruned Llama-3.1-8B-Instruct (partial-layer fine-tuned) with lm-evaluation-harness:

CUDA_VISIBLE_DEVICES=0,1 TRANSFORMERS_OFFLINE=1 lm_eval --model hf --model_args pretrained=model_path,trust_remote_code=True,parallelize=True --tasks mmlu,cmmlu,piqa,openbookqa,winogrande,hellaswag,arc_easy,arc_challenge --device cuda:0 --batch_size auto --num_fewshot 0
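
The same evaluation can also be driven from Python; a hedged sketch using lm-evaluation-harness's simple_evaluate (API names as of lm-eval 0.4.x; check your installed version):

```python
# Sketch of a programmatic zero-shot evaluation with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=model_path,trust_remote_code=True,parallelize=True",
    tasks=["mmlu", "cmmlu", "piqa", "openbookqa", "winogrande",
           "hellaswag", "arc_easy", "arc_challenge"],
    num_fewshot=0,
    batch_size="auto",
)
print(results["results"])
```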

8. Measure MACs, parameters, and memory:

CUDA_VISIBLE_DEVICES=0 TRANSFORMERS_OFFLINE=1 python test_speedup.py --base_model_path model_path
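
Parameter count and peak GPU memory are straightforward to check by hand; a hedged sketch below (illustrative, not the repo's test_speedup.py; MACs would additionally need a profiler such as ptflops or thop):

```python
# Sketch: count parameters and measure peak GPU memory for the pruned model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("model_path", torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained("model_path")

num_params = sum(p.numel() for p in model.parameters())
print(f"Params: {num_params / 1e9:.2f}B")

inputs = tokenizer("Layer pruning keeps the model small.", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(**inputs)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GiB")
```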

Zero-shot Evaluation

(Figure: zero-shot evaluation results of the pruned models.)

Acknowledgement
