ParaLLaMA

Run large LLaMA models with long context lengths by porting the LLaMA model classes to parallelformers for tensor parallelism. Using 4x 16GB GPUs, I'm able to run LLaMA 13B with context lengths up to 2048 tokens.

Installation

Create a new Python environment and run pip install -r requirements.txt. There is no need to install a special branch of the transformers library; all LLaMA class definitions are contained in this repo.

Demo

Run python parallelization_example.py in your conda environment with the appropriate model and prompt to see it in action! Refer to inference_example.py and training_example.py for single-GPU / CPU demonstrations.

Note that parallelformers requires each parallelize call to be wrapped in an if __name__ == "__main__": block. Additionally, parallelformers only supports inference, not training.
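For orientation, here is a minimal sketch of the tensor-parallel inference flow. It assumes the repo's LLaMA classes follow the naming in Jason Phang's PR (LLaMAForCausalLM, LLaMATokenizer) and expose a transformers-style from_pretrained interface; the llama module name and checkpoint path are hypothetical placeholders. See parallelization_example.py for the actual script.

```python
from parallelformers import parallelize

from llama import LLaMAForCausalLM, LLaMATokenizer  # hypothetical module name

if __name__ == "__main__":
    # parallelize() spawns worker processes, so it must run under this guard.
    tokenizer = LLaMATokenizer.from_pretrained("/path/to/llama-13b")  # placeholder path
    model = LLaMAForCausalLM.from_pretrained("/path/to/llama-13b")

    # Shard the model's weights across 4 GPUs, casting to fp16.
    parallelize(model, num_gpus=4, fp16=True)

    # Inputs can stay on CPU; parallelformers handles device placement.
    inputs = tokenizer("The meaning of life is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```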

Credits: Significant portions of the codebase are ported from the LLaMA PR by Jason Phang. The inference and training examples are adapted from Yam Peleg's user-friendly LLaMA repo.
