Setting up Megatron-LM on the smol-cluster
After installing Miniconda in your user space at /fsx, log into a compute node and follow these steps:
# create new env
conda create --name megatron
conda activate megatron
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
pip install packaging
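# optional sanity check: confirm PyTorch was installed with CUDA support
# and can see the node's GPUs
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"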
# clone repos in code folder under fsx user space
cd /fsx/loubna
mkdir code && cd code
git clone https://github.com/bigcode-project/Megatron-LM.git
git clone https://github.com/NVIDIA/apex.git
# apex sometimes picks up the wrong CUDA version, so set CUDA_HOME explicitly
export CUDA_HOME=/usr/local/cuda-11.6
cd /fsx/loubna/code/apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
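To confirm that the CUDA extensions actually built, a quick sanity check (the extension module names below are an assumption and can differ between apex versions):
python -c "import amp_C, fused_layer_norm_cuda; print('apex CUDA extensions OK')"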
Extra dependencies:
pip install wandb transformers
wandb login
In your data folder (e.g. /fsx/loubna/data), download the tokenizer and preprocessed data:
cd data
git clone https://huggingface.co/bigcode/digit-bytelevel-bpe-jss-v1.1-49152
Download the Santacoder data, already tokenized and preprocessed for Megatron-LM, from GCP into the data folder; install gcloud or use rclone (already installed).
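A minimal sketch of the download, assuming you use gsutil from gcloud or an rclone remote named gcs; the bucket path is a placeholder, replace it with the real one:
# with gsutil (placeholder bucket path)
gsutil -m cp -r gs://<bucket>/santacoder-data /fsx/loubna/data/
# or with rclone, assuming a configured remote called "gcs"
rclone copy gcs:<bucket>/santacoder-data /fsx/loubna/data/santacoder-data --progress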
Other Slurm examples can be found in the examples folder.
A Slurm file for submitting a job is train.slurm, adapted from brrr/examples/t5 by Thomas Wang. It must be placed inside Megatron-LM; change it to match your setup and run:
sbatch train.slurm
If the job is terminated, check the logs to find the termination reason. You might get some of the following errors:
- If you get CommandNotFoundError: Your shell has not been properly configured to use 'conda activate', go to your ~/.bashrc and comment out these lines:
# If not running interactively, don't do anything
# case $- in
# *i*) ;;
# *) return;;
# esac
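Alternatively, you can source conda's shell hook in the job script so conda activate works in non-interactive shells (a sketch, assuming Miniconda was installed at /fsx/loubna/miniconda3; adjust the path to your install):
# make `conda activate` available in non-interactive (batch) shells
source /fsx/loubna/miniconda3/etc/profile.d/conda.sh
conda activate megatron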
- If you get errors about missing C++ libraries such as pybind11 and ninja, run:
conda install -c conda-forge pybind11
conda install -c conda-forge ninja
- If you get an error like:
21 | #include <cuda_profiler_api.h>
[ip-26-x]: | ^~~~~~~~~~~~~~~~~~~~~
[ip-26-x]:compilation terminated.
Make sure the correct CUDA version is loaded by adding this to your ~/.bashrc:
export PATH="/usr/local/cuda-11.6/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.6/lib64:$LD_LIBRARY_PATH"
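After re-sourcing your ~/.bashrc, you can verify that the right toolkit is picked up:
which nvcc       # should print /usr/local/cuda-11.6/bin/nvcc
nvcc --version   # should report release 11.6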
If you keep saving temporary files to /tmp or admin, you might run out of disk space. You can change TMPDIR to /scratch on the compute nodes (which have 6T) or to /fsx and clean it up manually from time to time. (TODO: there is an issue when adding mkdir to the slurm job to create a folder in scratch; it seems to only create it on the first node):
# add to ~/.bashrc
export TMPDIR=/fsx/loubna/deleteme/tmp
# for wandb cache
export WANDB_CACHE_DIR=/fsx/loubna/deleteme/wandb
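A possible workaround for the mkdir TODO above (an untested sketch; the /scratch/loubna/tmp path is just an example): run the mkdir through srun so it executes on every allocated node instead of only the first one:
# inside the slurm script, before launching training
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 mkdir -p /scratch/loubna/tmp
export TMPDIR=/scratch/loubna/tmp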
Check your wandb board 🚀 or run this to monitor the GPU utilization of your nodes:
# get jobid with squeue
NODES=$(scontrol show hostname `squeue -j JOBID --noheader -o %N`)
for ssh_host in $NODES
do
    echo $ssh_host
    ssh -q $ssh_host "nvidia-smi --format=csv --query-gpu=utilization.gpu,utilization.memory"
done
In the scaling_laws folder you can find a Python script that submits multiple Slurm jobs based on different parameters from a csv file; we used it in our scaling laws experiments.
python3 submit_jobs.py
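The submission pattern is roughly the following (an illustrative sketch only, with hypothetical file and column names; the actual logic lives in submit_jobs.py):
# read rows of a hypothetical params.csv (columns: idx,layers,hidden) and
# submit one job per row, passing the values to the Slurm script via --export
tail -n +2 params.csv | while IFS=, read -r idx layers hidden; do
    sbatch --export=ALL,IDX=$idx,NUM_LAYERS=$layers,HIDDEN_SIZE=$hidden train.slurm
done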
Some useful Slurm commands:
- sinfo: view information about Slurm nodes and partitions
- squeue: view information about running jobs
- scancel JOBID: cancel a running job
- scontrol show job JOBID: display the job state
To show the full names of jobs (they are long and get truncated with squeue):
squeue --format="%.18i %.9P %.30j %.8u %.8T %.10M %.9l %.6D %R" --me
To cancel a range of jobs, for example jobs 2675 to 2742:
scancel $(seq 2675 2742)
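To cancel all of your own jobs at once:
scancel -u $USER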
The script for submitting multiple jobs saves many logs. To find the latest modified log file of the job at index 39 in the spreadsheet, for example, run:
log_path=$(find logs -name "*idx_39-*" -printf '%T@ %f\n' | sort -rn | head -n 1 | cut -d ' ' -f 2-)
tail -f logs/$log_path
To access a compute node in an interactive bash session:
srun --nodes=1 --cpus-per-task=48 --gres=gpu:4 --mem-per-cpu=11G --partition=production-cluster --pty bash