PyTorch implementation and pretrained models for SimDINO and SimDINOv2.
Authors: Ziyang Wu, Jingyuan Zhang, Druv Pai, Xudong Wang, Chandan Singh, Jianwei Yang, Jianfeng Gao, Yi Ma
[02/25/25] We release code and pretrained checkpoints for SimDINO and SimDINOv2.
We provide checkpoints for both SimDINO and SimDINOv2 pretrained on ImageNet-1k for 100 epochs following configs detailed in our paper.
model | # of params | Algorithm | ImageNet k-NN | ImageNet linear | download |
---|---|---|---|---|---|
ViT-B/16 | 86 M | SimDINO | 74.9% | 77.3% | ckpt |
ViT-L/16 | 300 M | SimDINO | 75.6% | 77.4% | ckpt |
ViT-B/16 | 86 M | SimDINOv2 | 78.1% | 79.7% | ckpt |
ViT-L/16 | 300 M | SimDINOv2 | 81.1% | 82.4% | ckpt |
Below we also provide the checkpoints for the original DINO and DINOv2 models that we trained.
model | # of params | Algorithm | ImageNet k-NN | ImageNet linear | download |
---|---|---|---|---|---|
ViT-B/16 | 86 M | DINO | 72.9% | 76.3% | ckpt |
ViT-B/16 | 86 M | DINOv2 | 76.0% | 77.2% | ckpt |
ViT-L/16 | 300 M | DINOv2 | 80.8% | 82.0% | ckpt |
Note: our compute resources are limited, but we are working on scaling up our approach. Stay tuned for more model checkpoints. Meanwhile, we welcome and appreciate feedback and help from the community.
Our implementation requires Python 3.11+, PyTorch 2.4+, xFormers 0.0.29+, and some other packages. Note that the code has only been tested with the specified versions and expects a Linux environment. To set up the dependencies, install them via:
pip install -r requirements.txt
First, you need to download the ImageNet-1k dataset.
The root directory of the dataset should hold the following contents:
<ROOT>/test/ILSVRC2012_test_00000001.JPEG
<ROOT>/test/[...]
<ROOT>/test/ILSVRC2012_test_00100000.JPEG
<ROOT>/train/n01440764/n01440764_10026.JPEG
<ROOT>/train/[...]
<ROOT>/train/n15075141/n15075141_9993.JPEG
<ROOT>/val/n01440764/ILSVRC2012_val_00000293.JPEG
<ROOT>/val/[...]
<ROOT>/val/n15075141/ILSVRC2012_val_00049174.JPEG
<ROOT>/labels.txt
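Before launching a long run, it can help to sanity-check the layout above. The following is a minimal sketch and not part of this repository: `check_imagenet_root` is a hypothetical helper that only verifies the top-level entries listed above and that `train/` and `val/` contain class sub-directories. It does not validate image counts or file contents.

```python
from pathlib import Path


def check_imagenet_root(root):
    """Return a list of problems found under an ImageNet-1k root (hypothetical helper)."""
    root = Path(root)
    problems = []
    # Top-level entries taken from the layout listed above.
    for name in ("train", "val", "test"):
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}/")
    if not (root / "labels.txt").is_file():
        problems.append("missing file: labels.txt")
    # train/ and val/ are expected to hold one sub-directory per class (e.g. n01440764).
    for split in ("train", "val"):
        split_dir = root / split
        if split_dir.is_dir() and not any(p.is_dir() for p in split_dir.iterdir()):
            problems.append(f"no class sub-directories in {split}/")
    return problems
```

An empty list means the basic layout matches; any returned strings describe what is missing.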
For SimDINOv2 only, you also need to configure and run `python prepare.py` to generate some metadata files.
The generated files should have the following structure:
<EXTRA>/class-ids-TRAIN.npy
<EXTRA>/class-ids-VAL.npy
<EXTRA>/class-names-TRAIN.npy
<EXTRA>/class-names-VAL.npy
<EXTRA>/entries-TEST.npy
<EXTRA>/entries-TRAIN.npy
<EXTRA>/entries-VAL.npy
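To confirm that the metadata step finished, one option is to check for the files listed above. This is a small sketch under the assumption that all seven files land directly in `<EXTRA>`; `missing_extra_files` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path

# File names copied from the listing above.
EXPECTED_EXTRA_FILES = (
    "class-ids-TRAIN.npy",
    "class-ids-VAL.npy",
    "class-names-TRAIN.npy",
    "class-names-VAL.npy",
    "entries-TEST.npy",
    "entries-TRAIN.npy",
    "entries-VAL.npy",
)


def missing_extra_files(extra_dir):
    """Return the expected metadata files that are absent from extra_dir (hypothetical helper)."""
    extra_dir = Path(extra_dir)
    return [name for name in EXPECTED_EXTRA_FILES if not (extra_dir / name).is_file()]
```

If the returned list is non-empty, re-running the metadata step is likely needed before training SimDINOv2.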
You can train SimDINO on ViT-B/16 with an 8-GPU node (each with at least 40G memory):
cd simdino
torchrun --nnodes=1 --nproc_per_node=8 main_dino.py --arch vit_base --patch_size 16 --local_crops_number 10 \
--eps 0.05 --coeff 1 --output_dir <PATH/TO/OUTPUT/DIR> --data_path <PATH/TO/DATASET/TRAIN> \
--track_wandb # to enable logging; use --track_wandb to log with wandb and --track_swan to log with swanlab
Training time is approximately 1.5 days and you should be able to replicate our reported results. An example log on ViT-B/16 can be found here.
You can train SimDINOv2 on ViT-L/16 with an 8-GPU node (each with at least 40G memory):
torchrun --nnodes=1 --nproc_per_node=8 simdinov2/train/train.py \
--config-file simdinov2/configs/simdino_config.yaml \
--output-dir <PATH/TO/OUTPUT/DIR> \
train.dataset_path=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
Training time is approximately 1 day and you should be able to replicate our reported results. An example log on ViT-B/16 can be found here.
The training code saves the teacher weights in the `eval` folder every 10 epochs for evaluation. You can change the `student.arch` field in `simdino_config.yaml` to train other models.
You can also use `submitit` if your environment is a SLURM cluster:
python simdinov2/run/train/train.py \
--nodes 1 \
--config-file simdinov2/configs/simdino_config.yaml \
--output-dir <PATH/TO/OUTPUT/DIR> \
train.dataset_path=ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
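The dataset argument in the commands above is a colon-separated string of the form `ImageNet:split=...:root=...:extra=...`. If you launch runs from a script, a tiny helper can reduce typos when building it. This is only a sketch: `imagenet_dataset_str` is a hypothetical name, and it accepts only the `TRAIN` and `VAL` splits that appear in this README.

```python
def imagenet_dataset_str(split, root, extra=None):
    """Build a dataset string of the form used in the commands above (hypothetical helper)."""
    split = split.upper()
    # Only the splits that appear in this README's commands are accepted here.
    if split not in ("TRAIN", "VAL"):
        raise ValueError(f"unknown split: {split}")
    # The example commands pass the same path for root and extra, so default to that.
    extra = root if extra is None else extra
    return f"ImageNet:split={split}:root={root}:extra={extra}"
```

For example, `imagenet_dataset_str("train", "/data/imagenet")` yields a value that can be passed as `train.dataset_path=...`.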
Q: How can I visualize the training losses?
A: In SimDINO, you can remove the `--nowandb` argument to enable wandb logging.
Q: I notice some spikes in coding rate loss in early training stages. Is that normal?
A: Occasional spikes are normal and shouldn't impact final performance. If you notice too much instability, the following operations can help:
- Set `--expa_type=1`. Spikes are sometimes caused by a sudden change in the conditioning of the covariance matrix, and this option applies some "smoothing" by centering the student and teacher features.
- Set a smaller `--eps`.
Q: I can only use small batch sizes per GPU for training. What should I do?
A: You can set `--reduce_cov=1` to collect covariance matrices from multiple GPUs via all_reduce. Empirically, we found this unnecessary even with 64 samples per GPU.
The teacher weights are regularly saved and can be evaluated using the following scripts.
For example, to run k-NN evaluation for SimDINO on ViT-B/16:
cd simdino
torchrun --nproc_per_node=8 eval_knn.py --patch_size 16 --arch vit_base \
--pretrained_weights <PATH/TO/OUTPUT/DIR>/checkpoint.pth --data_path <PATH/TO/DATASET>
To run linear evaluation for SimDINO on ViT-B/16:
cd simdino
torchrun --nproc_per_node=8 eval_linear.py --patch_size 16 --arch vit_base \
--pretrained_weights <PATH/TO/OUTPUT/DIR>/checkpoint.pth --data_path <PATH/TO/DATASET>
For SimDINOv2, k-NN evaluation can be run with:
python simdinov2/run/eval/knn.py \
--config-file <PATH/TO/OUTPUT/DIR>/config.yaml \
--pretrained-weights <PATH/TO/OUTPUT/DIR>/eval/training_24999/teacher_checkpoint.pth \
--output-dir <PATH/TO/OUTPUT/DIR>/eval/training_24999/knn \
--train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
--val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
Linear evaluation for SimDINOv2:
python simdinov2/run/eval/linear.py \
--config-file <PATH/TO/OUTPUT/DIR>/config.yaml \
--pretrained-weights <PATH/TO/OUTPUT/DIR>/eval/training_24999/teacher_checkpoint.pth \
--output-dir <PATH/TO/OUTPUT/DIR>/eval/training_24999/linear \
--train-dataset ImageNet:split=TRAIN:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET> \
--val-dataset ImageNet:split=VAL:root=<PATH/TO/DATASET>:extra=<PATH/TO/DATASET>
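The SimDINOv2 commands above point at a specific checkpoint folder (`training_24999`). If your run stopped at a different iteration, a short script can locate the most recent teacher checkpoint. This is a sketch that assumes the `<PATH/TO/OUTPUT/DIR>/eval/training_<iteration>/teacher_checkpoint.pth` layout shown in those commands; `latest_teacher_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path


def latest_teacher_checkpoint(output_dir):
    """Return the teacher checkpoint with the highest iteration number, or None (hypothetical helper)."""
    best_iter, best_path = -1, None
    # Assumes the eval/training_<iteration>/teacher_checkpoint.pth layout used above.
    for ckpt in Path(output_dir).glob("eval/training_*/teacher_checkpoint.pth"):
        match = re.fullmatch(r"training_(\d+)", ckpt.parent.name)
        # Compare iterations numerically so that 24999 ranks above 999.
        if match and int(match.group(1)) > best_iter:
            best_iter, best_path = int(match.group(1)), ckpt
    return best_path
```

The returned path can then be passed to `--pretrained-weights`, with `--output-dir` set to a sub-folder next to it.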
If you find this project useful, please consider giving us a star and citation:
@article{wu2025simplifying,
title={Simplifying DINO via Coding Rate Regularization},
author={Wu, Ziyang and Zhang, Jingyuan and Pai, Druv and Wang, XuDong and Singh, Chandan and Yang, Jianwei and Gao, Jianfeng and Ma, Yi},
journal={arXiv preprint arXiv:2502.10385},
year={2025}
}
This project is largely built upon the original DINO and DINOv2 projects.