- When you want to fine-tune vision models with a different image size (e.g., optimizing performance for a Kaggle competition dataset with an image size of 64); see the discussion on this use case
- When you want supervised pretraining on ImageNet-1K
- When you want self-supervised pretraining on ImageNet-1K
Download the ImageNet-1K dataset from the Hugging Face Hub and arrange the files as follows (for more details, check data/):
```
data/
├── train_images_0.tar.gz
├── train_images_1.tar.gz
├── train_images_2.tar.gz
├── train_images_3.tar.gz
├── train_images_4.tar.gz
└── val_images.tar.gz
```
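If you prefer to script the download, the archives can be fetched with huggingface_hub. This is only a sketch: the dataset repo id (ILSVRC/imagenet-1k) and the data/ prefix inside the repo are assumptions to verify against the dataset page, and the dataset is gated, so accept the license and log in with huggingface-cli login first.

```python
from huggingface_hub import hf_hub_download

# Assumed repo id and in-repo paths; check the Hugging Face dataset page.
REPO_ID = "ILSVRC/imagenet-1k"
FILES = [f"data/train_images_{i}.tar.gz" for i in range(5)] + ["data/val_images.tar.gz"]

for filename in FILES:
    # local_dir="." preserves the in-repo path, so archives land in ./data/.
    hf_hub_download(
        repo_id=REPO_ID,
        filename=filename,
        repo_type="dataset",
        local_dir=".",
    )
```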
Install the requirements:

```bash
pip install -r requirements.txt
```

Extract the ImageNet-1K archives located in data/, then run the preprocessing script:

```bash
python -m data.preprocess
```
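The extraction step can also be scripted; here is a minimal sketch using Python's tarfile (the per-archive target directories are an assumption, so check data/ for the exact layout that data.preprocess expects):

```python
import tarfile
from pathlib import Path

DATA_DIR = Path("data")

# Extract each archive into its own subdirectory of data/ (assumed layout).
for archive in sorted(DATA_DIR.glob("*.tar.gz")):
    target = DATA_DIR / archive.name.replace(".tar.gz", "")
    target.mkdir(exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=target)
```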
Classic supervised training on the ImageNet-1K dataset:
```bash
python -m train --save_dir weights \
    --model_name convnext_base \
    --n_epoch 200 \
    --batch_size 128 \
    --n_worker 8 \
    --n_device 8 \
    --precision 16-mixed \
    --strategy ddp \
    --save_frequency 5 \
    --drop_path_rate 0.5 \
    --label_smoothing 0.1 \
    --input_size 224
```
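For reference, --label_smoothing 0.1 is standard label-smoothed cross-entropy, which PyTorch provides directly; a sketch of the loss alone, not this repo's training loop:

```python
import torch
from torch import nn

# With smoothing 0.1 and K classes, the target distribution gives the true
# class 1 - 0.1 + 0.1/K and every other class 0.1/K.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 1000)           # batch of 4, 1000 ImageNet classes
targets = torch.randint(0, 1000, (4,))  # integer class labels
loss = criterion(logits, targets)
```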
Currently, only Facebook Research's MAE (masked autoencoder) is supported for self-supervised pretraining:
```bash
python -m pretraining --save_dir pretrained_weights \
    --model_name facebook/vit-mae-base \
    --n_epoch 400 \
    --batch_size 256 \
    --n_worker 8 \
    --n_device 8 \
    --precision 16-mixed \
    --strategy ddp \
    --save_frequency 20 \
    --input_size 224 \
    --wd 0.05 \
    --norm_pix_loss
```
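For intuition, --norm_pix_loss switches the MAE reconstruction target to per-patch-normalized pixels, as in the MAE paper. Below is a minimal sketch of the objective using the Hugging Face transformers implementation of the same checkpoint name; whether this repo wraps transformers or the original facebookresearch/mae code is an assumption.

```python
import numpy as np
from transformers import AutoImageProcessor, ViTMAEForPreTraining

# norm_pix_loss=True: targets are per-patch mean/var normalized pixels;
# mask_ratio=0.75 masks 75% of the patches, as in the MAE paper.
model = ViTMAEForPreTraining.from_pretrained(
    "facebook/vit-mae-base", norm_pix_loss=True, mask_ratio=0.75
)
processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")

# A random 224x224 RGB array stands in for one ImageNet sample.
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
print(outputs.loss)  # MSE computed on the masked patches only
```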
Then fine-tune the pretrained model on ImageNet-1K:

```bash
python -m finetuning --save_dir weights \
    --model_name facebook/vit-mae-base \
    --pretrained_dir pretrained_weights \
    --n_epoch 200 \
    --batch_size 128 \
    --n_worker 8 \
    --n_device 8 \
    --precision 16-mixed \
    --strategy ddp \
    --save_frequency 5 \
    --input_size 224 \
    --drop_path_rate 0.1 \
    --label_smoothing 0.1 \
    --wd 0.05
```
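--drop_path_rate (used in both the supervised and fine-tuning commands) controls stochastic depth: during training, each residual branch is skipped per sample with some probability. A generic sketch of the mechanism, not this repo's exact implementation:

```python
import torch
from torch import nn

class DropPath(nn.Module):
    """Stochastic depth: randomly drop the residual branch per sample."""

    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = torch.empty(shape, dtype=x.dtype, device=x.device).bernoulli_(keep_prob)
        return x * mask / keep_prob
```

Inside a residual block, the branch output passes through this module before being added back to the input; in practice the rate is usually scaled linearly across the network's blocks.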
- Check the results/ directory for the full results.
- mae_vit_base
  - Pretraining takes about 48 hours on 8 x RTX 3090.
  - Fine-tuning takes about 36 hours on 8 x RTX 3090.
| metric   | mae_vit_base | vit_base |
|----------|--------------|----------|
| top1_acc | 81.24        | 78.47    |
This project makes use of the following libraries and models: