MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts [arXiv]
Official Implementation of MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts
TL;DR: MoPE is a prompt-based method for fusing unimodal pretrained models (e.g., ViT, BERT, Wav2Vec) on downstream multimodal tasks. MoPE is parameter-efficient and scalable, and achieves state-of-the-art performance on a range of multimodal tasks.
The key innovation of MoPE is to decompose a long prompt into short, specialized prompt experts, which are routed per instance by a multimodal router.
🔥 🔥 Update (2025/3/11): We have released preliminary MoPE code for vision-language classification.
First, install the dependencies (tested with Python 3.8):
pip install -r requirements.txt
Download the pretrained Swin-base model from here (alternative link here), and put `swin_base_patch4_window7_224_22k.pth` into the `pretrained` folder.
Download the datasets and put them into the `data` folder. The classification datasets include UPMC Food-101, SNLI-VE, and MM-IMDB. For the MUStARD dataset, refer to here.
After downloading, you may use `utils/process_food_101.py` and `utils/process_mm_imdb.py` to preprocess the datasets into JSONL format.
The data folder should look like this:
data
├── food-101
├── mmimdb
└── snli_ve
Train the MoPE model with the following command, for example on MM-IMDB:
python main_classify.py --exp_name full_imdb --use_vpt --use_pbert --fuse_method mope --train_instructor --dataset imdb --prompt_length 6 --moe_n_experts 4 --t_prompt_length 4 --lr_vis 4e-4 --lr_text 5e-4 --w_imp 0.01 --use_instruct
Monitor the training process with TensorBoard:
tensorboard --logdir /logs --port 6006
If you have any questions, please feel free to contact me via email or open a GitHub issue. If you find this project useful, please consider citing our paper.
@article{jiang2024mope,
title={MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts},
author={Jiang, Ruixiang and Liu, Lingbo and Chen, Changwen},
journal={arXiv preprint arXiv:2403.10568},
year={2024}
}