This project aims to make it easy to build multi-modal data synthesis pipelines that chain multiple LLMs/MLLMs.
MM-INF is a multimodal data synthesis framework that generates diverse, high-quality multimodal data. It provides an automated pipeline of multiple LLMs/MLLMs that is configured with a single YAML file, and the pipeline can be easily extended to support new tasks.
This repo contains an official implementation of Oasis, a multimodal data synthesis framework. This method can generate diverse and high-quality multimodal instruction-response data based only on images, without any prior prompt.
[Read the Paper] | [Hugging Face Dataset]
This repository relies on the ms-swift and vLLM environments.
conda create -n mminf python=3.10 -y
conda activate mminf
pip install ms-swift==3.2.1
# The default CUDA version is 12.4. To use a different CUDA version (e.g., 12.1), run:
# Please make sure the torch version aligns with the vllm version.
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 --force-reinstall
pip install vllm==0.7.3
pip install qwen_vl_utils
To run the Oasis synthesis pipeline, use one of the following commands:
- Note: replace YOUR_PATH_TO in the YAML file with your own path or a Hugging Face dataset repo.
bash script/oasis.sh
# or
python3 infer_pipeline.py \
--config_file config/oasis.yaml \
--no-enable_history
To run a caption generation task, use one of the following commands:
bash script/caption_gen.sh
# or
python3 infer_pipeline.py \
--config_file config/caption_gen.yaml \
--enable_history
Please refer to the Commands document for more details about the command-line arguments.
- ✨ config/oasis.yaml: Generate multimodal training data based only on images.
  - Input images only, truncate the input tokens, and let the MLLM generate the user content
  - Perform quality control on the generated instructions
  - Generate a corresponding response for each instruction as SFT data
- config/VLThinking.yaml: A reproduction of VL-Thinking, an R1-derived visual instruction tuning dataset.
  - Generate a caption for each image
  - Generate CoT data with an R1-like model
  - Rewrite the response
  - Verify the correctness of the CoT response
- config/caption_gen.yaml: Generate detailed image captions.
- config/caption_conversation.yaml: Generate multimodal training data based on images and captions.
  - Input images and captions, truncate the input tokens, and let the MLLM generate the user content
  - Filter the generated instructions and perform quality control on them
  - Generate a corresponding response for each instruction as SFT data
- config/caption_hallucination.yaml: Generate hallucination-free captions based on images.
  - Generate a caption for each image
  - Split the caption into paragraphs and judge whether each paragraph contains hallucinations
  - Keep only captions with no hallucination in any paragraph
- config/detailed_inst.yaml: Generate a detailed instruction based on the caption of an image.
- config/prompt_response_match_score.yaml (Chinese): Score how well a response matches its instruction.
  - Input the instruction and response, and let the MLLM judge whether the response is correct
- config/prompt_evolve.yaml (Chinese): Evolve instruction-response pairs.
  - First scoring pass (match score of the original instruction-response pair)
  - Generate a new instruction from the original instruction and response, aiming to improve the match score
  - Generate a new response for the new instruction
  - Second scoring pass
This project can be used to implement a pipeline that chains multiple LLMs/MLLMs. Users only need to fill in a configuration file, the prompts, and post-processing functions. The main entry point is infer_pipeline.py; configuration files are placed in config/ and post-processing functions are placed in pp_func/.
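As a rough illustration, a post-processing function is just a small piece of Python that filters or cleans the outputs of one step before they are passed on. The sketch below is purely hypothetical: the function name, signature, and field names are assumptions, not the actual interface expected by infer_pipeline.py; see pp_func/ for real examples.

```python
# Hypothetical post-processing function -- the actual interface expected by
# infer_pipeline.py may differ; see pp_func/ for real examples.
def drop_short_instructions(samples, min_len=10):
    """Keep only samples whose generated instruction is reasonably long."""
    kept = []
    for sample in samples:
        instruction = sample.get("response", "").strip()  # assumed field name
        if len(instruction) >= min_len:
            kept.append(sample)
    return kept
```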
To learn more about task configuration or custom tasks, please refer to the Custom Tasks document.
- Interruptible: The number of processed samples for each step is saved in save_dir/cache/cache.jsonl, allowing users to resume from where they left off (see the sketch below). As long as the pipeline definition remains unchanged, inference can be stopped at any time.
- History: The history of each step is saved, allowing users to inspect the inference history of every step.
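The exact schema of cache.jsonl is not documented here, so the following is only a conceptual sketch of how resuming could work, assuming each cached line records a step name and the number of samples that step has completed.

```python
import json
import os

# Conceptual sketch of resuming from the cache file. The real cache schema
# used by infer_pipeline.py may differ; the field names below are assumptions.
def load_progress(cache_path="save_dir/cache/cache.jsonl"):
    """Return a {step_name: num_processed} map recorded in cache.jsonl."""
    progress = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                progress[record["step"]] = record["num_processed"]
    return progress

# Each step can then skip its first progress.get(step_name, 0) samples
# and continue from where the previous run stopped.
```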
Join our WeChat group.
@article{zhang2025oasis,
title={Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis},
author={Zhang, Letian and Cui, Quan and Zhao, Bingchen and Yang, Cheng},
journal={arXiv preprint arXiv:2503.08741},
year={2025}
}