DeepSeek-R1 demonstrates outstanding reasoning abilities when tackling math, coding, puzzle, and science problems, as well as responding to general inquiries. However, as a text-only reasoning model, R1 cannot process multimodal inputs like images, which limits its practicality in certain situations. Exploring the potential for multimodal reasoning is an intriguing prospect.
We built this project to create a model that can reason over both text and images.
2025/02/08: We are excited to announce the release of the initial version of our cold-start dataset on 🤗 HuggingFace. This first trial employs the `DeepSeek-R1-Distill-Qwen-32B` model for reasoning and the `GPT-4o-mini` model for image captioning and data formatting.
Note
We are actively working on developing enhanced versions that will:
- Incorporate more powerful models.
- Increase task diversity.
- Improve sample quality.
Stay tuned for updates!
We explore distilling the strong reasoning capabilities of a Large Language Model (LLM) such as R1 into a Large Vision-Language Model (LVLM). Specifically, we utilize three kinds of data:
- Text Data: Text-only reasoning datasets.
- Text Rendering Data: Curated from text-only reasoning datasets using a reformatting and rendering pipeline (described below). We use these data to encourage identical responses to inputs from different modalities.
- Multimodal Data: Curated from raw multimodal datasets. We adopt a simple strategy, called Caption-Prefixing, to mitigate the absence of vision capabilities in text-only reasoning models.
| Type | Source Dataset | Samples |
| --- | --- | --- |
| Text | Bespoke-Stratos-17k | 16.7k |
| Text Rendering | Bespoke-Stratos-17k | 12.6k |
| Multimodal | AI2D | 7.8k |
| Text / Multimodal | ScienceQA | 9.9k |
| Multimodal | PixMo-Cap-QA | 19.4k |
Similar to the reasoning-forcing trick, we make the model pretend to "see" the image by captioning it at the beginning of the thinking process. We use a simple template:
# English
prefix_en = "<think>\nHmm, to solve this problem, let's first take a look at the image. {}\n\nNow".format(image_caption)
# Chinese
prefix_zh = "<think>\n嗯,为了回答这个问题,让我先看一下图片。{}\n\n首先".format(image_caption)
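To make this concrete, here is a minimal sketch of how the prefix could seed the text-only reasoner's generation. It assumes the Hugging Face `transformers` API with the `DeepSeek-R1-Distill-Qwen-32B` checkpoint mentioned above; the helper function and generation settings are illustrative, not our exact pipeline.

```python
# Illustrative sketch of caption-prefixing: force the text-only reasoner to
# begin its thinking with the caption prefix and let it continue as if it
# had seen the image. Depending on the model's chat template, "<think>\n"
# may already be appended automatically; drop it from the prefix in that case.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto", torch_dtype="auto")

def caption_prefixed_reasoning(question: str, image_caption: str) -> str:
    prefix = (
        "<think>\nHmm, to solve this problem, let's first take a look at the image. "
        f"{image_caption}\n\nNow"
    )
    # Build the user turn, then seed the assistant turn with the prefix.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    ) + prefix
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
    continuation = tokenizer.decode(output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return prefix + continuation
```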
There are two important switches for entering and exiting the caption mode:
- Enter: `let's first take a look at the image.`
- Exit: `Now`
Tip
It is worth noting that the exit switch originates from the original R1 thought process, which helps the model stop captioning and avoid hallucinations.
This method yields well-formatted thoughts and solutions without the need for heavy post-processing such as LLM reformatting. Note that we can further diversify the switch styles by simple string replacement, as sketched below.
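For example, a post-hoc substitution over the generated thoughts might look like the following; the variant phrasings are purely illustrative, not our actual list.

```python
# Illustrative sketch of diversifying the enter/exit switches by string
# replacement; the alternative phrasings below are examples only.
import random

ENTER = "let's first take a look at the image."
EXIT = "\n\nNow"

ENTER_VARIANTS = [
    "let me start by examining the image.",
    "I'll begin by looking at the image.",
]
EXIT_VARIANTS = ["\n\nOkay", "\n\nAlright"]

def diversify_switches(thought: str) -> str:
    # Replace only the first occurrence, which is the switch itself.
    thought = thought.replace(ENTER, random.choice(ENTER_VARIANTS), 1)
    return thought.replace(EXIT, random.choice(EXIT_VARIANTS), 1)
```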
The text rendering pipeline consists of two steps:
- Reformatting the original question with an LLM.
- Rendering the reformatted LaTeX files as images (see the sketch below).
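A minimal sketch of the rendering step, assuming `pdflatex` and the `pdf2image` package are installed; the preamble, file names, and helper are illustrative rather than our exact setup.

```python
# Illustrative sketch: compile a reformatted LaTeX question to PDF with
# pdflatex, then rasterize it to a PNG for the multimodal training set.
import subprocess
import tempfile
from pathlib import Path

from pdf2image import convert_from_path  # pip install pdf2image

def render_question(latex_body: str, out_png: str, dpi: int = 200) -> None:
    tex_source = (
        "\\documentclass[border=10pt,varwidth]{standalone}\n"
        "\\usepackage{amsmath}\n"
        "\\begin{document}\n"
        f"{latex_body}\n"
        "\\end{document}\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        tex_path = Path(tmp) / "question.tex"
        tex_path.write_text(tex_source, encoding="utf-8")
        subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", tex_path.name],
            cwd=tmp, check=True, capture_output=True,
        )
        page = convert_from_path(str(Path(tmp) / "question.pdf"), dpi=dpi)[0]
        page.save(out_png)

render_question(r"Solve for $x$: \[ x^2 - 5x + 6 = 0 \]", "question.png")
```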
TODO: Train and evaluate TextHawk2-7B and Qwen2.5-VL-7B.
TODO: Explore RL for LVLMs.
If you find this project useful in your research, please consider citing:
@misc{yu25r1vision,
author = {Ya{-}Qi Yu and Minghui Liao and Feilong Chen and Jihao Wu and Chao Weng},
title = {R1-Vision: Let's first take a look at the image},
howpublished = {\url{https://github.com/yuyq96/R1-Vision}},
note = {Accessed: 2025-02-08},
year = {2025}
}
R1-Vision is built with reference to the code or data of the following projects: DeepSeek-R1, Bespoke-Stratos-17k, AI2D, ScienceQA, PixMo. Thanks for their awesome work!