Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen

TL;DR
This paper explores fine-tuning multimodal large language models with long-form textual reasoning data to enable slow-thinking capabilities, demonstrating that language-based reasoning transfer is effective across modalities.
Contribution
It introduces Virgo, a multimodal slow-thinking system trained with textual reasoning data, showing that slow-thinking abilities are primarily linked to the language model component.
Findings
Textual reasoning data effectively transfers slow-thinking capabilities.
Long-form textual reasoning data can be more effective than visual reasoning data in eliciting slow-thinking.
Slow-thinking capacities are fundamentally associated with the language model component.
Abstract
Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
