OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
Yuanhao Yue, Chengyu Wang, Yuanjie Lyu, Lei Shen, Jun Huang

TL;DR
OmniThoughtVis introduces a scalable pipeline for creating high-quality multimodal reasoning data and distilling large models into smaller, efficient models with improved reasoning capabilities for real-world deployment.
Contribution
The paper presents a novel data curation and distillation pipeline that transfers reasoning skills from large models to smaller ones, enhancing their performance on multimodal tasks.
Findings
Distilled models outperform baseline models on multiple benchmarks.
The pipeline creates a high-quality dataset of 1.8 million samples.
Smaller models achieve performance comparable to or better than that of larger models.
Abstract
Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering,…
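The abstract is cut off before it lists the remaining filtering stages, so the following is only a minimal sketch of the rule-based filtering step, applied to records that carry the three jointly annotated fields (reasoning difficulty, answer quality, semantic task tags). The `Sample` class, its field names, the 1 to 5 scoring scales, and the thresholds are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """One annotated CoT record. All field names are hypothetical."""
    question: str
    cot: str                  # teacher-generated chain-of-thought trace
    answer: str
    difficulty: int           # assumed scale: 1 (easy) to 5 (hard)
    quality: int              # assumed scale: 1 (poor) to 5 (excellent)
    tags: List[str] = field(default_factory=list)


def passes_rules(s: Sample, min_quality: int = 4, max_cot_chars: int = 8000) -> bool:
    """Illustrative rule-based checks; thresholds are assumptions, not the paper's."""
    if not s.cot.strip() or not s.answer.strip():
        return False          # drop records with an empty trace or answer
    if len(s.cot) > max_cot_chars:
        return False          # drop runaway generations
    if s.quality < min_quality:
        return False          # keep only highly rated answers
    return bool(s.tags)       # require at least one semantic task tag


def filter_samples(samples: List[Sample]) -> List[Sample]:
    """Return only the records that pass every rule."""
    return [s for s in samples if passes_rules(s)]


pool = [
    Sample("q1", "step 1 ... step 2", "42", difficulty=3, quality=5, tags=["math"]),
    Sample("q2", "", "7", difficulty=2, quality=5, tags=["chart"]),
    Sample("q3", "reasoning", "cat", difficulty=1, quality=2, tags=["vqa"]),
]
kept = filter_samples(pool)
print([s.question for s in kept])  # prints ['q1']
```

Keeping each rule as a separate early return makes it easy to log which rule rejected a record, which may be useful when tuning thresholds on a pool of this size.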
