OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

Yuanhao Yue; Chengyu Wang; Yuanjie Lyu; Lei Shen; Jun Huang

arXiv:2605.11629·cs.CL·May 13, 2026

TL;DR

OmniThoughtVis introduces a scalable pipeline for creating high-quality multimodal reasoning data and distilling large models into smaller, efficient models with improved reasoning capabilities for real-world deployment.

Contribution

The paper presents a data curation and distillation pipeline that transfers multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs, improving the smaller models' performance on multimodal tasks.

Findings

01. Distilled models outperform baseline models on multiple benchmarks.

02. The pipeline creates a high-quality dataset of 1.8 million samples.

03. Smaller models achieve comparable or better performance than larger models.

Abstract

Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering,…
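
The annotation and rule-based filtering steps mentioned in the abstract could be represented roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `CoTSample` record, its field names, the difficulty and quality scales, and the `min_quality` threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoTSample:
    """One teacher-annotated sample. All field names are hypothetical."""
    question: str
    image_path: str
    reasoning: str   # teacher-generated chain-of-thought trace
    answer: str
    difficulty: int  # assumed scale: 1 (easy) to 5 (hard)
    quality: float   # assumed answer-quality score in [0, 1]
    tags: List[str] = field(default_factory=list)  # semantic task tags


def passes_rules(sample: CoTSample, min_quality: float = 0.5) -> bool:
    """Illustrative rule-based filter; the rules and threshold are placeholders."""
    # Reject samples with an empty reasoning trace or an empty answer.
    if not sample.reasoning.strip() or not sample.answer.strip():
        return False
    # Reject samples whose annotated answer quality is below the threshold.
    return sample.quality >= min_quality


def filter_samples(samples: List[CoTSample], min_quality: float = 0.5) -> List[CoTSample]:
    """Keep only the samples that pass every rule, preserving input order."""
    return [s for s in samples if passes_rules(s, min_quality)]
```

Keeping the difficulty, quality, and tag annotations on the same record as the trace lets later selection steps filter or rebalance the data without re-querying the teacher model.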
