Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models
Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu

TL;DR
Insight-V++ is a multi-agent framework, paired with new data synthesis and training algorithms, that significantly improves long-chain visual reasoning in multimodal large language models on both images and videos.
Contribution
The paper introduces a unified multi-agent reasoning framework with new data generation and training algorithms, enabling enhanced long-horizon visual reasoning in multimodal LLMs.
Findings
Significant performance improvements on image and video reasoning benchmarks.
Effective self-improving training loop with iterative reasoning path generation.
Preserves capabilities on perception-focused tasks.
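As a rough illustration of what a self-improving loop with iterative reasoning path generation could look like, here is a minimal sketch. It is not the authors' implementation: the function name `self_improve`, the callbacks `generate`, `score`, and `update`, and the parameters `rounds`, `samples`, and `threshold` are all hypothetical, and the paper's actual assessment and training steps are not described in this summary.

```python
from typing import Callable, List, Tuple

# Hypothetical representation of one reasoning path: (reasoning text, final answer).
Path = Tuple[str, str]


def self_improve(
    questions: List[str],
    generate: Callable[[str], Path],               # assumed: samples one reasoning path
    score: Callable[[str, Path], float],           # assumed: quality assessment in [0, 1]
    update: Callable[[List[Tuple[str, Path]]], None],  # assumed: trains on accepted paths
    rounds: int = 3,
    samples: int = 4,
    threshold: float = 0.5,
) -> List[Tuple[str, Path]]:
    """Toy loop: sample paths, keep the best one per question, retrain, repeat."""
    pool: List[Tuple[str, Path]] = []
    for _ in range(rounds):
        accepted: List[Tuple[str, Path]] = []
        for q in questions:
            # Sample several candidate reasoning paths for the same question.
            candidates = [generate(q) for _ in range(samples)]
            # Keep only the highest-scoring candidate, and only if it passes the bar.
            best = max(candidates, key=lambda p: score(q, p))
            if score(q, best) >= threshold:
                accepted.append((q, best))
        pool.extend(accepted)
        # Feed this round's accepted paths back into the model before the next round.
        update(accepted)
    return pool
```

In this sketch the model only improves if `update` changes what `generate` produces in later rounds; with fixed stubs the loop simply accumulates the same accepted paths each round.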
Abstract
Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
