Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong; Zuyan Liu; Shulin Tian; Yongming Rao; Ziwei Liu

arXiv:2603.18118 · cs.CV · March 20, 2026

TL;DR

Insight-V++ is a multi-agent framework for multimodal large language models that pairs novel data synthesis with new training algorithms, significantly improving long-chain visual reasoning on images and videos.

Contribution

The paper introduces a unified multi-agent reasoning framework with new data generation and training algorithms, enabling enhanced long-horizon visual reasoning in multimodal LLMs.

Findings

01. Significant performance improvements on image and video reasoning benchmarks.

02. Effective self-improving training loop with iterative reasoning path generation.

03. Preserves capabilities on perception-focused tasks.

Abstract

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis