Watch Wider and Think Deeper: Collaborative Cross-modal Chain-of-Thought for Complex Visual Reasoning

Wenting Lu; Didi Zhu; Tao Shen; Donglin Zhu; Ayong Ye; Chao Wu

arXiv:2601.02422·cs.CV·January 7, 2026

TL;DR

The paper introduces CoCoT, a novel multi-modal reasoning framework that dynamically grounds relevant image regions and enables multi-region collaboration, significantly improving complex visual reasoning accuracy.

Contribution

It proposes the CoCoT framework with dynamic multi-region grounding and relation-aware reasoning, along with a large-scale dataset for structured multi-region visual reasoning.

Findings

1. Achieves a 15.4% accuracy improvement on LLaVA-1.5
2. Achieves a 4.0% accuracy improvement on Qwen2-VL
3. Demonstrates effectiveness across six benchmarks

Abstract

Multi-modal reasoning requires the seamless integration of visual and linguistic cues, yet existing Chain-of-Thought methods suffer from two critical limitations in cross-modal scenarios: (1) over-reliance on single coarse-grained image regions, and (2) semantic fragmentation between successive reasoning steps. To address these issues, we propose the CoCoT (Collaborative Cross-modal Thought) framework, built upon two key innovations: a) Dynamic Multi-Region Grounding to adaptively detect the most relevant image regions based on the question, and b) Relation-Aware Reasoning to enable multi-region collaboration by iteratively aligning visual cues to form a coherent and logical chain of thought. Through this approach, we construct the CoCoT-70K dataset, comprising 74,691 high-quality samples with multi-region annotations and structured reasoning chains. Extensive experiments demonstrate…
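The two ideas named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Region` fields, the relevance scores, the threshold, and the step templates are all hypothetical, chosen only to show how a question-dependent set of regions might be selected and then linked into an ordered chain.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """A candidate image region. All fields are hypothetical."""
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    relevance: float  # assumed question-conditioned score in [0, 1]


def select_regions(candidates: List[Region],
                   threshold: float = 0.5,
                   max_regions: int = 3) -> List[Region]:
    """Keep the highest-scoring regions at or above a threshold.

    The number of regions returned depends on the scores, so it can
    differ from question to question (the 'dynamic' part of this toy).
    """
    kept = [r for r in candidates if r.relevance >= threshold]
    kept.sort(key=lambda r: r.relevance, reverse=True)
    return kept[:max_regions]


def build_chain(regions: List[Region]) -> List[str]:
    """Emit one step per region, then one step per adjacent pair."""
    steps = [f"Look at {r.label} at {r.box}." for r in regions]
    steps += [
        f"Relate {a.label} to {b.label}."
        for a, b in zip(regions, regions[1:])
    ]
    return steps


# Made-up candidates for a question such as "What is the cup resting on?"
candidates = [
    Region("cup", (10, 20, 60, 90), 0.91),
    Region("table", (0, 80, 200, 160), 0.74),
    Region("window", (150, 0, 200, 50), 0.12),
]
selected = select_regions(candidates)
chain = build_chain(selected)
```

In this sketch the low-scoring `window` region is dropped, and the chain contains one step for each remaining region followed by a step relating them. A real system would obtain regions and scores from a vision-language model rather than from hand-written values.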

Code & Models

Datasets

echo-deer/cocot
dataset · 358 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning