Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Kun Ouyang; Yuanxin Liu; Linli Yao; Yishuo Cai; Hao Zhou; Jie Zhou; Fandong Meng; Xu Sun

arXiv:2510.20470·cs.CV·November 21, 2025

Open Access

TL;DR

Conan is a framework for evidence-grounded multi-step video reasoning: it identifies relevant frames, reasons over cross-frame clues, and adaptively decides when to conclude, improving accuracy over baseline models by more than 10% on six benchmarks.

Contribution

The paper introduces Conan, a multi-stage progressive learning framework for grounded video reasoning, together with Conan-91K, a large-scale dataset of automatically generated reasoning traces, and an RL training strategy.

Findings

01. Outperforms baseline models by over 10% in accuracy on six benchmarks.

02. Effectively generalizes to long video understanding tasks.

03. Achieves state-of-the-art performance in multi-step video reasoning.

Abstract

Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies context and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we 1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and 2) design a multi-stage…
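The loop the abstract describes (identify context and evidence frames, reason over cross-frame clues, then decide whether to conclude or explore further) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Step` record, the `Action` enum, the `run_loop` function, the `max_steps` cap, and the toy policy are hypothetical and are not taken from the paper or its code. In the actual system the policy would be a trained MLLM; a hand-written stub stands in for it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Action(Enum):
    EXPLORE = "explore"    # inspect more frames before answering
    CONCLUDE = "conclude"  # stop and give an answer


@dataclass
class Step:
    """One step of a trace: frame identification, evidence reasoning, action decision."""
    context_frames: List[int]
    evidence_frames: List[int]
    reasoning: str
    action: Action
    next_frames: List[int] = field(default_factory=list)  # used only when exploring
    answer: str = ""                                      # used only when concluding


# Hypothetical policy interface: (question, frames seen so far) -> Step.
Policy = Callable[[str, List[int]], Step]


def run_loop(question: str, initial_frames: List[int],
             policy: Policy, max_steps: int = 5) -> List[Step]:
    """Run the identify / reason / decide loop until CONCLUDE or max_steps."""
    seen = list(initial_frames)
    trace: List[Step] = []
    for _ in range(max_steps):
        step = policy(question, seen)
        trace.append(step)
        if step.action is Action.CONCLUDE:
            break
        # Add newly requested frames, skipping any already seen.
        for frame in step.next_frames:
            if frame not in seen:
                seen.append(frame)
    return trace


def toy_policy(question: str, seen: List[int]) -> Step:
    """Stub policy (assumption): the decisive clue is in frame 40."""
    if 40 in seen:
        return Step(context_frames=[f for f in seen if f != 40],
                    evidence_frames=[40],
                    reasoning="Frame 40 shows the clue.",
                    action=Action.CONCLUDE,
                    answer="B")
    return Step(context_frames=list(seen),
                evidence_frames=[],
                reasoning="No decisive clue yet.",
                action=Action.EXPLORE,
                next_frames=[max(seen) + 20])


trace = run_loop("Who took the key?", [0], toy_policy)
for i, step in enumerate(trace, start=1):
    print(i, step.action.value, step.evidence_frames, step.reasoning)
```

With the stub policy, the loop explores twice (requesting frames 20 and then 40) and concludes on the third step once the evidence frame is in view. The `max_steps` cap is a design choice of this sketch so that a policy that never concludes still terminates.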

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning