Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Xiangyu Zeng; Zhiqiu Zhang; Yuhan Zhu; Xinhao Li; Zikang Wang; Changlian Ma; Qingyu Zhang; Zizheng Huang; Kun Ouyang; Tianxiang Jiang; Ziang Yan; Yi Wang; Hongjie Zhang; Yali Wang; Limin Wang

arXiv:2601.23224 · cs.CV · May 22, 2026

TL;DR

Video-o3 is an iterative, evidence-focused framework for long-video reasoning. It improves accuracy by locating sparse but critical clues, inspecting key segments closely, and stopping once enough evidence has been gathered, while keeping multi-turn context growth under control.

Contribution

The paper proposes Task-Decoupled Attention Masking and a Verifiable Trajectory-Guided Reward for multi-turn reasoning, along with a large-scale dataset for training and evaluation.

Findings

01. Achieves 72.1% accuracy on MLVU

02. Achieves 46.5% accuracy on Video-Holmes

03. Outperforms state-of-the-art methods in long-video reasoning

Abstract

Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with…
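
As a rough illustration of the loop the abstract describes (find a clue, inspect the segment, stop when the evidence suffices), the following Python sketch may help. It is not the authors' implementation: `seek_clues`, `propose_segment`, `inspect_segment`, `has_enough`, `max_turns`, and the toy captions are all hypothetical names and data introduced here. The actual system presumably relies on a multimodal model and tool calls rather than these stand-in callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


@dataclass
class Step:
    segment: Segment
    evidence: str


def seek_clues(
    propose_segment: Callable[[List[Step]], Optional[Segment]],
    inspect_segment: Callable[[Segment], str],
    has_enough: Callable[[List[Step]], bool],
    max_turns: int = 8,
) -> List[Step]:
    """Pick a segment, inspect it, and stop once the evidence suffices."""
    trajectory: List[Step] = []
    for _ in range(max_turns):
        segment = propose_segment(trajectory)
        if segment is None:  # nothing left worth inspecting
            break
        trajectory.append(Step(segment, inspect_segment(segment)))
        if has_enough(trajectory):  # adaptive termination
            break
    return trajectory


# Toy stand-ins (hypothetical): the "video" is a dict of segment -> caption.
captions = {
    (0.0, 10.0): "a man enters",
    (10.0, 20.0): "he hides a key",
    (20.0, 30.0): "credits",
}


def propose_segment(trajectory: List[Step]) -> Optional[Segment]:
    # Propose the first segment that has not been inspected yet.
    seen = {step.segment for step in trajectory}
    for segment in captions:
        if segment not in seen:
            return segment
    return None


def inspect_segment(segment: Segment) -> str:
    # Stand-in for fine-grained inspection of a key segment.
    return captions[segment]


def has_enough(trajectory: List[Step]) -> bool:
    # Stand-in sufficiency check: stop once the key has been seen.
    return any("key" in step.evidence for step in trajectory)


trajectory = seek_clues(propose_segment, inspect_segment, has_enough)
```

In this toy run the loop stops after the second segment, without ever inspecting the third. The `max_turns` cap is one simple way to bound context growth; the paper instead describes a learned reward for that purpose, which this sketch does not model.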

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning