REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li; Hao Yin; Wenhui Tan; Jingyang Chen; Boshen Xu; Yuxun Qu; Yijing Chen; Jianzhong Ju; Zhenbo Luo; Jian Luan

arXiv:2511.13026·cs.CV·May 15, 2026

TL;DR

REVISOR introduces a multimodal introspective reasoning framework that enhances long-form video understanding by integrating visual and textual reflection, utilizing a novel reward mechanism for better evidence alignment.

Contribution

The paper presents REVISOR, a tool-augmented multimodal reflection framework that improves reasoning in long-form video understanding without extra supervised fine-tuning.

Findings

01. Significantly improves performance on four video understanding benchmarks.

02. Enables better visual-textual reasoning without additional supervised fine-tuning.

03. Incorporates a Dual Attribution Decoupled Reward mechanism for causal alignment.

Abstract

Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations, for two fundamental reasons: (1) long-form video understanding involves richer and more dynamic visual input, so rethinking only the textual information is insufficient and a further rethinking process specifically targeting visual information is needed; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective…
