Decoupling Perception from Reasoning for Hallucination-Resistant Video Understanding

Bowei Pu, Chuanbin Liu, Yifan Ge, Peicheng Zhou, Yiwei Sun, Zhiying Lu, Zhangchi Hu, and Hongtao Xie

arXiv:2511.18463·cs.CV·March 13, 2026

TL;DR

This paper introduces a structured approach to separate perception from reasoning in video understanding models, improving hallucination resistance and reasoning accuracy through explicit supervision and perception-based rewards.

Contribution

The paper proposes Decoupled Perception and Logic (DPL), which represents perception as structured, fixed-format evidence units, together with a perception reward built on that representation, to improve hallucination resistance and reasoning in video models.

Findings

01. Video-DPL improves hallucination resistance.

02. Structured perception enables better alignment and supervision.

03. Training is more data-efficient.

Abstract

Video Large Language Models improve reasoning over complex videos by generating intermediate reasoning text. However, reliable reasoning depends on accurate video perception. In existing approaches, perception evidence is intertwined with reasoning text, making it difficult to directly supervise the perception process. We argue that reliable supervision requires explicitly separating perception evidence from reasoning so that perception can be verified independently. To supervise perception directly, we propose Decoupled Perception and Logic (DPL), which represents perception as fixed-format evidence units containing timestamps and visual descriptions. This structured representation enables direct extraction of perception content and simplifies alignment between video segments and reward evaluation. Building on DPL, we introduce a perception reward that encourages both hallucination…
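
To make the idea of fixed-format evidence units concrete, here is a minimal sketch of how perception content could be pulled out of a model response and kept apart from the reasoning text. The abstract does not specify the actual format, so everything here is an assumption made for illustration: the `<perception>` and `<reasoning>` tags, the `[start-end] description` line layout with times in seconds, and the names `EvidenceUnit`, `split_output`, and `extract_evidence` are all hypothetical, not the authors' implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceUnit:
    """One perception evidence unit: a time span plus a visual description."""
    start_s: float
    end_s: float
    description: str


# Hypothetical line format (an assumption, not the paper's):
#   [12.0-18.5] a man opens the door
_UNIT_RE = re.compile(r"^\[(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s*(.+)$")


def _section(tag: str, text: str) -> str:
    """Return the text inside <tag>...</tag>, or "" if the tag is absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else ""


def split_output(output: str) -> tuple[str, str]:
    """Separate a response into (perception text, reasoning text)."""
    return _section("perception", output), _section("reasoning", output)


def extract_evidence(perception_text: str) -> list[EvidenceUnit]:
    """Parse fixed-format lines into evidence units, skipping malformed lines."""
    units = []
    for line in perception_text.splitlines():
        m = _UNIT_RE.match(line.strip())
        if m is None:
            continue  # line does not follow the assumed fixed format
        start, end = float(m.group(1)), float(m.group(2))
        if end < start:
            continue  # reject spans that run backwards in time
        units.append(EvidenceUnit(start, end, m.group(3).strip()))
    return units
```

Because each unit carries its own time span, a checker can look up the matching video segment for one unit at a time without reading the reasoning text at all, which is the kind of independent verification the abstract argues for.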

Taxonomy

Topics: Human Pose and Action Recognition · Visual Attention and Saliency Detection · Adversarial Robustness in Machine Learning