Decoupling Perception from Reasoning for Hallucination-Resistant Video Understanding
Bowei Pu, Chuanbin Liu, Yifan Ge, Peicheng Zhou, Yiwei Sun, Zhiying Lu, Zhangchi Hu, and Hongtao Xie

TL;DR
This paper introduces a structured approach to separate perception from reasoning in video understanding models, improving hallucination resistance and reasoning accuracy through explicit supervision and perception-based rewards.
Contribution
It proposes Decoupled Perception and Logic (DPL), which represents perception as structured, fixed-format evidence units, together with a perception reward, to improve hallucination resistance and reasoning in video models.
Findings
Video-DPL improves hallucination resistance.
Structured perception enables better alignment and supervision.
Training achieves higher data efficiency.
Abstract
Video Large Language Models improve reasoning over complex videos by generating intermediate reasoning text. However, reliable reasoning depends on accurate video perception. In existing approaches, perception evidence is intertwined with reasoning text, making it difficult to directly supervise the perception process. We argue that reliable supervision requires explicitly separating perception evidence from reasoning so that perception can be verified independently. To supervise perception directly, we propose Decoupled Perception and Logic (DPL), which represents perception as fixed-format evidence units containing timestamps and visual descriptions. This structured representation enables direct extraction of perception content and simplifies alignment between video segments and reward evaluation. Building on DPL, we introduce a perception reward that encourages both hallucination…
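To make the idea of fixed-format evidence units concrete, here is a minimal Python sketch of how perception content could be pulled out of a model response and separated from the reasoning text. The abstract only states that each unit contains timestamps and a visual description. The `<evidence>[start-end] description</evidence>` tag syntax, the `EvidenceUnit` class, and the function names below are assumptions made for illustration, not the paper's actual format or code.

```python
import re
from dataclasses import dataclass
from typing import List

# ASSUMPTION: a hypothetical serialization of one evidence unit, e.g.
#   <evidence>[1.0-3.5] a man opens a door</evidence>
# The paper's real format is not given in the abstract.
EVIDENCE_RE = re.compile(
    r"<evidence>\s*\[(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\]\s*(.*?)\s*</evidence>",
    re.DOTALL,
)


@dataclass(frozen=True)
class EvidenceUnit:
    """One perception claim: a time span (seconds) plus a visual description."""
    start: float
    end: float
    description: str


def extract_evidence(output: str) -> List[EvidenceUnit]:
    """Collect every well-formed evidence unit found in a model response."""
    units = []
    for start, end, description in EVIDENCE_RE.findall(output):
        s, e = float(start), float(end)
        if s <= e:  # skip units whose time span is inverted
            units.append(EvidenceUnit(s, e, description))
    return units


def strip_evidence(output: str) -> str:
    """Return the response with evidence units removed (the reasoning text)."""
    return EVIDENCE_RE.sub("", output).strip()


sample = (
    "<evidence>[1.0-3.5] a man opens a door</evidence>\n"
    "<evidence>[4-6] he picks up a cup</evidence>\n"
    "The man entered before picking up the cup."
)
units = extract_evidence(sample)
reasoning = strip_evidence(sample)
```

Because each unit carries its own time span, it can be matched against the corresponding video segment and checked on its own, independently of the reasoning text that follows it.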
Taxonomy
Topics: Human Pose and Action Recognition · Visual Attention and Saliency Detection · Adversarial Robustness in Machine Learning
