STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

Linfeng Fan; Yuan Tian; Ziwei Li; Zhiwu Lu

arXiv:2604.03045 · cs.CV · April 6, 2026

TL;DR

STEAR is a layer-aware intervention framework that reduces hallucinations in Video-LLMs by targeting specific layers with visual evidence, improving faithfulness and temporal consistency.

Contribution

STEAR introduces a layer-aware evidence intervention method that selectively corrects hallucinations at different decoder layers in Video-LLMs.

Findings

01. Consistently reduces spatial and temporal hallucinations across benchmarks.

02. Improves faithfulness, temporal consistency, and robustness of Video-LLMs.

03. Efficient single-encode inference framework for hallucination mitigation.

Abstract

Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure, applying globally shared correction rules. We instead observe that decoder layers contribute differently to visual grounding and later linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. It uses this shared evidence for two coupled purposes: restoring missing local grounding in middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent…
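
To make the idea of intervening only at high-risk decoding steps more concrete, the following is a minimal sketch and not the authors' implementation. It rests on several assumptions that the abstract does not confirm: risk is approximated by the entropy of the next-token distribution, the temporal perturbation is a simple frame reversal, and the counterfactual is used through a standard contrastive-decoding style logit adjustment. It works on output logits only and does not model the middle-layer evidence restoration. All function names, `threshold`, and `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_high_risk(logits, threshold=1.0):
    """ASSUMPTION: a step counts as high-risk when its entropy exceeds a threshold."""
    return entropy(softmax(logits)) > threshold


def reverse_frames(patch_features):
    """ASSUMPTION: the temporal perturbation is a reversal of frame order.

    patch_features is a list with one entry per frame (each entry holds that
    frame's patch features); the patches inside each frame are left unchanged.
    """
    return patch_features[::-1]


def contrast(original_logits, counterfactual_logits, alpha=0.5):
    """Contrastive-decoding style adjustment (a swapped-in, generic technique).

    Tokens that stay likely under the perturbed video are down-weighted.
    """
    return [
        (1.0 + alpha) * o - alpha * c
        for o, c in zip(original_logits, counterfactual_logits)
    ]


def adjust_step(original_logits, counterfactual_logits, threshold=1.0, alpha=0.5):
    """Apply the contrast only at steps flagged as high-risk."""
    if not is_high_risk(original_logits, threshold):
        return list(original_logits)
    return contrast(original_logits, counterfactual_logits, alpha)
```

In this sketch a confident step passes through untouched, while an uncertain step is pushed away from tokens that the temporally perturbed input still supports. In a real pipeline the counterfactual logits would come from a second forward pass over the perturbed features; how STEAR obtains them within a single encode is not specified in the text above.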
