Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
Jiafeng Liang, Zhihao Zhu, Zihan Zhang, Baoqi Ren, Shixin Jiang, Runxuan Liu, Tao Ren, Ming Liu, See-Kiong Ng, Bing Qin

TL;DR
This paper introduces ProCauEval, a perturbation-based evaluation protocol for diagnosing causal discovery deficits in Large Multimodal Models (LMMs). It shows that LMMs rely on textual priors over visual content, and it proposes ADPO to improve visual grounding.
Contribution
The paper presents ProCauEval for mechanism diagnosis in LMMs and proposes ADPO, a reinforcement learning method for reducing reliance on textual priors during causal reasoning.
Findings
Models perceive video content faithfully but underexploit it in causal reasoning.
Stronger post-training increases reliance on textual priors.
Higher baseline performance correlates with greater fragility under perturbation.
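The fragility finding can be made concrete with a small sketch. The code below is not the paper's implementation: the function names, the configuration labels ("original", "video_removed", "text_perturbed"), and the toy predictions are all hypothetical placeholders, and the paper's five configurations are not reproduced here. It only illustrates one plausible way to measure fragility, as the drop in accuracy between a baseline configuration and each perturbed one.

```python
from typing import Dict, List


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_drop(
    preds_by_config: Dict[str, List[str]],
    answers: List[str],
    baseline: str = "original",
) -> Dict[str, float]:
    """Baseline accuracy minus accuracy under each perturbed configuration.

    A larger positive value means the model is more fragile under that
    perturbation. The configuration names are assumptions for illustration.
    """
    base_acc = accuracy(preds_by_config[baseline], answers)
    return {
        name: base_acc - accuracy(preds, answers)
        for name, preds in preds_by_config.items()
        if name != baseline
    }


# Toy data, invented purely to exercise the functions (not results from the paper).
answers = ["A", "B", "C", "D"]
preds_by_config = {
    "original": ["A", "B", "C", "A"],
    "video_removed": ["A", "B", "C", "A"],
    "text_perturbed": ["A", "C", "D", "A"],
}
drops = accuracy_drop(preds_by_config, answers)
```

In this toy setup, a drop near zero when the video is removed would be consistent with a model that answers from textual priors, while a large drop under a text perturbation would indicate sensitivity to the wording rather than to the visual evidence.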
Abstract
Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We…
