VISTA: Mitigating Semantic Inertia in Video-LLMs via Training-Free Dynamic Chain-of-Thought Routing
Hongbo Jin, Jiayu Ding, Siyi Xie, Guibo Luo, Ge Li

TL;DR
VISTA is a training-free framework that improves video-language reasoning by dynamically routing inference paths and aligning perception with logic, addressing Semantic Inertia in Video-LLMs.
Contribution
It introduces a novel, training-free method for Video-LLMs that mitigates Semantic Inertia through dynamic routing and explicit visual-textual alignment.
Findings
VISTA outperforms base models by 9.3% on EgoSchema.
VISTA improves accuracy by 5.6% on VideoEspresso.
VISTA rivals larger proprietary models on the reported benchmarks.
Abstract
Recent advancements in Large Language Models have successfully transitioned towards System 2 reasoning, yet applying these paradigms to video understanding remains challenging. While prevailing research attributes failures in Video-LLMs to perceptual limitations, our empirical analysis reveals a cognitive misalignment termed Semantic Inertia, where models suppress valid visual evidence in favor of dominant language priors. To rectify this, we propose VISTA, a training-free framework designed to align perception with logical deduction. By dynamically routing inference paths and materializing implicit visual features into explicit textual anchors, our approach effectively counterbalances the influence of parametric knowledge. Furthermore, we incorporate a Latent Reasoning Consensus mechanism to mitigate stochastic hallucinations. VISTA showed outstanding results on a wide range of…
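The abstract names three ingredients (dynamic routing of inference paths, explicit textual anchors for visual evidence, and a consensus step over reasoning outputs) but does not specify how any of them is implemented. The sketch below is only an illustration of how such pieces could fit together, not the authors' method. Every name in it (`choose_path`, `build_anchored_prompt`, `consensus_answer`, the `needs_reasoning_score` input, and the 0.5 threshold) is hypothetical, and plain majority voting stands in for the Latent Reasoning Consensus mechanism, whose actual design is not described here.

```python
from collections import Counter
from typing import List


def choose_path(needs_reasoning_score: float, threshold: float = 0.5) -> str:
    """Hypothetical router: use a chain-of-thought path only when the
    (assumed) score says the question needs multi-step reasoning."""
    return "chain_of_thought" if needs_reasoning_score >= threshold else "direct"


def build_anchored_prompt(question: str, frame_captions: List[str]) -> str:
    """Hypothetical anchoring step: write visual evidence out as text so the
    model has to condition on it explicitly."""
    anchors = "\n".join(
        f"[frame {i}] {caption}" for i, caption in enumerate(frame_captions)
    )
    return f"Visual evidence:\n{anchors}\nQuestion: {question}"


def consensus_answer(candidates: List[str]) -> str:
    """Stand-in for a consensus step: plain majority vote over sampled
    answers. Ties go to the answer that appeared first."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return Counter(candidates).most_common(1)[0][0]


if __name__ == "__main__":
    path = choose_path(needs_reasoning_score=0.8)
    prompt = build_anchored_prompt(
        "What does the person do after opening the drawer?",
        ["a hand opens a drawer", "the hand picks up a spoon"],
    )
    print(path)
    print(prompt)
    print(consensus_answer(["B", "A", "B"]))
```

In a real pipeline the score and the candidate answers would come from the Video-LLM itself; here they are passed in as plain values so the control flow can be read in isolation.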
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
Methods: Balanced Selection
