VISTA: Mitigating Semantic Inertia in Video-LLMs via Training-Free Dynamic Chain-of-Thought Routing

Hongbo Jin; Jiayu Ding; Siyi Xie; Guibo Luo; Ge Li

arXiv:2505.11830 · cs.CV · January 8, 2026

Open Access

TL;DR

VISTA is a training-free framework that improves video-language reasoning by dynamically routing inference paths and aligning perception with logic, addressing Semantic Inertia in Video-LLMs.

Contribution

It introduces a novel, training-free method for Video-LLMs that mitigates Semantic Inertia through dynamic routing and explicit visual-textual alignment.

Findings

01

VISTA outperforms base models by 9.3% on EgoSchema.

02

VISTA improves accuracy by 5.6% on VideoEspresso.

03

VISTA rivals larger, proprietary models on benchmarks.

Abstract

Recent advancements in Large Language Models have successfully transitioned towards System 2 reasoning, yet applying these paradigms to video understanding remains challenging. While prevailing research attributes failures in Video-LLMs to perceptual limitations, our empirical analysis reveals a cognitive misalignment termed Semantic Inertia, where models suppress valid visual evidence in favor of dominant language priors. To rectify this, we propose VISTA, a training-free framework designed to align perception with logical deduction. By dynamically routing inference paths and materializing implicit visual features into explicit textual anchors, our approach effectively counterbalances the influence of parametric knowledge. Furthermore, we incorporate a Latent Reasoning Consensus mechanism to mitigate stochastic hallucinations. VISTA showed outstanding results on a wide range of…
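The abstract does not spell out how the Latent Reasoning Consensus mechanism works. As a loose illustration of the general idea of reaching consensus across several sampled reasoning paths, the sketch below uses a plain majority vote over final answers (a self-consistency style stand-in, not necessarily what VISTA does). The function name `majority_vote` and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the fraction of samples agreeing with it.

    Assumption: each sampled reasoning path ends in a short multiple-choice
    answer string. This is a generic stand-in, not the paper's mechanism.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize so that "B", "b" and "B " count as the same choice.
    normalized = [a.strip().upper() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Hypothetical answers from five stochastic decoding runs on one question.
sampled = ["B", "b", "C", "B ", "A"]
answer, agreement = majority_vote(sampled)
print(answer, agreement)
```

A low agreement score could be treated as a signal that the sampled paths disagree and the answer deserves less trust, which is the intuition behind using consensus to damp stochastic hallucinations.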

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)

Methods: Balanced Selection