V-CORE: Temporally Consistent Video Understanding for Video-LLM

Zhengjian Kang; Qi Chen; Rui Liu; Kangtong Mo; Xingyu Zhang; Xiaoyu Deng; Ye Zhang

arXiv:2601.01804 · cs.CV · March 17, 2026

TL;DR

V-CORE is a parameter-efficient framework that imposes explicit temporal ordering constraints on video understanding, improving the causal reasoning and temporal coherence of Video-LLMs.

Contribution

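The paper proposes a parameter-efficient architecture that combines structured unidirectional temporal modeling with spatial token selection. This addresses a limitation of earlier bidirectional approaches, in which later frames can influence the representations of earlier ones and blur temporal order.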
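To make the bidirectional versus unidirectional distinction concrete, here is a minimal sketch of self-attention over per-frame features with an optional causal restriction. It is not the paper's implementation: the function name `temporal_attention` is hypothetical, and using raw features as queries, keys, and values (with no learned weights) is a simplifying assumption made only for illustration.

```python
import math


def temporal_attention(frames, causal=True):
    """Single-head self-attention over per-frame feature vectors.

    frames: list of T feature vectors (lists of floats) in temporal order.
    causal=True lets frame t attend only to frames 0..t (unidirectional);
    causal=False lets every frame attend to every frame (bidirectional).
    Assumption: queries, keys and values are the raw features themselves.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for t in range(num_frames):
        # Frames that query frame t is allowed to see.
        visible = range(t + 1) if causal else range(num_frames)
        scores = [
            scale * sum(q * k for q, k in zip(frames[t], frames[s]))
            for s in visible
        ]
        # Numerically stable softmax over the visible frames only.
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the visible frames' features.
        outputs.append([
            sum(w * frames[s][d] for w, s in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs
```

With `causal=True`, editing the last frame leaves the outputs for all earlier frames unchanged. With `causal=False`, the same edit changes every output, which is the leakage from later to earlier frames that the unidirectional design is meant to rule out.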

Findings

01. Achieves 61.2% accuracy on the NExT-QA benchmark.

02. Improves temporal and causal reasoning performance by +3.5% and +5.2%, respectively.

03. Maintains competitive results across multiple video QA datasets.

Abstract

Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which…
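The spatial token selection attributed to LSA above can be pictured with a hard top-k filter over one frame's tokens. This is only an illustrative stand-in: the abstract does not say how LSA scores or aggregates tokens, so the function `select_salient_tokens`, the externally supplied `scores`, and the hard top-k rule are all assumptions rather than the authors' method.

```python
def select_salient_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens of one frame, in spatial order.

    tokens: list of N token feature vectors for a single frame.
    scores: list of N saliency scores. These are hypothetical inputs; in a
            trained model they would come from a learned scoring module.
    k:      number of tokens to keep (1 <= k <= N).
    Returns (kept_tokens, kept_indices).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    # Rank indices by descending score; ties go to the earlier index.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore the original spatial order among the survivors.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept], kept
```

Applying this per frame before temporal modeling shrinks the sequence from N tokens per frame to k, which is one way the redundancy reduction described in the abstract could lower the cost of the later stages.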

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)