V-CORE: Temporally Consistent Video Understanding for Video-LLM
Zhengjian Kang, Qi Chen, Rui Liu, Kangtong Mo, Xingyu Zhang, Xiaoyu Deng, Ye Zhang

TL;DR
V-CORE is a framework that imposes explicit temporal ordering constraints on video understanding, improving causal reasoning and temporal coherence in Video-LLMs while remaining efficient.
Contribution
The paper proposes a parameter-efficient architecture that combines structured unidirectional temporal modeling with spatial token selection, addressing the limitations of earlier bidirectional approaches.
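As a rough illustration of what spatial token selection can look like, the sketch below keeps the k highest-scoring tokens of a single frame. It is not the paper's method: the function name `select_top_k_tokens`, the fixed `scores` array (a stand-in for scores that would be learned), and the choice of plain top-k are all assumptions made for this example.

```python
import numpy as np

def select_top_k_tokens(tokens, scores, k):
    """Keep the k highest-scoring spatial tokens of one frame.

    tokens: (num_tokens, dim) array of token features.
    scores: (num_tokens,) array of saliency scores (hypothetical; in a real
            model these would come from a learned scoring module).
    """
    top = np.argsort(scores)[-k:]   # indices of the k largest scores
    keep = np.sort(top)             # restore the original spatial order
    return tokens[keep], keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))                       # 6 tokens, 4 dims
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])      # made-up saliency
selected, kept = select_top_k_tokens(tokens, scores, k=3)
print(kept)            # indices of the retained tokens
print(selected.shape)  # (3, 4)
```

Halving the tokens per frame in this way shrinks the sequence the language model has to process, which is the redundancy reduction the summary refers to.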
Findings
Achieves 61.2% accuracy on the NExT-QA benchmark.
Improves temporal and causal reasoning performance by 3.5% and 5.2%, respectively.
Maintains competitive results across multiple video QA datasets.
Abstract
Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which…
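The difference between bidirectional and unidirectional inter-frame modeling described in the abstract can be seen in a small sketch. This is a generic causal attention mask over frame features, not the paper's CATP (whose description is truncated above); the function `frame_attention` and the random features are assumptions for illustration only.

```python
import numpy as np

def frame_attention(frames, causal):
    """Single-head self-attention over per-frame features of shape (T, d)."""
    T, d = frames.shape
    logits = frames @ frames.T / np.sqrt(d)            # (T, T) frame-to-frame scores
    if causal:
        # Block positions j > i so frame i never attends to later frames.
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        logits = np.where(future, -np.inf, logits)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ frames, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))      # 4 frames, 8-dim features (made up)
perturbed = frames.copy()
perturbed[-1] += 1.0                  # change only the last frame

out_causal, w_causal = frame_attention(frames, causal=True)
out_causal2, _ = frame_attention(perturbed, causal=True)
out_bi, _ = frame_attention(frames, causal=False)
out_bi2, _ = frame_attention(perturbed, causal=False)

# Do the representations of the earlier frames stay the same?
earlier_unchanged_causal = np.allclose(out_causal[:-1], out_causal2[:-1])
earlier_unchanged_bidirectional = np.allclose(out_bi[:-1], out_bi2[:-1])
print(earlier_unchanged_causal, earlier_unchanged_bidirectional)
```

With the mask applied, editing the last frame leaves every earlier frame's output untouched; without it, the earlier outputs shift, which is the "later frames influencing earlier representations" effect the abstract describes.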
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
