TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding

Zongsheng Cao, Yangfan He, Anran Liu, Feng Chen, Zepeng Wang, and Jun Xie

arXiv:2512.23483 · cs.CV · December 30, 2025

TL;DR

TV-RAG is a training-free framework that improves long-video retrieval and understanding by combining temporal alignment with semantic entropy weighting.

Contribution

It introduces a novel, training-free architecture combining temporal offset alignment and entropy-guided key-frame sampling for better long-video reasoning.
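As a rough illustration of what entropy-guided key-frame sampling could look like, the sketch below scores each frame by the Shannon entropy of a per-frame probability distribution and keeps the k highest-scoring frames. This is an assumption-laden toy, not the authors' method: the names `shannon_entropy` and `select_key_frames`, the idea that each frame comes with a distribution over semantic labels, and the rule "higher entropy means more informative" are all hypothetical choices made for the example.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_key_frames(frame_dists, k):
    """Return the indices of the k highest-entropy frames, in temporal order.

    frame_dists is a hypothetical input: one probability distribution per
    frame (for example, over semantic labels). How such a distribution is
    obtained is outside the scope of this sketch.
    """
    entropies = [shannon_entropy(d) for d in frame_dists]
    # Rank frame indices by entropy, highest first, and keep the top k.
    top = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)[:k]
    # Restore temporal order so downstream consumers see frames in sequence.
    return sorted(top)


if __name__ == "__main__":
    dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
    print(select_key_frames(dists, 2))
```

Returning the selected indices in temporal order is a design choice for the sketch; a real pipeline might instead weight frames continuously rather than applying a hard top-k cut.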

Findings

01. Outperforms leading baselines on Video-MME, MLVU, and LongVideoBench.

02. Provides a lightweight, training-free upgrade for existing LVLMs.

03. Effectively captures fine-grained semantic shifts over extended video durations.

Abstract

Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and…
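To make the idea of injecting temporal offsets into a similarity computation concrete, here is a minimal sketch in which a cosine similarity is multiplied by an exponential decay over the time gap between a segment and a reference timestamp. The abstract does not give the actual formula, so everything specific here is an assumption: the exponential form, the `decay_rate` parameter, the `query_time` reference point, and the function names are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def time_decay_scores(query_vec, query_time, segments, decay_rate=0.1):
    """Score each segment as cosine similarity times exp(-decay_rate * |offset|).

    segments is a list of (timestamp_seconds, embedding) pairs. query_time is
    a hypothetical reference timestamp the query is anchored to; the decay
    form is an assumed stand-in, not the formula from the paper.
    """
    scores = []
    for timestamp, embedding in segments:
        offset = abs(timestamp - query_time)
        weight = math.exp(-decay_rate * offset)
        scores.append(cosine(query_vec, embedding) * weight)
    return scores


def rank_segments(query_vec, query_time, segments, decay_rate=0.1):
    """Return segment indices ordered from highest to lowest decayed score."""
    scores = time_decay_scores(query_vec, query_time, segments, decay_rate)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    segments = [(0.0, [1.0, 0.0]), (10.0, [1.0, 0.0]), (0.0, [0.0, 1.0])]
    print(rank_segments([1.0, 0.0], 0.0, segments))
```

In this toy setup, two segments with identical embeddings are separated purely by their temporal offset, which is the effect a time-aware ranking is meant to capture; with `decay_rate=0` the ranking falls back to plain cosine similarity.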

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling