TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
Zongsheng Cao, Yangfan He, Anran Liu, Feng Chen, Zepeng Wang, and Jun Xie

TL;DR
TV-RAG is a training-free framework that improves long-video retrieval and understanding by combining temporal alignment with semantic entropy weighting.
Contribution
It introduces a novel, training-free architecture combining temporal offset alignment and entropy-guided key-frame sampling for better long-video reasoning.
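As a rough illustration of what entropy-guided key-frame sampling could look like, the sketch below scores each frame by the Shannon entropy of a per-frame semantic distribution and keeps the highest-scoring frames. This page does not specify how TV-RAG does it, so several things here are assumptions: each frame is assumed to come with a probability distribution over semantic concepts, higher entropy is assumed to mark a frame worth keeping, and the names `shannon_entropy` and `select_key_frames` are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_key_frames(frame_distributions, k):
    """Return the indices of the k highest-entropy frames, in temporal order.

    frame_distributions: one probability distribution per frame (assumed input;
    how such distributions are produced is not described in the source).
    """
    entropies = [shannon_entropy(d) for d in frame_distributions]
    # Rank frame indices by entropy, highest first (ties keep temporal order).
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    # Restore temporal order so downstream models see frames chronologically.
    return sorted(ranked[:k])


frames = [
    [1.0, 0.0, 0.0, 0.0],      # one concept dominates: zero entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximum entropy
    [0.5, 0.5, 0.0, 0.0],
    [0.7, 0.1, 0.1, 0.1],
]
key_frames = select_key_frames(frames, k=2)
```

Selecting the top `k` frames is only one option; sampling frames with probability proportional to their entropy would be an equally plausible reading of "entropy-guided".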
Findings
Outperforms leading baselines on Video-MME, MLVU, and LongVideoBench.
Provides a lightweight, training-free upgrade for existing LVLMs.
Effectively captures fine-grained semantic shifts over extended video durations.
Abstract
Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and…
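The time-decay retrieval idea in mechanism (i) can be sketched as a similarity score that is discounted by temporal distance. The abstract does not give the actual formula, so this is a minimal sketch under stated assumptions: cosine similarity between embeddings, an exponential decay `exp(-decay_rate * |offset|)`, and a single reference timestamp for the query. The names `time_decay_scores`, `rank_chunks`, and `decay_rate` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def time_decay_scores(query_vec, query_time, chunks, decay_rate=0.01):
    """Score each (embedding, timestamp_seconds) chunk against the query.

    The exponential decay over the temporal offset is an assumed form,
    not the formula used by TV-RAG.
    """
    scores = []
    for embedding, timestamp in chunks:
        offset = abs(timestamp - query_time)
        scores.append(cosine(query_vec, embedding) * math.exp(-decay_rate * offset))
    return scores


def rank_chunks(query_vec, query_time, chunks, decay_rate=0.01):
    """Return chunk indices sorted from highest to lowest decayed score."""
    scores = time_decay_scores(query_vec, query_time, chunks, decay_rate)
    return sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)


# Toy subtitle/audio chunks: (embedding, timestamp in seconds).
chunks = [
    ([1.0, 0.0], 0.0),    # same content, at the reference time
    ([1.0, 0.0], 100.0),  # same content, 100 s away
    ([0.0, 1.0], 0.0),    # unrelated content, at the reference time
]
scores = time_decay_scores([1.0, 0.0], 0.0, chunks)
ranking = rank_chunks([1.0, 0.0], 0.0, chunks)
```

In this toy setup, two chunks with identical embeddings receive different scores purely because of their temporal offset, which is the effect the abstract attributes to injecting temporal offsets into the similarity computation.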
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling
