LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
Qifeng Cai, Hao Liang, Zhaoyang Han, Hejun Dong, Meiyi Qiang, Ruichuan An, Quanqing Xu, Bin Cui, Wentao Zhang

TL;DR
LoVR is a new benchmark for long video-text retrieval featuring longer videos, detailed captions, and a scalable annotation framework, designed to challenge and advance multimodal video understanding methods.
Contribution
The paper introduces LoVR, a comprehensive benchmark with high-quality annotations and a novel caption generation pipeline for long videos, addressing limitations of existing datasets.
Findings
LoVR is more challenging than existing benchmarks.
Current models show limited performance on LoVR.
The proposed annotation framework improves caption quality.
Abstract
Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video…
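The abstract describes the caption pipeline only at a high level: VLM generation, quality scoring, then dynamic refinement. As a rough illustration, the loop below sketches one way such a generate, score, refine cycle could be wired together. It is not the authors' implementation. The function `caption_with_refinement`, its `threshold` and `max_rounds` parameters, and the toy `toy_generate`, `toy_score`, and `toy_refine` stand-ins are all hypothetical names introduced here; a real system would replace the stand-ins with a VLM and a learned or rule-based scorer.

```python
from typing import Callable, Tuple


def caption_with_refinement(
    clip_id: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    refine: Callable[[str, str, float], str],
    threshold: float = 0.8,   # hypothetical quality cutoff
    max_rounds: int = 3,      # hypothetical refinement budget
) -> Tuple[str, float]:
    """Generate a caption, then refine it until its score reaches the threshold."""
    caption = generate(clip_id)
    quality = score(clip_id, caption)
    rounds = 0
    while quality < threshold and rounds < max_rounds:
        candidate = refine(clip_id, caption, quality)
        candidate_quality = score(clip_id, candidate)
        # Keep a refinement only if it actually scores higher.
        if candidate_quality > quality:
            caption, quality = candidate, candidate_quality
        rounds += 1
    return caption, quality


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_generate(clip_id: str) -> str:
    return "a person walks"


def toy_score(clip_id: str, caption: str) -> float:
    # Pretend that longer captions are more detailed, capped at 1.0.
    return min(1.0, len(caption.split()) / 10)


def toy_refine(clip_id: str, caption: str, quality: float) -> str:
    return caption + " through a busy market"


final_caption, final_quality = caption_with_refinement(
    "clip_0", toy_generate, toy_score, toy_refine
)
```

Capping the number of rounds keeps the cost per clip bounded, which matters when the same loop runs over tens of thousands of clips.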
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
