LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

Qifeng Cai; Hao Liang; Zhaoyang Han; Hejun Dong; Meiyi Qiang; Ruichuan An; Quanqing Xu; Bin Cui; Wentao Zhang

arXiv:2505.13928·cs.CV·February 5, 2026

TL;DR

LoVR is a new benchmark for long video-text retrieval featuring longer videos, detailed captions, and a scalable annotation framework, designed to challenge and advance multimodal video understanding methods.

Contribution

The paper introduces LoVR, a benchmark with high-quality annotations and a caption generation pipeline for long videos, addressing the limited video duration, low-quality captions, and coarse annotation granularity of existing datasets.

Findings

01. LoVR is more challenging than existing benchmarks.

02. Current models show limited performance on LoVR.

03. The proposed annotation framework improves caption quality.

Abstract

Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video…
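
The abstract names three stages for clip captioning: automatic VLM generation, caption quality scoring, and dynamic refinement. A minimal sketch of how such a loop could be wired together follows. It is an illustration only, not the authors' implementation: the function names (`annotate_clip`, `toy_generate`, `toy_score`, `toy_refine`), the `threshold` and `max_rounds` parameters, and the word-count scoring rule are all assumptions made for this example, and the toy callables stand in for real model calls.

```python
from typing import Callable, Tuple


def annotate_clip(
    clip_id: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    refine: Callable[[str, str, float], str],
    threshold: float = 0.8,   # hypothetical acceptance score
    max_rounds: int = 3,      # hypothetical cap on refinement passes
) -> Tuple[str, float, int]:
    """Generate a caption, score it, and refine it until it passes.

    Returns the best caption found, its score, and the number of
    refinement rounds that were run.
    """
    caption = generate(clip_id)
    quality = score(clip_id, caption)
    rounds = 0
    while quality < threshold and rounds < max_rounds:
        candidate = refine(clip_id, caption, quality)
        candidate_quality = score(clip_id, candidate)
        # Keep a refined caption only if the scorer rates it higher.
        if candidate_quality > quality:
            caption, quality = candidate, candidate_quality
        rounds += 1
    return caption, quality, rounds


# Toy stand-ins so the sketch runs without any model (assumptions).
def toy_generate(clip_id: str) -> str:
    return "a person"


def toy_score(clip_id: str, caption: str) -> float:
    # Pretend that longer captions are more detailed; capped at 1.0.
    return min(1.0, len(caption.split()) / 6)


def toy_refine(clip_id: str, caption: str, quality: float) -> str:
    return caption + " with detail"


if __name__ == "__main__":
    print(annotate_clip("clip-0", toy_generate, toy_score, toy_refine))
```

Passing the three stages in as callables keeps the loop independent of any particular VLM or scoring model, so each stage can be swapped without touching the control flow.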

Code & Models

Repositories

technomad-ds/lovr-benchmark
PyTorch · Official

Datasets

debugger123/LoVR-benchmark
dataset · 142 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques