MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

Yue Feng; Jinwei Hu; Qijia Lu; Jiawei Niu; Li Tan; Shuo Yuan; Ziyi Yan; Yizhen Jia; Qingzhi He; Shiping Ge; Ethan Q. Chen; Wentong Li; Limin Wang; Jie Qin

arXiv:2510.21406 · cs.CV · October 27, 2025

Open Access

TL;DR

This paper introduces MUVR, a comprehensive benchmark for multi-modal untrimmed video retrieval focusing on long videos, multi-level visual correspondence, and evaluation of retrieval models and multimodal language models.

Contribution

It presents a new retrieval task, benchmark dataset, multi-level visual correspondence framework, and evaluation criteria for untrimmed long-video retrieval using multi-modal queries.

Findings

1) State-of-the-art models struggle with untrimmed videos and multi-modal queries.

2) Multi-level visual correspondence improves retrieval accuracy.

3) MLLMs show limitations in multi-video understanding and reranking.

Abstract

We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want…
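To make the one-to-many setting described in the abstract concrete, here is a minimal sketch in Python. It is not the paper's model or its evaluation protocol. It assumes, purely for illustration, that each untrimmed video is stored as a list of per-segment embedding vectors, that a video's score is its best-matching segment's cosine similarity to the query, and that quality is summarized with average precision over several relevant videos. The function names (`score_video`, `rank_videos`, `average_precision`) and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_video(query_vec, segment_vecs):
    """Assumption: an untrimmed video is relevant if any one segment matches,
    so the video score is the maximum segment similarity."""
    return max(cosine(query_vec, seg) for seg in segment_vecs)


def rank_videos(query_vec, videos):
    """Return video ids sorted from highest to lowest score."""
    scored = [(vid, score_video(query_vec, segs)) for vid, segs in videos.items()]
    return [vid for vid, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]


def average_precision(ranking, relevant):
    """Average precision for one query with a set of relevant video ids."""
    hits = 0
    total = 0.0
    for rank, vid in enumerate(ranking, start=1):
        if vid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy corpus: hypothetical 2-D segment embeddings for three untrimmed videos.
videos = {
    "news_a": [[1.0, 0.0], [0.0, 1.0]],    # only the first segment matches
    "travel_b": [[0.0, 1.0], [0.1, 0.9]],  # no segment matches well
    "news_c": [[0.9, 0.1]],
}
query = [1.0, 0.0]

ranking = rank_videos(query, videos)
ap = average_precision(ranking, {"news_a", "news_c"})
print(ranking, ap)
```

Max-pooling over segments is only one possible design choice. It rewards a video for containing a single relevant segment, which fits the "untrimmed videos containing relevant segments" wording, but it ignores how much of the video is relevant.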

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization