Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

Zhuoning Guo; Mingxin Li; Yanzhao Zhang; Dingkun Long; Pengjun Xie; Xiaowen Chu

arXiv:2510.27571·cs.CV·November 3, 2025

Open Access · 2 Models · 1 Dataset · 3 Reviews

TL;DR

This paper introduces a comprehensive framework for universal video retrieval, including a new benchmark, a large-scale data synthesis process, and a curriculum-based training method, significantly improving zero-shot generalization across diverse tasks.

Contribution

It presents the UVRB benchmark, a scalable data synthesis workflow, and the Modality Pyramid curriculum, enabling the GVE model to generalize across multiple video retrieval tasks.

Findings

01. GVE achieves state-of-the-art zero-shot performance on UVRB.

02. Popular benchmarks poorly predict general video retrieval ability.

03. Partially relevant retrieval is a common but overlooked scenario.

Abstract

The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE)…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Holistic Framework: The central strength is the novel co-design of evaluation, data, and modeling. This holistic approach breaks the cycle of narrow benchmarks leading to specialized models and provides a scalable path forward.
- Comprehensive Benchmark: The creation of a large-scale, diagnostic benchmark is a significant and lasting contribution that will benefit the entire research community.
- SOTA Performance: The proposed GVE model demonstrates impressive state-of-the-art performance in a

Weaknesses

- Narrow Domain Coverage: UVRB does not include specialized domains (e.g., medical, industrial, surveillance), where visual semantics and query intent differ significantly. Extending the benchmark to these domains would enhance generalizability claims.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper investigates a new Universal Video Retrieval (UVR) task and evaluates the proposed method across a diverse set of benchmarks, demonstrating strong performance relative to existing baseline approaches.
- The proposed method and architecture are relatively simple in design, yet they prove to be effective across a wide range of video retrieval tasks.

Weaknesses

- While the paper argues that UVRB is a new benchmark, it seems like the benchmark is just a combination of prior works.
- The distinction between the proposed approach and prior work, such as UNITE, is not clearly articulated, making it difficult to assess the novelty of the contribution.
- The data generation pipeline should be compared with existing baselines; however, such comparisons are either missing or insufficiently discussed, limiting the understanding of its advantages or uniqueness.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Holistic Framework: The main strength is the ambitious and well-executed "evaluation-data-training" co-design. This approach moves beyond incremental model improvements to address a systemic issue in the field.
- Comprehensive Benchmark (UVRB): The creation of UVRB is a major contribution that allows for a much more nuanced and diagnostic evaluation of video retrieval models than was previously possible.
- Strong Empirical Results: The GVE model demonstrates superior zero-shot performance across n

Weaknesses

Reliance on Synthetic Data: While the synthesis pipeline is sophisticated, it relies on an MLLM captioner. This introduces a potential for model-inherent biases or systematic errors in the training data that may not reflect real-world human annotations. The authors were clearly aware of the "garbage in, garbage out" problem. Their V-SynFlow pipeline includes a "Multi-granular Quality Control" stage as a first line of defense. This pre-filtering aims to ensure the MLLM captioner starts with a cle

Code & Models

Models

Datasets

Alibaba-NLP/UVRB
dataset · 1.4k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Video Analysis and Summarization