Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu

TL;DR
This paper introduces a comprehensive framework for universal video retrieval, including a new benchmark, a large-scale data synthesis process, and a curriculum-based training method, significantly improving zero-shot generalization across diverse tasks.
Contribution
It presents the UVRB benchmark, a scalable data synthesis workflow, and the Modality Pyramid curriculum, enabling the GVE model to generalize across multiple video retrieval tasks.
Findings
GVE achieves state-of-the-art zero-shot performance on UVRB.
Popular benchmarks poorly predict general video retrieval ability.
Partially relevant retrieval is a common but overlooked scenario.
Abstract
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE)…
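To make the idea of scoring one embedding model across a suite of datasets concrete, here is a minimal sketch of a zero-shot retrieval evaluation. It is an illustration under stated assumptions, not the UVRB protocol: the abstract does not name the metrics, so Recall@K, cosine similarity, and the unweighted average over datasets are all assumed here, and the function names, toy embeddings, and dataset names (`toy_a`, `toy_b`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(query_embs, video_embs, relevant, k=1):
    """Fraction of queries whose top-k videos contain at least one relevant video.

    relevant[i] is the set of video indices judged relevant to query i.
    """
    hits = 0
    for i, q in enumerate(query_embs):
        # Rank all candidate videos by similarity to the query, best first.
        ranked = sorted(
            range(len(video_embs)),
            key=lambda j: cosine(q, video_embs[j]),
            reverse=True,
        )
        if relevant[i] & set(ranked[:k]):
            hits += 1
    return hits / len(query_embs)


def macro_average(per_dataset_scores):
    """Unweighted mean over datasets, so large datasets do not dominate."""
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)


# Toy embeddings (assumed): two text queries and three candidate videos.
queries = [[1.0, 0.0], [0.0, 1.0]]
videos = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
relevant = [{0}, {2}]

# Hypothetical per-dataset scores; a real run would use one entry per dataset.
scores = {
    "toy_a": recall_at_k(queries, videos, relevant, k=1),
    "toy_b": recall_at_k(queries, videos, relevant, k=2),
}
overall = macro_average(scores)
```

Reporting the per-dataset scores alongside the average is what would let a benchmark of this kind act as a diagnostic: a single aggregate number can hide a task or domain where the model fails.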
Peer Reviews
Decision: Submitted to ICLR 2026
- Holistic Framework: The central strength is the novel co-design of evaluation, data, and modeling. This holistic approach breaks the cycle of narrow benchmarks leading to specialized models and provides a scalable path forward.
- Comprehensive Benchmark: The creation of a large-scale, diagnostic benchmark is a significant and lasting contribution that will benefit the entire research community.
- SOTA Performance: The proposed GVE model demonstrates impressive state-of-the-art performance in a
- Narrow Domain Coverage: UVRB does not include specialized domains (e.g., medical, industrial, surveillance), where visual semantics and query intent differ significantly. Extending the benchmark to these domains would enhance generalizability claims.
- The paper investigates a new Universal Video Retrieval (UVR) task and evaluates the proposed method across a diverse set of benchmarks, demonstrating strong performance relative to existing baseline approaches.
- The proposed method and architecture are relatively simple in design, yet they prove to be effective across a wide range of video retrieval tasks.
- While the paper argues that UVRB is a new benchmark, it seems like the benchmark is just a combination of prior works.
- The distinction between the proposed approach and prior work, such as UNITE, is not clearly articulated, making it difficult to assess the novelty of the contribution.
- The data generation pipeline should be compared with existing baselines; however, such comparisons are either missing or insufficiently discussed, limiting the understanding of its advantages or uniqueness.
- Holistic Framework: The main strength is the ambitious and well-executed "evaluation-data-training" co-design. This approach moves beyond incremental model improvements to address a systemic issue in the field.
- Comprehensive Benchmark (UVRB): The creation of UVRB is a major contribution that allows for a much more nuanced and diagnostic evaluation of video retrieval models than was previously possible.
- Strong Empirical Results: The GVE model demonstrates superior zero-shot performance across n
- Reliance on Synthetic Data: While the synthesis pipeline is sophisticated, it relies on an MLLM captioner. This introduces a potential for model-inherent biases or systematic errors in the training data that may not reflect real-world human annotations. The authors were clearly aware of the "garbage in, garbage out" problem. Their V-SynFlow pipeline includes a "Multi-granular Quality Control" stage as a first line of defense. This pre-filtering aims to ensure the MLLM captioner starts with a cle
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Video Analysis and Summarization
