U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs
Xiaojie Li, Chu Li, Shi-Zhe Chen, Xi Chen

TL;DR
This paper systematically investigates the key factors that influence embedding learning for universal multimodal retrieval with MLLMs, introduces U-MARVEL, a unified framework, and demonstrates its superior performance across multiple benchmarks and tasks.
Contribution
It uncovers overlooked factors affecting retrieval performance, proposes the U-MARVEL framework, and achieves state-of-the-art results in supervised and zero-shot multimodal retrieval tasks.
Findings
Key factors like progressive transition, hard negative mining, and re-ranker distillation significantly impact performance.
U-MARVEL outperforms existing methods on the M-BEIR benchmark.
U-MARVEL shows strong zero-shot generalization on image and video retrieval tasks.
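As an illustration of hard negative mining with filtering of presumed false negatives, here is a minimal sketch. It is not the paper's procedure: this summary does not state the filtering criterion, so the function name `mine_hard_negatives`, the `margin` rule, and the toy scores are all assumptions. The sketch also assumes the positive's similarity score is greater than zero.

```python
def mine_hard_negatives(scores, positive_id, k=2, margin=0.95):
    """Return up to k hard negatives for one query.

    scores: dict mapping candidate id -> similarity to the query.
    Candidates scoring at or above margin * positive score are treated as
    presumed false negatives and skipped (an assumed rule, not the paper's).
    """
    cutoff = margin * scores[positive_id]
    # Rank every non-positive candidate from most to least similar.
    ranked = sorted(
        (cid for cid in scores if cid != positive_id),
        key=lambda cid: scores[cid],
        reverse=True,
    )
    # Keep the hardest candidates that fall below the cutoff.
    kept = [cid for cid in ranked if scores[cid] < cutoff]
    return kept[:k]


# Hypothetical similarity scores for a single query.
scores = {"pos": 0.80, "a": 0.79, "b": 0.60, "c": 0.55, "d": 0.10}
hard_negatives = mine_hard_negatives(scores, "pos")
print(hard_negatives)
```

In this toy example, candidate "a" scores almost as high as the positive and is dropped as a likely false negative, so the two hardest remaining candidates are kept. How sensitive results are to such a cutoff is a separate question that the sketch does not address.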
Abstract
Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of MLLMs. While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization ability. To address these issues, we present a comprehensive study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline, and systematically analyze the primary contributors to high-performing universal retrieval systems. Based on…
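The contrastive learning principle mentioned in the abstract can be sketched with an InfoNCE-style loss over in-batch negatives. This is a generic illustration, not the U-MARVEL training recipe: the function names, the fixed `temperature=0.05`, and the toy vectors are assumptions, and real embeddings would come from an MLLM rather than hand-written lists.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries, candidates, temperature=0.05):
    """Mean InfoNCE loss; candidates[i] is the positive for queries[i].

    Every other candidate in the batch serves as a negative for query i.
    """
    q = [l2_normalize(v) for v in queries]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, cj) / temperature for cj in c]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(q)


# Toy batch: aligned pairs should give a much lower loss than swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce_loss(aligned, aligned)
loss_swapped = info_nce_loss(aligned, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each query is most similar to its own candidate and large when another candidate in the batch scores higher, which is the behavior that contrastive embedding training relies on.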
Peer Reviews
Decision: ICLR 2026 Poster
Strengths of this work include:
- Variants of the base architecture capture most state-of-the-art systems for universal multimodal retrieval with VLLM backbones.
- The configuration/hyperparameter search space addresses several relevant questions and provides good evidence for particular design choices.
- The resulting empirical performance is strong across multiple datasets and settings and is compared to multiple recent strong baseline systems.
Weaknesses of this work include:
- While this is partially attributable to space limitations, it was difficult to assess the soundness of the precise method without reading the appendices (I would make some different choices in terms of details, etc.). In the same vein, many of the empirical results in the appendices are useful and not incorporated into the discussion. Finally, with respect to the empirical results, the discussion/analysis is mostly 'just' restating the tables with limited inte
1. The introduction of U-MARVEL as a unified framework is a significant contribution. It successfully integrates multiple advanced techniques to create a more efficient and effective retrieval system, and the results demonstrate clear improvements in performance compared to the state-of-the-art methods.
2. The framework is validated through extensive experiments on the M-BEIR benchmark, where U-MARVEL outperforms other methods in multiple retrieval tasks. It also shows strong zero-shot generaliz
1. Retrieval and reranking are common pipelines in information retrieval. The authors apply this framework to MLLMs, seemingly leveraging the powerful capabilities of MLLMs to obtain better embeddings. The authors should further clarify the differences and novelty of the MLLM-based framework compared to traditional retrieval frameworks.
2. The authors use Qwen2-VL-7B as the base model for experiments. However, the performance of this retrieval framework on other MLLMs is unclear. The authors sho
1. The paper offers actionable, well-motivated findings (embedding extraction, instruction masking, progressive transition, filtering of hard negatives, distillation).
2. Backbone, datasets, and training configs are explicitly listed (tables in the Appendix). This makes replication feasible.
1. Limited novelty: this work is just a combination of existing tricks in retriever training. Some techniques have already been presented and discussed in previous work, such as mean token and bidirectional attention [1], and learnable temperature [2]. Porting them from LLMs to MLLMs offers little scientific significance.
2. Insufficient detail on hard-negative filtering criteria: the choice and sensitivity of the threshold for filtering presumed false negatives need more analysis.
3. Some experimental comparis
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
