Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval

Fanheng Kong; Jingyuan Zhang; Yahui Liu; Hongzhi Zhang; Shi Feng; Xiaocui Yang; Daling Wang; Yu Tian; Victoria W.; Fuzheng Zhang; Guorui Zhou

arXiv:2505.19650·cs.CV·May 28, 2025

TL;DR

This paper introduces UNITE, a comprehensive framework for multimodal information retrieval that emphasizes data curation and modality-aware training to improve cross-modal representations and achieve state-of-the-art results.

Contribution

The work presents the first systematic analysis of modality-specific data properties and introduces MAMCL, a novel contrastive learning method for better cross-modal alignment.

Findings

01. Achieves state-of-the-art results on multiple MIR benchmarks.
02. Demonstrates the importance of modality curation and tailored training protocols.
03. Provides a foundational blueprint for future multimodal research.

Abstract

Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal gaps in feature spaces, a systematic approach to address these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among the instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks,…
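
To make the masking idea concrete, the following is a minimal sketch of a modality-masked InfoNCE loss. It is not taken from the paper or the UNITE repository. The masking rule (keep only candidates whose modality matches that of the positive target), the temperature value, the toy similarity scores, and every function and variable name are assumptions made for illustration.

```python
import math

def masked_info_nce(sim, pos_idx, cand_modality, temperature=0.05, use_mask=True):
    """Mean InfoNCE loss over queries (illustrative, hypothetical helper).

    sim[i][j]     : similarity between query i and candidate j
    pos_idx[i]    : index of the positive candidate for query i
    cand_modality : modality label per candidate, e.g. "text" or "video"
    use_mask      : if True, drop candidates whose modality differs from
                    the positive's modality (assumed masking rule)
    """
    losses = []
    for i, row in enumerate(sim):
        target_mod = cand_modality[pos_idx[i]]
        # The positive always survives the mask because it shares its own modality.
        kept = [j for j in range(len(row))
                if not use_mask or cand_modality[j] == target_mod]
        logits = [row[j] / temperature for j in kept]
        # Numerically stable log-sum-exp over the kept candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - row[pos_idx[i]] / temperature)
    return sum(losses) / len(losses)

# Toy batch: two queries, four candidates of two modalities (made-up scores).
sim = [
    [0.9, 0.2, 0.8, 0.1],
    [0.3, 0.7, 0.2, 0.6],
]
pos_idx = [0, 3]
cand_modality = ["text", "text", "video", "video"]

plain = masked_info_nce(sim, pos_idx, cand_modality, use_mask=False)
masked = masked_info_nce(sim, pos_idx, cand_modality, use_mask=True)
print(f"plain InfoNCE: {plain:.4f}  masked: {masked:.4f}")
```

Under this assumed rule, the masked denominator is a subset of the unmasked one and always contains the positive, so the masked loss can never exceed the plain loss on the same batch. Candidates from other modalities simply stop competing with the target.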

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* Comprehensive evaluation across datasets and tasks.
* Simple, yet effective strategy.
* Good experimental results that indicate effectiveness of the proposed strategy.
* Generally speaking, the study seems well designed and well conducted.

Weaknesses

* Writing style is too distracting. Every time the paper indicates that something is critical or crucial, it is not.
* The contribution seems incremental, consisting of tweaks to existing models and resting primarily on data curation.
* The paper reports results that it says contradict previous observations, but no reference results or citations are provided.
* The interpretation of results and the insights are limited to highlighting numeric differences.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper introduces a novel Modality-Aware Masked Contrastive Learning (MAMCL) approach that extends contrastive learning to better accommodate heterogeneous modalities. The framework's ability to integrate video retrieval alongside text and image retrieval broadens its applicability and underscores the model's versatility in handling complex multimodal data. Comprehensive evaluations across diverse multimodal benchmarks substantiate the benefits of both MAMCL and the modality-aware data design

Weaknesses

While the paper presents a well-motivated and empirically supported approach, several aspects could be strengthened to enhance its clarity and overall impact.
1. Frozen projector and vision encoder: The authors freeze the projector and vision encoder, but do not analyze the implications of this choice. It remains unclear how fine-tuning these components, particularly during instruction tuning, might affect multimodal alignment and retrieval performance. An ablation study comparing frozen versus tr

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Broad generalization and strong performance in video retrieval: the proposed UNITE models perform strongly across various retrieval scenarios, tasks, and granularities. On WebVid-CoVR, UNITE_instruct-7B exceeds baselines under their reported settings.
2. Proper ablations: The paper includes a dedicated MAMCL ablation (Table 7) and a full training-data composition analysis (TT/TI/TV mix, under fixed data budget).
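
A composition analysis under a fixed data budget can be pictured with the small sketch below. It only illustrates the bookkeeping: the reading of TT/TI/TV as text-text, text-image, and text-video pairs, the budget of 1000 samples, the mix ratios, and the helper name `allocate_budget` are all assumptions, not values or code from the paper.

```python
def allocate_budget(total, ratios):
    """Split a fixed sample budget across data sources in proportion to `ratios`.

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    weight_sum = sum(ratios.values())
    exact = {name: total * w / weight_sum for name, w in ratios.items()}
    counts = {name: int(x) for name, x in exact.items()}
    leftover = total - sum(counts.values())
    # Give the remaining samples to the sources with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts

# Hypothetical mixes compared at the same total budget.
BUDGET = 1000
mixes = {
    "balanced": {"TT": 1, "TI": 1, "TV": 1},
    "text-heavy": {"TT": 2, "TI": 1, "TV": 1},
    "video-heavy": {"TT": 1, "TI": 1, "TV": 2},
}
allocations = {label: allocate_budget(BUDGET, r) for label, r in mixes.items()}
for label, counts in allocations.items():
    print(label, counts)
```

Holding the total constant is what lets differences between runs be attributed to the mix rather than to the amount of data.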

Weaknesses

1. Marginal performance of the MAMCL component: While MAMCL is conceptually sound, its average gains are small (about +0.3 overall on MMEB, avg of +0.5 on WebVid-CoVR with 7B parameters), and it can trade off specific metrics (e.g., CoVR R@5 at 7B). I recommend deeper analysis of when and why it helps.
2. Lack of efficiency analysis: MAMCL changes the effective negative set via a modality mask, but the paper does not report compute comparisons to standard InfoNCE; only high-level training setup (e.g
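
For readers unfamiliar with the R@5 notation used in this review, here is a minimal sketch of Recall@K as it is commonly defined for retrieval: the fraction of queries whose correct target appears among the top K ranked candidates. The function name, the toy rankings, and the single-gold-target assumption are illustrative and do not come from the paper's evaluation code.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold target is within the top-k ranked candidates.

    ranked_ids : one ranked candidate list per query, best first
    gold_ids   : one gold target id per query (single-target assumption)
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need exactly one gold id per query")
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

# Toy example with three queries (made-up ids).
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
gold = ["b", "f", "z"]

r_at_1 = recall_at_k(ranked, gold, 1)
r_at_3 = recall_at_k(ranked, gold, 3)
print(r_at_1, r_at_3)
```

Because the metric is a hit rate at a cutoff, a change to the training loss can raise it at one K while lowering it at another, which is the kind of trade-off the reviewer points to.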

Code & Models

Repositories

friedrichor/UNITE
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies

Methods: Contrastive Learning