Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval
Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Victoria W., Fuzheng Zhang, Guorui Zhou

TL;DR
This paper introduces UNITE, a comprehensive framework for multimodal information retrieval that emphasizes data curation and modality-aware training to improve cross-modal representations and achieve state-of-the-art results.
Contribution
The work presents the first systematic analysis of modality-specific data properties and introduces MAMCL, a novel contrastive learning method for better cross-modal alignment.
Findings
Achieves state-of-the-art results on multiple MIR benchmarks.
Demonstrates the importance of modality curation and tailored training protocols.
Provides a foundational blueprint for future multimodal research.
Abstract
Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal gaps in feature spaces, a systematic approach to address these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among the instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks,…
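The abstract does not spell out the MAMCL objective. As a rough illustration only, the sketch below shows one plausible reading: an InfoNCE-style loss whose denominator is restricted by a modality mask, so that candidates of other modalities do not compete with the positive. The masking rule (keep only candidates that share the positive's modality), the function names, and the default temperature are assumptions made for this example, not details taken from the paper.

```python
import math


def masked_info_nce(sims, modalities, pos_index, temperature=0.05):
    """InfoNCE-style loss for one query with a modality mask on the candidates.

    sims:        similarity scores between the query and each candidate.
    modalities:  modality label of each candidate, e.g. "text", "image", "video".
    pos_index:   index of the positive candidate.
    temperature: softmax temperature (hypothetical default).
    """
    target_modality = modalities[pos_index]
    # Assumption: only candidates sharing the positive's modality stay in the
    # denominator; the positive itself always satisfies this condition.
    kept = [s / temperature for s, m in zip(sims, modalities) if m == target_modality]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(kept)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in kept))
    return log_denominator - sims[pos_index] / temperature


def info_nce(sims, pos_index, temperature=0.05):
    """Unmasked baseline: every candidate competes with the positive."""
    return masked_info_nce(sims, ["any"] * len(sims), pos_index, temperature)


# Toy batch: the positive is a video; one hard negative is an image.
sims = [0.8, 0.3, 0.7, 0.1]
modalities = ["video", "video", "image", "text"]
masked_loss = masked_info_nce(sims, modalities, pos_index=0)
plain_loss = info_nce(sims, pos_index=0)
```

Because the masked denominator sums over a subset of the candidates, the masked loss for a given query is never larger than the unmasked one on the same scores. Whether that helps retrieval quality is an empirical question this sketch does not answer.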
Peer Reviews
Decision: Submitted to ICLR 2026
* Comprehensive evaluation across datasets and tasks.
* Simple yet effective strategy.
* Good experimental results that indicate the effectiveness of the proposed strategy.
* Generally speaking, the study seems well designed and well conducted.

* Writing style is too distracting. Every time the paper indicates that something is critical or crucial, it is not.
* The contribution seems incremental, with tweaks to existing models and based primarily on data curation.
* The paper reports results that it says contradict previous observations, but no reference results or citations are provided.
* The interpretation of the results and insights is limited to highlighting numeric differences.
The paper introduces a novel Modality-Aware Masked Contrastive Learning (MAMCL) approach that extends contrastive learning to better accommodate heterogeneous modalities. The framework's ability to integrate video retrieval alongside text and image retrieval broadens its applicability and underscores the model's versatility in handling complex multimodal data. Comprehensive evaluations across diverse multimodal benchmarks substantiate the benefits of both MAMCL and the modality-aware data design
While the paper presents a well-motivated and empirically supported approach, several aspects could be strengthened to enhance its clarity and overall impact.
1. Frozen projector and vision encoder: The authors freeze the projector and vision encoder but do not analyze the implications of this choice. It remains unclear how fine-tuning these components, particularly during instruction tuning, might affect multimodal alignment and retrieval performance. An ablation study comparing frozen versus tr
1. Broad generalization and strong performance in video retrieval: the proposed UNITE models perform strongly across various retrieval scenarios, tasks, and granularities. On WebVid-CoVR, UNITE_instruct-7B exceeds baselines under their reported settings.
2. Proper ablations: The paper includes a dedicated MAMCL ablation (Table 7) and a full training-data composition analysis (TT/TI/TV mix, under a fixed data budget).
1. Marginal performance of the MAMCL component: while MAMCL is conceptually sound, its average gains are small (about +0.3 overall on MMEB, and +0.5 on average on WebVid-CoVR with 7B parameters), and it can trade off specific metrics (e.g., CoVR R@5 at 7B). I recommend deeper analysis of when and why it helps.
2. Lack of efficiency analysis: MAMCL changes the effective negative set via a modality mask, but the paper does not report compute comparisons to standard InfoNCE; only high-level training setup (e.g
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Contrastive Learning
