Multimodal Data Curation Through Ranked Retrieval
Pratyush Muthukumar, Harshil Kotamreddy, Sarah Amiraslani, Tomo Kanazawa, Ramani Akkati, Shaan Jain, Andrew Mathau

TL;DR
This paper introduces a framework to improve multimodal data curation and retrieval by refining training pairs and combining multiple embedding models, significantly reducing modality bias and enhancing cross-modal alignment.
Contribution
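It presents two methods: Symmetric Nucleus Subsampling (SNS), which refines training pairs by trimming inputs and annotations to the portions that best support each other, and Expert Embedding Engine (EEE), which combines complementary embedding experts through a learned projection network to improve multimodal retrieval accuracy.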
Findings
Reduces the modality gap by over 90% on average.
Outperforms stratified sampling and traditional curation in downstream tasks.
Enhances cross-modal retrieval and data curation quality.
Abstract
Shared embedding spaces are widely used for multimodal search and data curation. In practice, two problems often limit how well this works. First, embeddings can reflect modality more than meaning, so examples cluster by input type even when the underlying content matches. Second, the paired supervision used to train these spaces is often noisy. When we blend many heterogeneous, human-labeled datasets, these issues reinforce each other and degrade cross-modal retrieval. We present a framework that improves alignment by acting on both the training pairs and the embedding model. Symmetric Nucleus Subsampling (SNS) refines training pairs by trimming raw inputs and annotations to the portions that best support each other. Expert Embedding Engine (EEE) combines complementary embedding experts using a learned projection network, together with a bias-aware objective that reduces…
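The idea that embeddings "cluster by input type" can be made concrete with a small sketch. The snippet below uses one common way to quantify a modality gap: the Euclidean distance between the centroids of L2-normalized embeddings from two modalities. This is an assumption for illustration; the abstract does not state which metric the paper uses, and the function names (`modality_gap`, `relative_reduction`) and the toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def modality_gap(emb_a, emb_b):
    """Distance between the centroids of two sets of normalized embeddings.

    Assumed metric for illustration only, not necessarily the paper's.
    """
    ca = centroid([l2_normalize(v) for v in emb_a])
    cb = centroid([l2_normalize(v) for v in emb_b])
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


def relative_reduction(gap_before, gap_after):
    """Fraction of the original gap that was removed (1.0 means fully closed)."""
    return (gap_before - gap_after) / gap_before


# Toy data: each modality sits on its own axis, so content is ignored
# and the two sets separate purely by input type.
image_emb = [[1.0, 0.0], [1.0, 0.0]]
text_emb = [[0.0, 1.0], [0.0, 1.0]]
gap_before = modality_gap(image_emb, text_emb)

# Hypothetical "aligned" text embeddings that land on the image centroid.
aligned_text_emb = [[1.0, 0.0], [1.0, 0.0]]
gap_after = modality_gap(image_emb, aligned_text_emb)

print(gap_before, gap_after, relative_reduction(gap_before, gap_after))
```

Under this assumed metric, a claim such as "reduces the modality gap by over 90%" would correspond to `relative_reduction(...)` returning a value above 0.9.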
