TL;DR
UniMoCo introduces a modality-completion approach for multi-modal embeddings, generating visual features from text to improve robustness across diverse modality combinations in vision-language tasks.
Contribution
UniMoCo proposes an architecture that pairs a modality-completion module with an embedding-alignment training strategy, yielding unified and robust multi-modal embeddings during both training and inference.
Findings
Outperforms previous methods on multi-modal retrieval tasks.
Demonstrates robustness across diverse modality combinations.
Mitigates bias caused by imbalanced training data.
Abstract
Current vision-language models have been explored for multi-modal embedding tasks like information retrieval. However, they face significant challenges in real-world queries and targets involving diverse modality combinations, as existing approaches often fail to align all modality combinations within a unified embedding space during training, leading to degraded performance on rare modality patterns during inference. To address this fundamental limitation, we propose UniMoCo, a novel architecture featuring a modality-completion module that generates visual features from text, thereby ensuring modality completeness for both queries and targets. Additionally, UniMoCo incorporates a specialized training strategy that aligns embeddings from both original and modality-completed inputs, thus ensuring consistent and robust embeddings for diverse modality combinations. Comprehensive…
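The two ideas the abstract describes, completing a missing visual modality from text and aligning original with modality-completed embeddings, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the completion module is stood in for by a plain linear projection (`proj`), the encoder by simple feature averaging (`embed`), and the names `complete_modality`, `alignment_loss`, `training_loss`, and `align_weight` are hypothetical. The contrastive term is a standard InfoNCE loss, which the abstract does not confirm as the paper's objective.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def complete_modality(text_feat, proj):
    # Hypothetical completion module: a linear map from text features
    # to pseudo-visual features (assumed to have the same dimension).
    return [dot(row, text_feat) for row in proj]


def embed(text_feat, visual_feat):
    # Stand-in for a vision-language encoder: average both modalities,
    # then L2-normalize. This is an assumption, not the paper's encoder.
    return l2_normalize([(t + v) / 2 for t, v in zip(text_feat, visual_feat)])


def info_nce(queries, targets, temperature=0.05):
    # Standard in-batch InfoNCE: the i-th target is the positive for the
    # i-th query; all other targets act as negatives.
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in targets]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def alignment_loss(original, completed):
    # Mean (1 - cosine similarity) between original and completed embeddings.
    return sum(1.0 - dot(o, c) for o, c in zip(original, completed)) / len(original)


def training_loss(query_texts, target_texts, proj, align_weight=1.0):
    # Text-only queries embedded as-is, and again after modality completion.
    q_orig = [l2_normalize(t) for t in query_texts]
    q_comp = [embed(t, complete_modality(t, proj)) for t in query_texts]
    t_comp = [embed(t, complete_modality(t, proj)) for t in target_texts]
    return info_nce(q_comp, t_comp) + align_weight * alignment_loss(q_orig, q_comp)
```

In this sketch, an identity `proj` makes the completed embedding coincide with the original one, so the alignment term vanishes and only the contrastive term remains. A learned completion module would have to earn that agreement through training.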
