Weakly supervised cross-domain alignment with optimal transport
Siyang Yuan, Ke Bai, Liqun Chen, Yizhe Zhang, Chenyang Tao, Chunyuan Li, Guoyin Wang, Ricardo Henao, Lawrence Carin

TL;DR
This paper introduces a weakly supervised method that uses optimal transport to improve fine-grained cross-domain alignment between images and text, allowing simpler models to reach competitive performance.
Contribution
It proposes a novel optimal transport-based regularizer for cross-domain alignment that is efficient to compute and can be combined with existing models, advancing weakly supervised vision-language tasks.
Findings
Outperforms state-of-the-art methods on vision-language benchmarks
Enables simpler models to achieve competitive results
Demonstrates efficiency and effectiveness of OT regularization
Abstract
Cross-domain alignment between image objects and text sequences is key to many visual-language tasks, and it poses a fundamental challenge to both computer vision and natural language processing. This paper investigates a novel approach for the identification and optimization of fine-grained semantic similarities between image and text entities, under a weakly-supervised setup, improving performance over state-of-the-art solutions. Our method builds upon recent advances in optimal transport (OT) to resolve the cross-domain matching problem in a principled manner. Formulated as a drop-in regularizer, the proposed OT solution can be efficiently computed and used in combination with other existing approaches. We present empirical evidence to demonstrate the effectiveness of our approach, showing how it enables simpler model architectures to outperform or be comparable with more…
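As a rough illustration of what an OT-based alignment regularizer of the kind described in the abstract could look like, the sketch below computes a transport plan between a set of image-region features and a set of word features, then returns the resulting transport cost as a scalar penalty. Everything specific here is an assumption made for illustration and is not taken from the paper: the cosine cost, the uniform marginals, the entropic Sinkhorn solver, and the names `cosine_cost`, `sinkhorn_plan`, and `ot_regularizer` are hypothetical stand-ins, and the authors' actual solver and cost may differ.

```python
import numpy as np


def cosine_cost(x, y, eps=1e-8):
    """Pairwise cosine distance between rows of x (n, d) and rows of y (m, d)."""
    x_n = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    y_n = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    return 1.0 - x_n @ y_n.T


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (assumed, not from the paper)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)       # uniform mass over image regions
    b = np.full(m, 1.0 / m)       # uniform mass over words
    K = np.exp(-cost / reg)       # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)         # scale columns toward marginal b
        u = a / (K @ v)           # scale rows toward marginal a
    return u[:, None] * K * v[None, :]


def ot_regularizer(image_feats, text_feats, reg=0.1, n_iters=200):
    """Return (transport cost, plan); the cost can be added to a task loss."""
    cost = cosine_cost(image_feats, text_feats)
    plan = sinkhorn_plan(cost, reg, n_iters)
    return float(np.sum(plan * cost)), plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(5, 8))   # 5 hypothetical region embeddings
    words = rng.normal(size=(7, 8))     # 7 hypothetical word embeddings
    penalty, plan = ot_regularizer(regions, words)
    print("OT penalty:", penalty)
    print("plan shape:", plan.shape)
```

In a training setup, a penalty like this would typically be weighted and added to the main task loss, and the plan entries can be read as soft region-to-word matches. A differentiable framework would be needed to backpropagate through it; NumPy is used here only to keep the sketch self-contained.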
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
