Learning Image-Text Matching with Optimal Partial Transport
Zhengxin Pan, Haishuai Wang, Fangyu Wu, Bailing Zhang, Jiajun Bu, and Hongyang Chen

TL;DR
This paper introduces OMIT, a novel image-text matching network based on Optimal Transport and partial matching, achieving high accuracy and efficiency in cross-modal retrieval tasks.
Contribution
OMIT uniquely integrates Optimal Transport with partial matching to improve semantic alignment and efficiency in image-text matching.
Findings
OMIT outperforms existing methods on Flickr30K and MS-COCO datasets.
OMIT effectively balances performance and computational efficiency.
Visualization confirms OMIT's focus on relevant fragment alignments.
Abstract
Cross-modal matching, a fundamental task in bridging vision and language, has recently garnered substantial research interest. Despite the development of numerous methods aimed at quantifying the semantic relatedness between image-text pairs, these methods often fall short of achieving both outstanding performance and high efficiency. In this paper, we propose the crOss-Modal sInkhorn maTching (OMIT) network as a solution that improves performance while maintaining efficiency. Rooted in the theoretical foundations of Optimal Transport, OMIT harnesses the capabilities of Cross-modal Mover's Distance to precisely compute the similarity between fine-grained visual and textual fragments, utilizing Sinkhorn iterations for efficient approximation. To further alleviate the issue of redundant alignments, we seamlessly integrate partial matching into OMIT, leveraging…
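To make the Sinkhorn step in the abstract more concrete, here is a minimal sketch of generic entropy-regularized optimal transport between image-region and word features. It is not the authors' implementation. The function name `sinkhorn_similarity`, the cosine-based cost, the uniform marginals, and the values of `eps` and `n_iters` are all assumptions chosen for illustration, and the partial-matching extension is not covered.

```python
import numpy as np

def sinkhorn_similarity(img_feats, txt_feats, eps=0.1, n_iters=50):
    """Illustrative entropic OT between two fragment sets (hypothetical helper).

    img_feats: (n, d) array of image-region features.
    txt_feats: (m, d) array of word features.
    Returns the transport plan (n, m) and a scalar similarity score.
    """
    # L2-normalize so that dot products are cosine similarities.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = img @ txt.T            # (n, m) cosine similarity
    cost = 1.0 - sim             # assumed ground cost: cosine distance

    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # assumed uniform mass over regions
    b = np.full(m, 1.0 / m)      # assumed uniform mass over words

    K = np.exp(-cost / eps)      # Gibbs kernel of the regularized problem
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (K @ v)
        v = b / (K.T @ u)

    plan = u[:, None] * K * v[None, :]     # approximate transport plan
    score = float((plan * sim).sum())      # plan-weighted similarity
    return plan, score
```

For example, calling `sinkhorn_similarity` on a 36 x 1024 region matrix and a 12 x 1024 word matrix returns a 36 x 12 plan whose entries indicate how much each region is aligned to each word. A smaller `eps` gives a sharper plan but can underflow in `np.exp`, so a log-domain variant is usually preferred in practice.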
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques
