Learning Image-Text Matching with Optimal Partial Transport

Zhengxin Pan, Haishuai Wang, Fangyu Wu, Bailing Zhang, Jiajun Bu, and Hongyang Chen

arXiv:2603.14349 · cs.IR · March 17, 2026

Open Access

TL;DR

This paper introduces OMIT, a novel image-text matching network based on Optimal Transport and partial matching, achieving high accuracy and efficiency in cross-modal retrieval tasks.

Contribution

OMIT uniquely integrates Optimal Transport with partial matching to improve semantic alignment and efficiency in image-text matching.

Findings

1. OMIT outperforms existing methods on the Flickr30K and MS-COCO datasets.
2. OMIT effectively balances performance and computational efficiency.
3. Visualization confirms OMIT's focus on relevant fragment alignments.

Abstract

Cross-modal matching, a fundamental task in bridging vision and language, has recently garnered substantial research interest. Although numerous methods have been developed to quantify the semantic relatedness of image-text pairs, they often fail to achieve both outstanding performance and high efficiency. In this paper, we propose the crOss-Modal sInkhorn maTching (OMIT) network, which improves performance while maintaining efficiency. Rooted in the theoretical foundations of Optimal Transport, OMIT uses the Cross-modal Mover's Distance to precisely compute the similarity between fine-grained visual and textual fragments, with Sinkhorn iterations providing an efficient approximation. To further alleviate the issue of redundant alignments, we seamlessly integrate partial matching into OMIT, leveraging…
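The abstract does not spell out how Sinkhorn iterations approximate a transport-based similarity, so the following is a minimal sketch of generic entropy-regularized Optimal Transport between image-region and word features. It is not the OMIT implementation. The function names (`cosine`, `sinkhorn_plan`, `ot_similarity`), the cost `1 - cosine similarity`, the uniform fragment masses, and the values of `eps` and `n_iters` are all assumptions made for illustration, and the partial-matching step is deliberately left out because the abstract is truncated before describing it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Approximate an entropy-regularized transport plan for a cost matrix.

    Assumption: every fragment carries the same mass (uniform marginals).
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each image fragment (assumed uniform)
    b = [1.0 / m] * m  # mass on each text fragment (assumed uniform)
    # Gibbs kernel: low cost gives a large kernel value.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_similarity(regions, words, eps=0.1, n_iters=200):
    """Image-text score: fragment similarities weighted by the transport plan."""
    sim = [[cosine(r, w) for w in words] for r in regions]
    cost = [[1.0 - s for s in row] for row in sim]  # assumed cost function
    plan = sinkhorn_plan(cost, eps, n_iters)
    return sum(
        plan[i][j] * sim[i][j]
        for i in range(len(regions))
        for j in range(len(words))
    )


if __name__ == "__main__":
    # Toy 2-D features (hypothetical): each region has one matching word.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    words = [[1.0, 0.0], [0.0, 1.0]]
    print(ot_similarity(regions, words))
```

In this sketch a smaller `eps` concentrates the plan on the best-matching fragment pairs at the cost of numerical stability, while a larger `eps` spreads mass more evenly, which is the usual trade-off of entropic regularization.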

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques