Ambiguity-Aware and High-Order Relation Learning for Multi-Grained Image-Text Matching

Junyu Chen; Yihua Gao; Mingyuan Ge; Mingyong Li

arXiv:2507.09256·cs.CV·July 15, 2025

TL;DR

This paper introduces AAHR, a novel framework that enhances multi-grained image-text matching by addressing semantic ambiguities and leveraging high-order relations through dynamic clustering, GNNs, and contrastive learning.

Contribution

The paper proposes a unified representation space and relation learning strategies that improve semantic understanding and discrimination in image-text matching tasks.

Findings

01

Outperforms state-of-the-art methods on the Flickr30K, MSCOCO, and ECCV Caption datasets.

02

Significantly improves matching accuracy and efficiency.

03

Effectively mitigates semantic ambiguities and leverages high-order relations.

Abstract

Image-text matching is crucial for bridging the semantic gap between computer vision and natural language processing. However, existing methods still face challenges in handling high-order associations and semantic ambiguities among similar instances. These ambiguities arise from subtle differences between soft positive samples (semantically similar but incorrectly labeled) and soft negative samples (locally matched but globally inconsistent), creating matching uncertainties. Furthermore, current methods fail to fully utilize the neighborhood relationships among semantically similar instances within training batches, limiting the model's ability to learn high-order shared knowledge. This paper proposes the Ambiguity-Aware and High-order Relation learning framework (AAHR) to address these issues. AAHR constructs a unified representation space through dynamic clustering prototype…
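
The abstract is cut off before the method details, so the following is not AAHR's implementation. It is only a minimal NumPy sketch of two generic ideas the summary names: using neighborhood relationships among similar instances within a batch, and contrastive learning over matched image-text pairs. Everything here is an assumption made for illustration: the function names (`knn_adjacency`, `aggregate`, `info_nce`), the k-nearest-neighbor graph, the parameter-free averaging step standing in for a GNN layer, and the temperature value.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def knn_adjacency(feats, k):
    """Row-normalized adjacency linking each sample to itself and its k most
    similar in-batch neighbors (hypothetical stand-in for a learned graph)."""
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)           # exclude self when ranking neighbors
    idx = np.argsort(-sim, axis=1)[:, :k]    # indices of the top-k neighbors per row
    adj = np.eye(len(feats))                 # self-loops
    rows = np.repeat(np.arange(len(feats)), k)
    adj[rows, idx.ravel()] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)

def aggregate(feats, k=2):
    """One parameter-free message-passing step: average each sample with its
    neighbors. A real GNN layer would add learned weights (assumption)."""
    feats = l2_normalize(feats)
    return l2_normalize(knn_adjacency(feats, k) @ feats)

def info_nce(img, txt, temperature=0.07):
    """Symmetric in-batch contrastive loss; row i of img matches row i of txt."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy batch: text features are noisy copies of the image features.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = img + 0.1 * rng.normal(size=(8, 16))
loss = info_nce(aggregate(img), aggregate(txt))
```

In this toy setup the loss is low when rows are correctly paired and rises when the text rows are shuffled. That is the behavior a contrastive objective is meant to reward. Handling soft positives and soft negatives, as the abstract describes, would need more than this hard one-to-one pairing.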
