A New Fine-grained Alignment Method for Image-text Matching
Yang Zhang

TL;DR
This paper introduces CPFEAN, a novel alignment method for image-text retrieval that emphasizes prominent segments and reduces irrelevant regions, leading to significant accuracy improvements.
Contribution
The paper proposes a new alignment approach that strengthens the matching of prominent fragments and incorporates prior textual information, outperforming existing methods.
Findings
Outperforms state-of-the-art methods by 5-10% on the rSum metric.
Effectively reduces irrelevant region influence during alignment.
Improves retrieval accuracy on MS-COCO and Flickr30K datasets.
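For readers unfamiliar with rSum: it is conventionally the sum of Recall@1, Recall@5, and Recall@10 in both retrieval directions (image-to-text and text-to-image), so its maximum is 600. The sketch below is a minimal illustration, not the paper's evaluation code. It assumes a square similarity matrix in which image i is paired with exactly one caption i; the standard MS-COCO and Flickr30K protocols pair each image with five captions and need a slightly different rank computation. The function names `recall_at_k` and `rsum` are hypothetical.

```python
import numpy as np

def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth match ranks in the top k.

    sim[i, j] is the similarity of query i to candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    n = sim.shape[0]
    # Candidates sorted from most to least similar for each query.
    order = np.argsort(-sim, axis=1)
    # Position of the ground-truth candidate in each sorted row.
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1)
    return 100.0 * np.mean(ranks < k)

def rsum(sim, ks=(1, 5, 10)):
    """Sum of Recall@K over both retrieval directions."""
    image_to_text = sum(recall_at_k(sim, k) for k in ks)
    text_to_image = sum(recall_at_k(sim.T, k) for k in ks)
    return image_to_text + text_to_image

if __name__ == "__main__":
    # Toy similarity matrix: rows are images, columns are captions.
    toy = np.array([[0.9, 0.1, 0.0],
                    [0.8, 0.5, 0.1],
                    [0.1, 0.2, 0.7]])
    print("R@1 (image-to-text):", recall_at_k(toy, 1))
    print("rSum:", rsum(toy))
```

Because rSum aggregates six recall values, a gain reported in rSum points is spread across all of them and should not be read as a gain of that size in any single Recall@K.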
Abstract
Image-text retrieval, whose core task is to measure the similarity between images and text, is a widely studied topic in computer vision owing to the exponential growth of multimedia data. However, most existing retrieval methods rely heavily on cross-attention mechanisms for cross-modal fine-grained alignment, which take into account excessive irrelevant regions and treat prominent and non-significant words equally, thereby limiting retrieval accuracy. This paper aims to investigate an alignment approach that reduces the involvement of non-significant fragments in images and text while enhancing the alignment of prominent segments. For this purpose, we introduce the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which achieves improved retrieval accuracy by diminishing the participation of irrelevant regions during alignment and relatively…
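To make the criticized baseline concrete, the sketch below shows a generic text-to-image cross-attention similarity of the kind the abstract refers to. It is not CPFEAN, whose details are not given here. The feature shapes, the temperature value, and the function names are assumptions chosen for illustration. Note the final mean over words: every word contributes equally to the score, and every region receives some attention weight, which are the two limitations the abstract points to.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_attention_similarity(regions, words, temperature=9.0):
    """Generic cross-attention image-text score (illustrative baseline).

    regions: (R, D) image region features; words: (W, D) word features.
    The temperature of 9.0 is an assumed value, not taken from the paper.
    """
    regions = l2norm(regions)
    words = l2norm(words)
    # Cosine similarity between every word and every region: (W, R).
    cos = words @ regions.T
    # Clamp negative similarities to zero before the softmax.
    cos = np.maximum(cos, 0.0)
    logits = temperature * cos
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    # Each word attends over all regions, relevant or not.
    attended = attn @ regions  # (W, D)
    word_scores = np.sum(l2norm(attended) * words, axis=1)
    # Uniform average: prominent and minor words weigh the same.
    return float(word_scores.mean())

if __name__ == "__main__":
    regions = np.eye(3)
    print("matched:   ", cross_attention_similarity(regions, np.eye(3)))
    print("mismatched:", cross_attention_similarity(regions, -np.eye(3)))
```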
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
