Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, Haijun Shan, Xuanjing Huang, Jianqing Fan

TL;DR
This paper introduces phrase-level semantic supervision for image-text retrieval, enhancing the identification of mismatched units by constructing multi-grained labels and employing a specialized transformer model.
Contribution
It proposes a novel multi-grained supervision framework and a Semantic Structure Aware Multimodal Transformer to improve image-text matching accuracy.
Findings
Improved retrieval performance on MS-COCO and Flickr30K datasets.
Effective phrase-level mismatch penalization enhances model precision.
Multi-scale matching losses benefit multi-grain semantic alignment.
Abstract
Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, semantic mismatch between an image and sentences usually happens at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. In order to integrate sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention…
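The label-construction step described in the abstract (entities and triples taken from the text scene graphs of the matched sentences) can be illustrated with a minimal sketch. It assumes the scene graphs have already been produced by some parser; the `SceneGraph` and `Triple` classes and the `phrase_labels` function are hypothetical names introduced here for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Triple:
    """One (subject, relation, object) edge of a text scene graph (hypothetical type)."""
    subject: str
    relation: str
    obj: str

    def as_phrase(self) -> str:
        # Flatten the triple into a short phrase, e.g. "man rides horse".
        return f"{self.subject} {self.relation} {self.obj}"


@dataclass
class SceneGraph:
    """Scene graph of a single matched sentence, assumed to be pre-parsed."""
    entities: List[str] = field(default_factory=list)
    triples: List[Triple] = field(default_factory=list)


def phrase_labels(graphs: List[SceneGraph]) -> Dict[str, List[str]]:
    """Collect de-duplicated entity and triple phrases across all matched sentences.

    dict.fromkeys keeps first-seen order while removing duplicates.
    """
    entities = list(dict.fromkeys(e for g in graphs for e in g.entities))
    triples = list(dict.fromkeys(t.as_phrase() for g in graphs for t in g.triples))
    return {"entities": entities, "triples": triples}


if __name__ == "__main__":
    # Two captions matched to the same image, already parsed into graphs.
    graphs = [
        SceneGraph(["man", "horse"], [Triple("man", "rides", "horse")]),
        SceneGraph(["man", "hat"], [Triple("man", "wears", "hat")]),
    ]
    print(phrase_labels(graphs))
```

In this sketch the phrase-level labels for an image are the union of the entities and triples from all of its matched captions; how the paper itself samples mismatched phrases or feeds these labels to SSAMT is not covered by the truncated abstract and is left out here.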
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Label Smoothing · Residual Connection
