Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval

Zhihao Fan; Zhongyu Wei; Zejun Li; Siyuan Wang; Haijun Shan; Xuanjing Huang; Jianqing Fan

arXiv:2109.05523·cs.CV·September 14, 2021·1 cites

Open Access

TL;DR

This paper introduces phrase-level semantic supervision for image-text retrieval, enhancing the identification of mismatched units by constructing multi-grained labels and employing a specialized transformer model.

Contribution

It proposes a novel multi-grained supervision framework and a Semantic Structure Aware Multimodal Transformer to improve image-text matching accuracy.

Findings

01. Improved retrieval performance on MS-COCO and Flickr30K datasets.

02. Effective phrase-level mismatch penalization enhances model precision.

03. Multi-scale matching losses benefit multi-grain semantic alignment.

Abstract

Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched from mismatched sentences for a query image. However, semantic mismatch between an image and a sentence usually occurs at a finer grain, i.e., at the phrase level. In this paper, we introduce additional phrase-level supervision to better identify mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate both sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention…
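As a rough illustration of the label construction the abstract describes, the Python sketch below gathers entities and triples from the scene graphs of matched sentences and pairs them with the sentence-level labels. It is only a sketch under stated assumptions: the scene graphs are written by hand rather than produced by a parser, and the `SceneGraph` class and both function names are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class SceneGraph:
    """Hypothetical container for one sentence's text scene graph."""
    entities: Tuple[str, ...]
    triples: Tuple[Tuple[str, str, str], ...]  # (subject, relation, object)


def phrase_labels(graphs: Iterable[SceneGraph]) -> Set[str]:
    """Collect entity and triple phrases across all matched sentences."""
    labels: Set[str] = set()
    for graph in graphs:
        labels.update(graph.entities)
        labels.update(" ".join(triple) for triple in graph.triples)
    return labels


def multi_grained_labels(
    sentences: List[str], graphs: List[SceneGraph]
) -> Dict[str, List[str]]:
    """Pair sentence-level labels with phrase-level labels for one image."""
    return {
        "sentence": list(sentences),
        "phrase": sorted(phrase_labels(graphs)),
    }


# Hand-written example: two captions matched to the same (imaginary) image.
sentences = ["A man rides a horse.", "A horse stands in a field."]
graphs = [
    SceneGraph(entities=("man", "horse"),
               triples=(("man", "rides", "horse"),)),
    SceneGraph(entities=("horse", "field"),
               triples=(("horse", "stands in", "field"),)),
]
labels = multi_grained_labels(sentences, graphs)
print(labels["phrase"])
```

Using a set means an entity shared by several captions, such as "horse" here, yields a single phrase-level label for the image.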

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Label Smoothing · Residual Connection