Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval

Xuri Ge; Fuhai Chen; Joemon M. Jose; Zhilong Ji; Zhongqin Wu; Xiao Liu

arXiv:2108.02417 · cs.CV · August 6, 2021

TL;DR

This paper introduces SMFEA, a structured multi-modal embedding and alignment model that explicitly models semantic and structural relationships within and across the image and text modalities, and reports improved image-sentence retrieval performance on Microsoft COCO and Flickr30K.

Contribution

The paper proposes a new SMFEA model with shared structured tree encoders that explicitly align visual and textual structures, enhancing cross-modal semantic understanding.

Findings

01

Outperforms state-of-the-art on Microsoft COCO and Flickr30K datasets

02

Effectively models intra- and inter-modal structural relationships

03

Improves retrieval accuracy through explicit structural alignment

Abstract

Current state-of-the-art image-sentence retrieval methods implicitly align visual-textual fragments, such as regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory due to a lack of consistent representation in both the semantic and structural spaces. In this work, we address this issue from two aspects: (i) constructing the intrinsic structure (along with relations) among the fragments of each modality, e.g., "dog $\to$ play $\to$ ball" in the semantic structure of an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. To this end, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. In…
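
As a rough illustration of the two ideas in the abstract (a per-modality structure such as "dog → play → ball", and an explicit node-to-node comparison across modalities), the toy sketch below may help. It is not the SMFEA model: the role names, the two-dimensional vectors, and the `structural_alignment` helper are all assumptions made for illustration, and plain cosine similarity stands in for whatever alignment objective the paper actually uses.

```python
import math

# Hypothetical roles for a three-node structure like "dog -> play -> ball".
ROLES = ("subject", "relation", "object")


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def structural_alignment(visual, textual):
    """Mean cosine similarity between nodes that share the same role.

    Assumption: both structures use the same role keys, so nodes can be
    compared position by position instead of through implicit attention.
    """
    return sum(cosine(visual[r], textual[r]) for r in ROLES) / len(ROLES)


# Toy embeddings (made up): one image structure and two candidate sentences.
image_tree = {"subject": [1.0, 0.0], "relation": [0.0, 1.0], "object": [1.0, 1.0]}
matching_sentence = {"subject": [2.0, 0.0], "relation": [0.0, 3.0], "object": [2.0, 2.0]}
mismatched_sentence = {"subject": [0.0, 1.0], "relation": [1.0, 0.0], "object": [1.0, -1.0]}

match_score = structural_alignment(image_tree, matching_sentence)
mismatch_score = structural_alignment(image_tree, mismatched_sentence)
print(match_score > mismatch_score)
```

In this toy setup the sentence whose nodes point in the same directions as the image nodes scores higher, which is the intuition behind ranking candidates by explicit structural agreement.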
