Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval
Xuri Ge, Fuhai Chen, Joemon M. Jose, Zhilong Ji, Zhongqin Wu, Xiao Liu

TL;DR
This paper introduces a structured multi-modal embedding and alignment model that explicitly captures semantic and structural relationships within and across the image and text modalities, improving image-sentence retrieval performance.
Contribution
The paper proposes SMFEA (Structured Multi-modal Feature Embedding and Alignment), a model with shared structured tree encoders that explicitly align visual and textual structures to enhance cross-modal semantic understanding.
Findings
Outperforms state-of-the-art methods on the Microsoft COCO and Flickr30K datasets
Effectively models intra- and inter-modal structural relationships
Improves retrieval accuracy through explicit structural alignment
Abstract
Current state-of-the-art image-sentence retrieval methods implicitly align visual-textual fragments, such as regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, retrieval performance remains unsatisfactory due to the lack of a consistent representation in both the semantic and structural spaces. In this work, we propose to address this issue from two aspects: (i) constructing the intrinsic structure (along with relations) among the fragments of each modality, e.g., "dog play ball" as the semantic structure for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities. To this end, we propose a novel Structured Multi-modal Feature Embedding and Alignment (SMFEA) model for image-sentence retrieval. In…
