3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting

Xuri Ge; Songpei Xu; Fuhai Chen; Jie Wang; Guoxin Wang; Shan An; Joemon M. Jose

arXiv:2404.17273·cs.CV·April 29, 2024

TL;DR

This paper introduces 3SHNet, a novel network that enhances image-sentence retrieval by highlighting salient objects and their spatial relations, leading to superior accuracy, efficiency, and generalization on benchmark datasets.

Contribution

The paper presents 3SHNet, a new model that integrates visual semantics and spatial information for improved image-sentence retrieval, with a focus on highlighting salient regions and maintaining modality independence.

Findings

01. Achieves significant improvements on the MS-COCO and Flickr30K benchmarks.

02. Demonstrates superior inference efficiency compared to state-of-the-art methods.

03. Improves cross-dataset generalization performance by 18.6%.

Abstract

In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval. 3SHNet highlights the prominent objects and their spatial locations within the visual modality, which allows it to integrate visual semantic-spatial interactions while keeping the two modalities independent. This integration combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation, and the modality independence guarantees efficiency and generalization. Additionally, 3SHNet uses the structured contextual visual scene information from segmentation to provide local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive…
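The efficiency argument for modality independence can be illustrated with a small sketch. This is not the 3SHNet implementation: it only assumes that images and sentences are encoded separately into fixed-length vectors, so retrieval reduces to a similarity lookup over precomputed embeddings. The function names, the toy embedding values, and the one-caption-per-image pairing (item i matches query i) are all hypothetical, and Recall@K is used here as a commonly reported retrieval metric rather than as the paper's exact evaluation protocol.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every sentence embedding.

    Each side is normalized on its own, with no cross-modal interaction,
    so both sets of embeddings could be computed and cached offline.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def recall_at_k(sims, k):
    """Fraction of queries (rows) whose matching item (same index) is in the top k."""
    hits = 0
    for query_idx, row in enumerate(sims):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(sims)


# Hypothetical toy embeddings: three images and their three paired sentences.
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
text_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.7, 0.1, 0.6]]

sims = similarity_matrix(image_embs, text_embs)
print("image-to-text Recall@1:", recall_at_k(sims, 1))
```

Because neither encoder needs the other modality's input, the gallery side can be embedded once and reused for every query; a model with cross-modal attention would instead have to run on each image-sentence pair at query time.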

Code & Models

Repositories

xurige1995/3shnet (PyTorch, official)
