TL;DR
This paper introduces 3SHNet, a novel network that enhances image-sentence retrieval by highlighting salient objects and their spatial relations, leading to superior accuracy, efficiency, and generalization on benchmark datasets.
Contribution
The paper presents 3SHNet, a new model that integrates visual semantics and spatial information for improved image-sentence retrieval, with a focus on highlighting salient regions and maintaining modality independence.
Findings
Achieves significant improvements on MS-COCO and Flickr30K benchmarks.
Demonstrates superior inference efficiency compared to state-of-the-art methods.
Improves cross-dataset generalization by 18.6%.
Abstract
In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval. 3SHNet highlights prominent objects and their spatial locations within the visual modality, which allows it to integrate visual semantic-spatial interactions while keeping the two modalities independent. This integration combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation, and the modality independence guarantees efficiency and generalization. Additionally, 3SHNet uses the structured contextual visual scene information from segmentation to provide local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive…
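To make the efficiency argument behind modality independence concrete, the sketch below shows how retrieval can work when images and sentences are embedded separately: each side is encoded once, and matching reduces to a similarity lookup. This is not the authors' implementation. The function names, the toy embeddings, the choice of cosine similarity, and the use of Recall@K (a metric commonly reported for image-sentence retrieval) are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every sentence.

    Assumption: the two encoders run independently, so both lists can be
    precomputed and cached before any query arrives.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def recall_at_k(sim, k):
    """Fraction of images whose paired sentence (same index) ranks in the top k."""
    hits = 0
    for idx, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if idx in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Hypothetical toy embeddings: image i is paired with sentence i.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.6, 0.1, 0.5], [0.0, 0.2, 0.9]]

sim = similarity_matrix(image_embs, text_embs)
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

In this toy setup the second image ranks its paired sentence second rather than first, so Recall@1 is lower than Recall@2. Because no cross-modal interaction is needed at query time, adding a new sentence only requires one extra text embedding rather than re-encoding every image.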
