VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search
Shuting He, Hao Luo, Wei Jiang, Xudong Jiang, Henghui Ding

TL;DR
This paper introduces VGSG, a novel network for text-based person search that effectively aligns visual and textual features using semantic grouping and vision-guided knowledge transfer, avoiding external tools and complex interactions.
Contribution
The paper proposes a vision-guided semantic-group network with modules for implicit semantic grouping and knowledge transfer, improving cross-modal alignment in person search tasks.
Findings
VGSG outperforms state-of-the-art methods on benchmark datasets.
The Semantic-Group Textual Learning (SGTL) module improves local textual feature extraction.
The Vision-guided Knowledge Transfer (VGKT) module enhances cross-modal feature alignment without external tools.
Abstract
Text-based Person Search (TBPS) aims to retrieve images of the target pedestrian indicated by textual descriptions. It is essential for TBPS to extract fine-grained local features and align them across modalities. Existing methods rely on external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search to extract well-aligned fine-grained visual and textual features. In the proposed VGSG, we develop a Semantic-Group Textual Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to extract textual local features under the guidance of visual local clues. In SGTL, to obtain the local textual representation, we group textual features along the channel dimension based on the…
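To make the idea of grouping textual features along the channel dimension concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract is truncated before it states the grouping criterion, so the equal-width contiguous split, the average pooling over tokens, and the names `group_channels` and `pool_group` are all assumptions chosen for illustration.

```python
def group_channels(token_features, num_groups):
    """Split each token's channel vector into num_groups contiguous chunks.

    Assumption: an equal-width contiguous split; the paper's actual
    grouping criterion is not given in the text above.

    token_features: list of tokens, each a list of C floats.
    Returns num_groups groups; each group is a list of tokens holding
    C // num_groups channels.
    """
    if not token_features:
        raise ValueError("token_features must be non-empty")
    channels = len(token_features[0])
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    width = channels // num_groups
    return [
        [token[g * width:(g + 1) * width] for token in token_features]
        for g in range(num_groups)
    ]


def pool_group(group):
    """Average over tokens to get one local feature vector per group.

    Assumption: mean pooling, used here only to keep the sketch simple.
    """
    n = len(group)
    return [sum(column) / n for column in zip(*group)]


# Toy input: 2 tokens with 4 channels each (hypothetical values).
features = [[1.0, 2.0, 3.0, 4.0],
            [3.0, 4.0, 5.0, 6.0]]
groups = group_channels(features, num_groups=2)
local_features = [pool_group(g) for g in groups]
```

Each entry of `local_features` is one candidate local textual feature built from a disjoint slice of channels. In a real model the features would be tensors from a text encoder, and each group would typically pass through a learned layer rather than a plain average.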
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: ALIGN
