Enhancing Visual Representation for Text-based Person Searching
Wei Shen, Ming Fang, Yuxia Wang, Jiafeng Xiao, Diping Li, Huangqun Chen, Ling Xu, and Weifeng Zhang

TL;DR
This paper introduces VFE-TPS, a novel model that enhances visual feature understanding in text-based person search by leveraging pre-trained multimodal models and auxiliary tasks, leading to improved accuracy.
Contribution
It proposes a new approach combining CLIP with auxiliary tasks to better learn local and global visual features for person search.
Findings
Significant improvement in Rank-1 accuracy on three benchmarks.
Effective adaptation of pre-trained CLIP for detailed visual understanding.
Auxiliary tasks enhance the model's ability to learn local and global visual features.
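For readers unfamiliar with the Rank-1 metric cited above, the following is a minimal sketch of how it is commonly computed for text-to-image retrieval: each text query is scored against every gallery image by cosine similarity in the shared embedding space, and a query counts as a hit if its top-scoring image has the same person identity. The function names and the toy embeddings are hypothetical and made up for illustration; the paper's exact evaluation protocol is not shown in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank1_accuracy(text_embs, text_ids, image_embs, image_ids):
    """Fraction of text queries whose top-scoring gallery image
    carries the same identity label as the query."""
    hits = 0
    for query, query_id in zip(text_embs, text_ids):
        scores = [cosine(query, gallery) for gallery in image_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(image_ids[best] == query_id)
    return hits / len(text_embs)


# Toy 2-D embeddings (assumed values, not from the paper).
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
image_ids = [0, 1, 2]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
text_ids = [0, 1, 2]

acc = rank1_accuracy(text_embs, text_ids, image_embs, image_ids)
print(acc)
```

In this toy setup the third query lies closer to the image of identity 0 than to its own identity 2, so it counts as a miss. This is the kind of identity confusion that lowers Rank-1 accuracy.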
Abstract
Text-based person search aims to retrieve the matching pedestrians from a large-scale image database according to a text description. The core difficulty of this task is how to extract effective details from pedestrian images and texts and achieve cross-modal alignment in a common latent space. Prior works adopt image and text encoders pre-trained on unimodal data to extract global and local features from images and texts respectively, and then achieve global-local alignment explicitly. However, these approaches still lack the ability to understand visual details, and their retrieval accuracy is still limited by identity confusion. To alleviate these problems, we rethink the importance of visual features for text-based person search and propose VFE-TPS, a Visual Feature Enhanced Text-based Person Search model. It introduces a pre-trained multimodal backbone CLIP to…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods · Video Analysis and Summarization
Methods: ADaptive gradient method with the OPTimal convergence rate · Contrastive Language-Image Pre-training
