Enhancing Visual Representation for Text-based Person Searching

Wei Shen; Ming Fang; Yuxia Wang; Jiafeng Xiao; Diping Li; Huangqun Chen; Ling Xu; Weifeng Zhang

arXiv:2412.20646·cs.CV·December 31, 2024

TL;DR

This paper introduces VFE-TPS, a novel model that enhances visual feature understanding in text-based person search by leveraging pre-trained multimodal models and auxiliary tasks, leading to improved accuracy.

Contribution

It proposes a new approach combining CLIP with auxiliary tasks to better learn local and global visual features for person search.

Findings

01. Significant improvement in Rank-1 accuracy on three benchmarks.

02. Effective adaptation of pre-trained CLIP for detailed visual understanding.

03. Auxiliary tasks enhance the model's ability to learn local and global visual features.

Abstract

Text-based person search aims to retrieve the pedestrians matching a text description from a large-scale image database. The core difficulty of this task is extracting effective details from pedestrian images and texts and aligning the two modalities in a common latent space. Prior works adopt image and text encoders pre-trained on unimodal data to extract global and local features from images and text respectively, and then align the global and local features explicitly. However, these approaches still lack the ability to understand visual details, and retrieval accuracy remains limited by identity confusion. To alleviate these problems, we rethink the importance of visual features for text-based person search and propose VFE-TPS, a Visual Feature Enhanced Text-based Person Search model. It introduces a pre-trained multimodal backbone CLIP to…
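The retrieval setting described in the abstract, where a text query and gallery images are compared in a common latent space, can be illustrated with a minimal sketch. This is not the VFE-TPS implementation: the function names (`cosine`, `rank_gallery`, `rank1_accuracy`) and the toy three-dimensional embeddings are assumptions made for illustration, and a real system would obtain the embeddings from trained image and text encoders. The sketch ranks gallery images by cosine similarity to each query and computes Rank-1 accuracy, the fraction of queries whose top-ranked image has the correct identity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(text_emb, image_embs):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


def rank1_accuracy(text_embs, text_ids, image_embs, image_ids):
    """Fraction of text queries whose top-ranked image shares their identity."""
    hits = 0
    for emb, pid in zip(text_embs, text_ids):
        top = rank_gallery(emb, image_embs)[0]
        hits += int(image_ids[top] == pid)
    return hits / len(text_embs)


# Hypothetical, hand-made embeddings standing in for encoder outputs.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gallery_ids = [10, 20, 30]
queries = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.6, 0.7, 0.0]]
query_ids = [10, 30, 10]

print(rank_gallery(queries[0], gallery))
print(rank1_accuracy(queries, query_ids, gallery, gallery_ids))
```

In this toy data the third query sits closest to an image of a different identity, so it counts as a Rank-1 miss. That is the kind of identity confusion the abstract names as a limit on retrieval accuracy.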

Code & Models

Repositories

zhangweifeng1218/vfe_tps
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods · Video Analysis and Summarization

Methods: ADaptive gradient method with the OPTimal convergence rate · Contrastive Language-Image Pre-training