An Empirical Study of CLIP for Text-based Person Search
Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, Min Zhang

TL;DR
This paper conducts a comprehensive empirical study of CLIP for text-based person search, establishing a strong baseline and analyzing design choices, generalization, and compression to guide future research in this cross-modal retrieval task.
Contribution
It is the first to systematically evaluate CLIP for TBPS, providing a straightforward baseline and insights into design, generalization, and model compression.
Findings
CLIP achieves strong performance with simple design choices
Data augmentation and loss functions significantly impact results
TBPS-CLIP generalizes well and can be compressed effectively
Abstract
Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions. Recently, Contrastive Language-Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably well on various cross-modal downstream tasks thanks to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, is likewise seeing a rise in CLIP-based research. To explore the potential of the vision-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contributes a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss function. The model, with the…
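To make the retrieval setup in the abstract concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss and text-to-image ranking, written in plain Python. It is not the TBPS-CLIP implementation: the function names, the toy embedding format (lists of floats), and the temperature of 0.07 are illustrative assumptions, and the paper's actual loss functions and augmentations are not reproduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(text_embs, image_embs):
    """Cosine similarity between every text (rows) and every image (columns)."""
    texts = [l2_normalize(v) for v in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, i)) for i in images] for t in texts]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Average of text-to-image and image-to-text losses over a batch.

    Assumes text_embs[k] and image_embs[k] describe the same person.
    The temperature value is an assumption chosen for illustration.
    """
    sim = similarity_matrix(text_embs, image_embs)
    logits_t2i = [[s / temperature for s in row] for row in sim]
    logits_i2t = [list(col) for col in zip(*logits_t2i)]  # transpose
    return 0.5 * (diagonal_cross_entropy(logits_t2i)
                  + diagonal_cross_entropy(logits_i2t))


def rank_images(query_emb, image_embs):
    """Return gallery indices sorted from most to least similar to the query."""
    sims = similarity_matrix([query_emb], image_embs)[0]
    return sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
```

In this sketch, a batch whose text and image embeddings are correctly paired yields a lower loss than the same batch with the images shuffled, and `rank_images` gives the gallery ordering from which retrieval metrics such as Rank-1 would be computed.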
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
Methods: Contrastive Language-Image Pre-training
