CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification
Qianru Han, Xinwei He, Zhi Liu, Sannyuya Liu, Ying Zhang, and Jinhai Xiang

TL;DR
This paper introduces CLIP-SCGI, a framework that uses synthesized captions generated by image captioning models to improve person re-identification by guiding the learning of more discriminative visual features.
Contribution
The paper proposes a simple, effective method that combines caption synthesis with caption-guided inversion to adapt vision-language models to person ReID, and it addresses caption quality issues.
Findings
Outperforms state-of-the-art methods on four ReID benchmarks.
Effectively captures semantic attributes for robust feature learning.
Enhances discriminative power of visual representations.
Abstract
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP). However, the absence of concrete descriptions necessitates the use of implicit text embeddings, which demand complicated and inefficient training strategies. To address this issue, we first propose a straightforward solution that leverages existing image captioning models to generate pseudo captions for person images, and thereby boosts person re-identification with large vision-language models. Using models like the Large Language and Vision Assistant (LLaVA), we generate high-quality captions based on fixed templates that capture key semantic attributes such as gender, clothing, and age. By augmenting ReID training sets from uni-modality (image) to bi-modality (image and text), we introduce CLIP-SCGI, a simple yet effective…
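The augmentation step described in the abstract, turning an image-only ReID set into image and text pairs, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the template wording, the `Sample` type, the `fill_template` and `add_captions` helpers, and the `stub_captioner` that stands in for a real captioning model such as LLaVA. The paper's actual template and pipeline are not given on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical fixed template covering the attributes named in the abstract.
TEMPLATE = "The {age} {gender} is wearing {clothing}."

# Neutral fallbacks used when the captioner does not return an attribute.
DEFAULTS = {"age": "adult", "gender": "person", "clothing": "casual clothes"}


@dataclass
class Sample:
    image_path: str
    person_id: int
    caption: str = ""  # empty for the original image-only dataset


def fill_template(attrs: Dict[str, str]) -> str:
    """Fill the fixed template, using defaults for any missing attribute."""
    return TEMPLATE.format(**{**DEFAULTS, **attrs})


def add_captions(
    samples: List[Sample],
    captioner: Callable[[str], Dict[str, str]],
) -> List[Sample]:
    """Return new samples that pair each image with a synthesized caption."""
    return [
        Sample(s.image_path, s.person_id, fill_template(captioner(s.image_path)))
        for s in samples
    ]


def stub_captioner(image_path: str) -> Dict[str, str]:
    """Stand-in for a captioning model; returns fixed attributes for any image."""
    return {"age": "young", "gender": "woman", "clothing": "a red jacket and jeans"}


if __name__ == "__main__":
    image_only = [Sample("img_0001.jpg", 7), Sample("img_0002.jpg", 7)]
    for sample in add_captions(image_only, stub_captioner):
        print(sample.image_path, sample.person_id, sample.caption)
```

In practice the stub would be replaced by a call to a captioning model, and the resulting pairs would feed whatever vision-language training objective is used; the sketch only shows the data shape, not the caption-guided inversion itself.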
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Human Pose and Action Recognition
Methods: Focus · Contrastive Language-Image Pre-training
