CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification

Qianru Han, Xinwei He, Zhi Liu, Sannyuya Liu, Ying Zhang, and Jinhai Xiang

arXiv:2410.09382 · cs.CV · October 15, 2024

Open Access

TL;DR

This paper introduces CLIP-SCGI, a framework that uses synthesized captions generated by image captioning models to improve person re-identification by guiding the learning of more discriminative visual features.

Contribution

It proposes a simple, effective method that combines caption synthesis with caption-guided inversion to adapt vision-language models to person ReID while mitigating caption quality issues.

Findings

01. Outperforms state-of-the-art on four ReID benchmarks.

02. Effectively captures semantic attributes for robust feature learning.

03. Enhances discriminative power of visual representations.

Abstract

Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP). However, the absence of concrete descriptions necessitates the use of implicit text embeddings, which demand complicated and inefficient training strategies. To address this issue, we first propose a straightforward solution: leveraging existing image captioning models to generate pseudo captions for person images, thereby boosting person re-identification with large vision-language models. Using models such as the Large Language and Vision Assistant (LLaVA), we generate high-quality captions based on fixed templates that capture key semantic attributes such as gender, clothing, and age. By augmenting ReID training sets from uni-modality (image) to bi-modality (image and text), we introduce CLIP-SCGI, a simple yet effective…
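
The caption-synthesis step mentioned in the abstract (fixed templates over attributes such as gender, clothing, and age, turning an image-only training set into image-text pairs) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the template wording, the `BimodalSample` type, and the `describe` callable that stands in for a captioning model such as LLaVA. None of it is taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Mapping, Tuple

# Hypothetical fixed template; the paper's actual template is not shown here.
TEMPLATE = "A {age} {gender} wearing {upper} and {lower}."


@dataclass(frozen=True)
class BimodalSample:
    """One training sample after augmentation: image plus pseudo caption."""
    image_path: str
    person_id: int
    caption: str


def fill_template(attributes: Mapping[str, str]) -> str:
    """Render the fixed template; raises KeyError if an attribute is missing."""
    return TEMPLATE.format(**attributes)


def augment(
    samples: Iterable[Tuple[str, int]],
    describe: Callable[[str], Mapping[str, str]],
) -> List[BimodalSample]:
    """Turn (image_path, person_id) pairs into image-text samples.

    `describe` is a placeholder for a captioning model that returns
    attribute values for one image.
    """
    return [
        BimodalSample(path, pid, fill_template(describe(path)))
        for path, pid in samples
    ]


def fake_describe(image_path: str) -> Mapping[str, str]:
    """Stand-in for a real captioning model; returns fixed attributes."""
    return {
        "age": "young",
        "gender": "woman",
        "upper": "a red jacket",
        "lower": "blue jeans",
    }


if __name__ == "__main__":
    for sample in augment([("img_0001.jpg", 7)], fake_describe):
        print(sample.person_id, sample.caption)
```

In practice `fake_describe` would be replaced by a call into a real captioning model; passing it in as a callable keeps the template logic testable without loading any model weights.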

Taxonomy

Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Human Pose and Action Recognition

Methods: Focus · Contrastive Language-Image Pre-training