CLIP-Driven Fine-grained Text-Image Person Re-identification
Shuanglin Yan, Neng Dong, Liyan Zhang, Jinhui Tang

TL;DR
This paper introduces CFine, a CLIP-based framework for fine-grained text-image person re-identification (TIReID). It mines intra- and inter-modal discriminative clues without an additional feature-embedding step and reports state-of-the-art results on multiple benchmarks.
Contribution
The paper proposes a novel CLIP-driven framework with multi-grained feature learning, cross-grained refinement, and fine-grained correspondence discovery for improved TIReID.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Effectively mines intra-modal discriminative clues.
Establishes precise cross-modal correspondences.
Abstract
TIReID aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondences. Besides, due to the substantial gap between modalities, existing methods embed the original modal features into the same latent space for cross-modal alignment. However, feature embedding may lead to intra-modal information distortion. Recently, CLIP has attracted extensive attention from researchers due to its powerful semantic concept learning capacity and rich multi-modal knowledge, which can help solve the above problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we…
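The retrieval task described in the abstract (ranking a pool of candidate images against one text query) can be illustrated with a minimal sketch. This is not CFine itself: it assumes the text and image encoders have already produced vectors of the same dimensionality, and it simply ranks by cosine similarity. The function names `cosine_similarity` and `rank_gallery` and the toy vectors are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(text_embedding, image_embeddings):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for encoder outputs (hypothetical values,
# not produced by CLIP or by the paper's model).
query = [0.9, 0.1, 0.0]
gallery = [
    [0.0, 1.0, 0.0],  # candidate image 0
    [1.0, 0.2, 0.0],  # candidate image 1
    [0.5, 0.5, 0.5],  # candidate image 2
]

ranking = rank_gallery(query, gallery)
print(ranking)  # best match first
```

In a real pipeline the toy vectors would be replaced by features from the text and image encoders; the ranking step itself stays the same.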
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
Methods: Contrastive Language-Image Pre-training
