Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification
Shogo Hamano, Shunya Wakasugi, Tatsuhito Sato, Sayaka Nakamura

TL;DR
This paper introduces CG-CLIP, a caption-guided framework for high-difficulty video person re-identification, leveraging textual descriptions and learnable tokens to improve matching accuracy in challenging scenarios.
Contribution
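The paper proposes a caption-guided CLIP framework with two components: Caption-guided Memory Refinement (CMR), which refines identity-specific features using generated captions, and Token-based Feature Extraction (TFE), which aggregates spatiotemporal features for difficult ReID tasks.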
Findings
Outperforms state-of-the-art methods on multiple datasets.
Achieves significant accuracy improvements on high-difficulty datasets.
Effectively captures fine-grained details using caption-guided refinement.
Abstract
In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate…
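The token-based aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, uses random values in place of trained tokens and CLIP features, and all names (`cross_attention`, `NUM_TOKENS`, `DIM`) are hypothetical. It only shows why a fixed set of query tokens yields a fixed-size output regardless of clip length.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(tokens, features):
    """Attend from K query tokens to N key/value features.

    tokens:   K x d list of lists (queries; assumed learnable in a real model)
    features: N x d list of lists (keys and values; N may vary per clip)
    Returns (K x d outputs, K x N attention weights).
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        # Scaled dot product between this token and every feature vector.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in features]
        w = softmax(scores)
        # Weighted sum of the feature vectors.
        out = [sum(w[n] * features[n][j] for n in range(len(features)))
               for j in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
NUM_TOKENS, DIM = 4, 8  # assumed sizes, chosen only for illustration
tokens = random_matrix(NUM_TOKENS, DIM, rng)  # stand-in for learned tokens

# Stand-ins for per-patch frame features flattened over time.
short_clip = random_matrix(2 * 6, DIM, rng)  # 2 frames x 6 patches
long_clip = random_matrix(8 * 6, DIM, rng)   # 8 frames x 6 patches

short_out, short_w = cross_attention(tokens, short_clip)
long_out, long_w = cross_attention(tokens, long_clip)

print(len(short_out), len(short_out[0]))  # output shape for the short clip
print(len(long_out), len(long_out[0]))    # output shape for the long clip
```

Both clips produce a `NUM_TOKENS` x `DIM` output, so downstream matching would operate on a representation whose size does not depend on the number of frames. A real model would add learned query, key, and value projections and multiple heads.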
