Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

Shogo Hamano; Shunya Wakasugi; Tatsuhito Sato; Sayaka Nakamura

arXiv:2604.07740 · cs.CV · April 10, 2026

TL;DR

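This paper introduces CG-CLIP, a caption-guided CLIP framework for high-difficulty video-based person re-identification (ReID). It uses explicit textual descriptions and learnable tokens to improve matching in scenarios such as sports and dance performances, where multiple individuals wear similar clothing while moving dynamically.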

Contribution

The paper proposes a caption-guided CLIP framework with two components: Caption-guided Memory Refinement (CMR), which refines identity-specific features, and Token-based Feature Extraction (TFE), which handles spatiotemporal aggregation for difficult ReID tasks.

Findings

01. Outperforms state-of-the-art methods on multiple datasets.

02. Achieves significant accuracy improvements on high-difficulty datasets.

03. Effectively captures fine-grained details using caption-guided refinement.

Abstract

In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate…
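
The abstract is cut off before it states exactly what TFE aggregates, so the following is only a minimal sketch of the general idea it names: a fixed number of learnable query tokens cross-attending over a variable number of frame features. It is not the authors' implementation. The function names (`softmax`, `cross_attend`, `random_clip`), the dimensions, the single attention head, and the absence of learned key/value projections are all assumptions made for illustration; in a real model the tokens would be trained parameters rather than random values.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(tokens, features):
    """Single-head scaled dot-product cross-attention (illustrative only).

    tokens:   K x D list of query vectors (hypothetical learnable tokens).
    features: N x D list of key/value vectors (e.g. per-frame features).
    Returns (K x D aggregated outputs, K x N attention weights).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every input feature.
        scores = [scale * sum(q * f for q, f in zip(query, feat))
                  for feat in features]
        attn = softmax(scores)
        # Weighted average of the input features.
        pooled = [sum(attn[n] * features[n][d] for n in range(len(features)))
                  for d in range(dim)]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


random.seed(0)
DIM, NUM_TOKENS = 8, 4  # assumed sizes, not taken from the paper

# Stand-in for learnable tokens; random here because nothing is trained.
tokens = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
          for _ in range(NUM_TOKENS)]


def random_clip(num_frames):
    """Stand-in for per-frame features from a visual encoder."""
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)]
            for _ in range(num_frames)]


short_out, short_w = cross_attend(tokens, random_clip(5))
long_out, long_w = cross_attend(tokens, random_clip(40))

print(len(short_out), len(short_out[0]))  # 4 8
print(len(long_out), len(long_out[0]))    # 4 8
```

The point of the sketch is the shape behaviour: whether the clip has 5 or 40 feature vectors, the output is always `NUM_TOKENS` vectors of size `DIM`, which is what makes a fixed-length token set convenient for aggregating clips of varying length.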
