Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

Shuang Li; Jiaxu Leng; Changjiang Kuang; Mingpi Tan; Xinbo Gao

arXiv:2506.02439·cs.CV·June 4, 2025

TL;DR

This paper introduces VLD, a framework that uses CLIP-generated video-level language prompts together with spatiotemporal modeling to learn modality-invariant sequence-level features for visible-infrared person re-identification in videos. The authors report state-of-the-art results.

Contribution

The paper proposes a language-driven approach that combines invariant-modality language prompting (IMLP) with spatiotemporal modules to bridge the visible-infrared modality gap in video-based person re-identification.

Findings

01 Achieves state-of-the-art performance on VVI-ReID benchmarks

02 Effectively mitigates modality differences using language prompts

03 Enhances spatiotemporal feature modeling with dedicated modules

Abstract

Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal…
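To make the idea of language-guided, modality-invariant sequence features more concrete, here is a minimal sketch. It is not the authors' VLD, IMLP, or spatial-temporal module, whose details the abstract does not give. It assumes frame features and per-identity prompt features have already been produced by some CLIP-style encoders, swaps in plain temporal average pooling for sequence aggregation, and uses a standard video-to-text cross-entropy loss. The function names `sequence_feature` and `video_to_text_loss` and the `temperature` value are hypothetical choices for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def sequence_feature(frame_feats):
    """Temporal average pooling (an assumed stand-in for real temporal modeling).

    frame_feats: (T, D) array of per-frame features.
    Returns a (D,) sequence-level feature.
    """
    return frame_feats.mean(axis=0)


def video_to_text_loss(seq_feats, text_feats, labels, temperature=0.07):
    """Cross-entropy over cosine similarities between sequences and identity prompts.

    seq_feats:  (N, D) sequence-level features (visible or infrared).
    text_feats: (C, D) one prompt feature per identity, shared by both modalities.
    labels:     (N,) integer identity labels in [0, C).
    """
    v = l2_normalize(seq_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 16))           # hypothetical prompt features, 4 identities
    labels = np.array([0, 1, 2, 3])
    # Hypothetical visible and infrared clips: 5 frames each, close to their prompt.
    visible = text[labels][:, None, :] + 0.01 * rng.normal(size=(4, 5, 16))
    infrared = text[labels][:, None, :] + 0.01 * rng.normal(size=(4, 5, 16))
    vis_seq = np.stack([sequence_feature(f) for f in visible])
    ir_seq = np.stack([sequence_feature(f) for f in infrared])
    # Both modalities are scored against the same prompts.
    total = video_to_text_loss(vis_seq, text, labels) + video_to_text_loss(ir_seq, text, labels)
    print(total)
```

Because visible and infrared sequences of the same identity are scored against the same prompt feature, minimizing this loss pulls both modalities toward a common target, which is one simple way to read "language guiding modality-invariant features".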

Taxonomy

Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Face recognition and analysis