Learning Language-Driven Sequence-Level Modal-Invariant Representations for Video-Based Visible-Infrared Person Re-Identification
Xiaomei Yang, Xizhan Gao, Antai Liu, Kang Wei, Fa Zhu, Guang Feng, Xiaofeng Qu, and Sijie Niu

TL;DR
This paper introduces LSMRL, a method for video-based visible-infrared person re-identification (VVI-ReID) that strengthens modal-invariant feature learning through language-driven modules and cross-modal interaction, and outperforms existing methods on large-scale VVI-ReID datasets.
Contribution
The paper proposes a comprehensive LSMRL framework with novel modules for efficient spatial-temporal modeling and explicit modality-level loss guidance, advancing cross-modal person re-identification.
Findings
LSMRL outperforms existing methods on large-scale VVI-ReID datasets.
The SD module effectively diffuses language prompts into visual features.
The CMI module refines modal-invariant representations through bidirectional self-attention.
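To make the idea of bidirectional cross-modal interaction concrete, here is a minimal sketch under stated assumptions. It is not the paper's CMI module: the summary above gives no architectural details, so the function names (`attend`, `bidirectional_interaction`), the identity projections (no learned query/key/value weights), and the residual addition are all hypothetical choices made for illustration. The sketch only shows one plausible reading of "bidirectional": visible tokens attend to infrared tokens, and infrared tokens attend to visible tokens.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention with identity projections.

    Assumption: no learned weights; keys and values are the same vectors.
    Each output row is a convex combination of the rows in keys_values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def bidirectional_interaction(visible, infrared):
    """Hypothetical two-way interaction: each modality queries the other.

    The attended features are added back residually (an illustrative choice).
    """
    vis_from_ir = attend(visible, infrared)   # visible tokens look at infrared
    ir_from_vis = attend(infrared, visible)   # infrared tokens look at visible
    vis_out = [[a + b for a, b in zip(x, y)] for x, y in zip(visible, vis_from_ir)]
    ir_out = [[a + b for a, b in zip(x, y)] for x, y in zip(infrared, ir_from_vis)]
    return vis_out, ir_out


# Toy sequence-level tokens: 2 visible and 3 infrared, each of dimension 2.
visible = [[1.0, 0.0], [0.0, 1.0]]
infrared = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
vis_out, ir_out = bidirectional_interaction(visible, infrared)
print(len(vis_out), len(ir_out))
```

Each modality keeps its own token count after the interaction, so the two output sequences can still be pooled or compared separately. A real implementation would presumably use learned projections and multiple heads.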
Abstract
The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving optimal performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes a spatial-temporal feature learning (STFL) module, a semantic diffusion (SD) module, and a cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Advanced Neural Network Applications
