Learning Language-Driven Sequence-Level Modal-Invariant Representations for Video-Based Visible-Infrared Person Re-Identification

Xiaomei Yang, Xizhan Gao, Antai Liu, Kang Wei, Fa Zhu, Guang Feng, Xiaofeng Qu, and Sijie Niu

arXiv:2601.12062·cs.CV·January 21, 2026

Open Access

TL;DR

This paper introduces LSMRL, a method for video-based visible-infrared person re-identification (VVI-ReID) that learns sequence-level modal-invariant features using language-driven modules and cross-modal interaction, and reports outperforming existing methods on large-scale VVI-ReID datasets.

Contribution

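The paper proposes the LSMRL framework, comprising spatial-temporal feature learning (STFL), semantic diffusion (SD), and cross-modal interaction (CMI) modules, to address limitations of prior language-guided methods in efficient spatial-temporal modeling, cross-modal interaction, and explicit modality-level loss guidance.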

Findings

01

LSMRL outperforms existing methods on large-scale VVI-ReID datasets.

02

The SD module effectively diffuses language prompts into visual features.

03

The CMI module refines modal-invariant representations through bidirectional self-attention.

Abstract

The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Although such methods achieve strong performance, they still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes a spatial-temporal feature learning (STFL) module, a semantic diffusion (SD) module, and a cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal…
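
As a rough illustration of the general idea mentioned in the abstract (a modality-shared language prompt guiding both visible and infrared features toward one representation), the toy sketch below may help. It is not the LSMRL method: the function names, the mean pooling used as a stand-in for temporal modeling, the hand-written vectors used in place of CLIP outputs, and the temperature value of 0.07 are all assumptions made for this example.

```python
import math

def mean_pool(frames):
    """Average per-frame feature vectors into one sequence-level vector.
    (Assumed stand-in for temporal modeling; not the paper's STFL module.)"""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prompt_alignment_loss(seq_feat, prompt_feats, target_id, temperature=0.07):
    """Cross-entropy over temperature-scaled cosine similarities between one
    sequence-level feature and every identity's prompt feature.
    (Hypothetical loss for illustration; temperature is an assumed value.)"""
    logits = [cosine(seq_feat, p) / temperature for p in prompt_feats]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_id]

# Toy prompt features, one per identity, shared by both modalities.
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Toy per-frame features for identity 0 in each modality.
visible_frames = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.2]]
infrared_frames = [[0.8, 0.2, 0.1], [1.0, 0.0, -0.1]]

vis_seq = mean_pool(visible_frames)
ir_seq = mean_pool(infrared_frames)

# Both modalities are pulled toward the same prompt for identity 0.
total_loss = (prompt_alignment_loss(vis_seq, prompts, target_id=0)
              + prompt_alignment_loss(ir_seq, prompts, target_id=0))
print(total_loss)
```

Because both sequence-level features are scored against the same prompt vector, lowering this loss moves the visible and infrared features for one identity toward a common target, which is the intuition behind using a shared prompt as a modality-invariant anchor.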

Taxonomy

Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Advanced Neural Network Applications