CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification

Xiaomei Yang, Xizhan Gao, Sijie Niu, Fa Zhu, Guang Feng, Xiaofeng Qu, and David Camacho

arXiv:2511.10309·cs.CV·November 14, 2025

TL;DR

This paper introduces CLIP4VI-ReID, a CLIP-based network that learns modality-shared representations for visible-infrared person re-identification (VI-ReID), using text semantics as a bridge to align the visible and infrared modalities.

Contribution

The paper proposes a CLIP-driven framework for cross-modal person re-identification with three components: Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA).

Findings

01. Achieves superior performance on VI-ReID datasets.

02. Effectively aligns visible and infrared modalities using text semantics.

03. Enhances discriminability of shared representations.

Abstract

This paper proposes a novel CLIP-driven modality-shared representation learning network, named CLIP4VI-ReID, for the visible-infrared person re-identification (VI-ReID) task. It consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the large gap in physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects ID-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Moreover, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level…
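The text-as-bridge idea described in the abstract can be illustrated with a CLIP-style image-to-text contrastive loss. The following is a minimal sketch, not the paper's actual objective: it assumes one text feature per identity (standing in for the text semantics that TSG derives from visible images) and pulls each infrared image feature toward the text feature of its own identity. The function names, the toy vectors, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def image_to_text_loss(image_feats, text_feats, labels, temperature=0.07):
    """Mean cross-entropy of each image feature against per-identity text features.

    image_feats: list of image feature vectors (e.g. infrared images).
    text_feats:  list of text feature vectors, one per identity (assumed given).
    labels:      identity index of each image, indexing into text_feats.
    """
    texts = [l2_normalize(t) for t in text_feats]
    total = 0.0
    for feat, label in zip(image_feats, labels):
        feat = l2_normalize(feat)
        # Cosine similarity to every identity's text feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(feat, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[label]
    return total / len(image_feats)


# Toy data: two identities with 2-D features (purely illustrative values).
text_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_ir = [[0.9, 0.1], [0.2, 0.8]]      # infrared features near their own text
misaligned_ir = [[0.1, 0.9], [0.8, 0.2]]   # infrared features near the wrong text
labels = [0, 1]

loss_aligned = image_to_text_loss(aligned_ir, text_feats, labels)
loss_misaligned = image_to_text_loss(misaligned_ir, text_feats, labels)
print(loss_aligned < loss_misaligned)
```

Because the same text features are also aligned with visible images, minimizing a loss of this kind on infrared features would bring the two image modalities closer indirectly, which is the sense in which text acts as a bridge.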

Taxonomy

Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Face recognition and analysis