SCING: Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning

Yunfei Xie; Yuxuan Cheng; Juncheng Wu; Haoyu Zhang; Yuyin Zhou; Shoudong Han

arXiv:2507.00506 · cs.CV · July 2, 2025

Open Access

TL;DR

SCING introduces a lightweight, cross-modal prompt tuning framework that improves person re-identification by enhancing alignment and robustness while reducing computational costs, outperforming existing methods on multiple benchmarks.

Contribution

The paper proposes SCING, a novel framework with SVIP and PDCA modules, for efficient and robust person ReID using cross-modal prompt tuning without heavy adapters.

Findings

01. Achieves state-of-the-art performance on multiple ReID benchmarks.

02. Reduces computational overhead compared to existing adapter-based methods.

03. Enhances robustness against real-world image perturbations.

Abstract

Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. Our method introduces two key innovations. First, we propose Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Second, we propose Perturbation-Driven Consistency Alignment (PDCA), a dual-path training strategy that enforces invariant feature alignment under random image…
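The abstract is truncated here and gives no formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. The gate form (an element-wise sigmoid over the text and visual values), the residual fusion `t + g * v`, the cosine-based consistency term, and every function and parameter name are hypothetical illustrations, not the authors' SVIP or PDCA implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(text_prompt, visual_feat, gate_w, gate_b):
    """Inject visual features into a text prompt vector through a gate.

    ASSUMPTION: element-wise gate g = sigmoid(w_t * t + w_v * v + b),
    fused value = t + g * v. The paper's actual gate may differ.
    """
    fused = []
    for t, v, (w_t, w_v), b in zip(text_prompt, visual_feat, gate_w, gate_b):
        g = sigmoid(w_t * t + w_v * v + b)  # how much visual signal to admit
        fused.append(t + g * v)
    return fused


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def consistency_loss(clean_feat, perturbed_feat):
    """ASSUMPTION: penalize disagreement between the clean path and the
    perturbed path as 1 - cosine similarity (0 when directions match)."""
    return 1.0 - cosine(clean_feat, perturbed_feat)


# Toy usage with made-up numbers.
text_prompt = [0.5, -1.0, 2.0]
visual_feat = [1.0, 1.0, 1.0]
gate_w = [(0.0, 0.0)] * 3

closed = gated_fuse(text_prompt, visual_feat, gate_w, [-50.0] * 3)  # gate ~ 0
opened = gated_fuse(text_prompt, visual_feat, gate_w, [50.0] * 3)   # gate ~ 1
```

With the gate driven towards zero the prompt passes through almost unchanged, and with the gate driven towards one the visual feature is added nearly in full, which is the selective behaviour a gating mechanism is meant to provide. In a real model the gate weights would be learned and the features would come from the CLIP encoders.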

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Multimodal Machine Learning Applications