ALADIN: Attribute-Language Distillation Network for Person Re-Identification
Wang Zhou, Boran Duan, Haojun Ai, Ruiqi Lan, Ziyue Zhou

TL;DR
ALADIN introduces a novel attribute-language distillation approach for person re-identification, enhancing fine-grained attribute understanding and robustness by leveraging CLIP and multimodal LLMs.
Contribution
It proposes a new attribute-local alignment and distillation framework that improves ReID performance and interpretability over existing global feature-based methods.
Findings
Significant performance gains on Market-1501, DukeMTMC-reID, and MSMT17 datasets.
Enhanced robustness under occlusions through attribute-local distillation.
Better generalization and interpretability compared to CNN-, Transformer-, and CLIP-based methods.
Abstract
Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the…
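The abstract does not give the exact form of its losses, so the following is only a minimal sketch of how an attribute-local distillation term and a cross-modal contrastive term are commonly written. The function names, the cosine-distance form of the distillation term, the one-directional InfoNCE form of the contrastive term, and the temperature of 0.07 are all assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def attribute_local_distillation_loss(text_attrs, local_feats):
    """Hypothetical distillation term: mean (1 - cosine) between the k-th
    attribute text embedding (teacher side) and the k-th local visual
    feature (student side). Zero when every pair is perfectly aligned."""
    pairs = list(zip(text_attrs, local_feats))
    return sum(1.0 - cosine(t, f) for t, f in pairs) / len(pairs)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical image-to-text InfoNCE term: the i-th image embedding
    should score highest against the i-th text embedding."""
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(image_embs)


if __name__ == "__main__":
    # Toy 2-D embeddings: two attributes, two local regions.
    text = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[2.0, 0.0], [0.0, 3.0]]
    swapped = [[0.0, 3.0], [2.0, 0.0]]
    print(attribute_local_distillation_loss(text, aligned))
    print(attribute_local_distillation_loss(text, swapped))
    print(contrastive_loss(aligned, text) < contrastive_loss(swapped, text))
```

In this toy setup, local features that point in the same direction as their attribute embeddings incur no distillation penalty, while swapped features are penalized by both terms. A real pipeline would compute these on batched tensors from the frozen CLIP teacher and the student network.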
