DiCo: Disentangled Concept Representation for Text-to-image Person Re-identification

Giyeol Kim; Chanho Eom

arXiv:2601.10053 · cs.CV · February 12, 2026

TL;DR

This paper introduces DiCo, a novel framework for text-to-image person re-identification that uses hierarchical, disentangled cross-modal representations to improve fine-grained matching and interpretability.

Contribution

DiCo is the first to employ shared slot-based, disentangled concept representations for hierarchical cross-modal alignment in TIReID.

Findings

1. Achieves competitive performance on CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets.

2. Enhances interpretability through explicit slot- and block-level representations.

3. Demonstrates effective disentanglement of attributes like color, texture, and shape.

Abstract

Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (e.g., color, texture, shape) while maintaining consistent part-level…
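
The slot-and-block hierarchy described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how slots are computed, how blocks are formed, or how they are compared. The sketch assumes each slot is a plain vector, that concept blocks are equal-sized contiguous chunks of that vector, and that matching uses cosine similarity averaged over blocks and then over slots. The names `split_into_blocks`, `cosine`, and `hierarchical_similarity` are hypothetical.

```python
import math


def split_into_blocks(slot, num_blocks):
    """Split one slot vector into equal-sized contiguous concept blocks.

    Assumption: blocks are simple chunks of the slot vector.
    """
    if len(slot) % num_blocks != 0:
        raise ValueError("slot dimension must be divisible by num_blocks")
    size = len(slot) // num_blocks
    return [slot[i * size:(i + 1) * size] for i in range(num_blocks)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def hierarchical_similarity(image_slots, text_slots, num_blocks):
    """Compare image and text slots block by block.

    Slot i of the image is paired with slot i of the text (the "shared
    anchor" idea). Returns (overall, per_slot, per_block) scores.
    Assumption: scores are aggregated by plain averaging.
    """
    per_block = []
    for img_slot, txt_slot in zip(image_slots, text_slots):
        img_blocks = split_into_blocks(img_slot, num_blocks)
        txt_blocks = split_into_blocks(txt_slot, num_blocks)
        per_block.append([cosine(a, b) for a, b in zip(img_blocks, txt_blocks)])
    per_slot = [sum(row) / len(row) for row in per_block]
    overall = sum(per_slot) / len(per_slot)
    return overall, per_slot, per_block


# Toy data: 2 slots (say, upper body and lower body), 4 dims each,
# split into 2 concept blocks (say, color and texture).
image_slots = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
text_slots = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0]]

overall, per_slot, per_block = hierarchical_similarity(
    image_slots, text_slots, num_blocks=2
)
print(per_block)  # block-level scores show which concept disagrees
print(per_slot)
print(overall)
```

In this toy input the second slot matches on its first block but not its second, so the block-level scores localize the mismatch to one concept within one part. That is the kind of slot- and block-level readout the interpretability finding refers to, here under the simplifying assumptions stated above.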

Taxonomy

Topics: Video Surveillance and Tracking Methods · Face recognition and analysis · Advanced Neural Network Applications