Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

Yang Qin; Chao Chen; Zhihang Fu; Dezhong Peng; Xi Peng; Peng Hu

arXiv:2506.11036·cs.LG·June 16, 2025

Open Access

TL;DR

This paper introduces an interactive learning framework using multimodal large language models to improve text-to-image person re-identification by refining queries and augmenting data, leading to significant performance gains.

Contribution

The paper proposes a novel human-centered interactive learning approach with a plug-and-play test-time module and a data augmentation strategy for enhanced TIReID performance.

Findings

01 Achieves state-of-the-art results on four TIReID benchmarks.

02 Effectively refines user queries through multimodal interactions.

03 Enhances training data quality with a new augmentation method.

Abstract

Despite remarkable advancements in text-to-image person re-identification (TIReID) facilitated by the breakthrough of cross-modal embedding models, existing methods often struggle to distinguish challenging candidate images due to intrinsic limitations, such as network architecture and data quality. To address these issues, we propose an Interactive Cross-modal Learning framework (ICL), which leverages human-centered interaction to enhance the discriminability of text queries through external multimodal knowledge. To achieve this, we propose a plug-and-play Test-time Human-centered Interaction (THI) module, which performs visual question answering focused on human characteristics, facilitating multi-round interactions with a multimodal large language model (MLLM) to align query intent with latent target images. Specifically, THI refines user queries based on the MLLM responses to…
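The multi-round refinement idea described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's THI module: the attribute-matching `rank` function, the `refine_query` helper, the question list, and the `answer_question` callable (a stand-in for an MLLM's visual question answering, stubbed here with a dictionary lookup) are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Optional

Attributes = Dict[str, str]


def rank(query: Attributes, gallery: Dict[str, Attributes]) -> List[str]:
    """Order image ids by how many query attributes they match (toy similarity)."""
    def score(image_id: str) -> int:
        attrs = gallery[image_id]
        return sum(attrs.get(k) == v for k, v in query.items())
    # sorted() is stable, so tied images keep their gallery order.
    return sorted(gallery, key=score, reverse=True)


def refine_query(
    query: Attributes,
    questions: List[str],
    answer_question: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Attributes:
    """Ask about characteristics the query does not mention and add the answers.

    `answer_question` is an assumed stand-in for an MLLM answering a question
    about a human characteristic; it returns None when it has no answer.
    """
    refined = dict(query)  # leave the caller's query untouched
    rounds = 0
    for attribute in questions:
        if rounds >= max_rounds:
            break
        if attribute in refined:
            continue  # the query already states this characteristic
        answer = answer_question(attribute)
        rounds += 1
        if answer is not None:
            refined[attribute] = answer
    return refined


# Toy gallery: attribute dictionaries stand in for real images.
GALLERY = {
    "img_a": {"top": "red jacket", "bottom": "jeans", "bag": "backpack"},
    "img_b": {"top": "red jacket", "bottom": "shorts", "bag": "none"},
    "img_c": {"top": "white shirt", "bottom": "jeans", "bag": "backpack"},
}
INITIAL_QUERY = {"top": "red jacket"}
STUB_ANSWERS = {"bottom": "shorts", "bag": "none"}  # stubbed MLLM responses

before = rank(INITIAL_QUERY, GALLERY)
refined = refine_query(INITIAL_QUERY, ["top", "bottom", "bag"], STUB_ANSWERS.get)
after = rank(refined, GALLERY)
```

In this toy setting the initial query cannot separate `img_a` from `img_b`, while the refined query can. A real system would replace the attribute matching with a cross-modal embedding model and the stub with actual MLLM calls.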

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Face recognition and analysis

Methods: ALIGN