Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Min Cao, Xinyu Zhou, Ding Jiang, Bo Du, Mang Ye, Min Zhang

TL;DR
This paper introduces a multilingual text-to-image person retrieval (TIPR) framework that uses bidirectional relation reasoning and aligning to improve cross-modal and cross-language matching, achieving state-of-the-art results on multilingual TIPR datasets.
Contribution
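The paper builds a multilingual TIPR benchmark, using large language models for initial translations and refining them with domain-specific knowledge, and proposes Bi-IRRA (Bidirectional Implicit Relation Reasoning and Aligning), a framework that models local relations and global alignment across languages and modalities.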
Findings
Achieves new state-of-the-art results on multilingual TIPR datasets.
Effectively models local relations across languages and modalities.
Enhances cross-modal and cross-language alignment with bidirectional reasoning.
Abstract
Text-to-image person retrieval (TIPR) aims to identify a target person from textual descriptions, and its central challenge is modality heterogeneity. Prior work has attempted to address this by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, which restricts their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages…
