Look Before You Leap: Improving Text-based Person Retrieval by Learning A Consistent Cross-modal Common Manifold
Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, Yifeng Li

TL;DR
This paper introduces LBUL, a novel approach for text-based person retrieval that learns a consistent cross-modal common manifold by considering both visual and textual data distributions, leading to improved retrieval accuracy.
Contribution
The paper proposes LBUL, a new algorithm that addresses the cross-modal distribution consensus prediction (CDCP) dilemma by taking the distribution characteristics of both modalities into account before embedding, which improves cross-modal alignment.
Findings
LBUL outperforms previous methods on CUHK-PEDES and RSTPReid datasets.
LBUL achieves state-of-the-art retrieval accuracy.
Considering both modalities' distributions improves cross-modal alignment.
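The retrieval setting behind these findings can be sketched in a few lines: once text and image features live in a common space, each text query ranks the image gallery by similarity, and accuracy is the fraction of queries whose correct person appears among the top k results. The snippet below is a generic illustration, not the LBUL method. Cosine similarity, the Rank-k metric, and every function and variable name are assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def rank_k_accuracy(text_emb, image_emb, text_ids, image_ids, k=1):
    """Fraction of text queries with a matching identity in the top-k images.

    text_emb:  (num_queries, dim) text embeddings in the common space (assumed given)
    image_emb: (num_gallery, dim) image embeddings in the same space (assumed given)
    text_ids, image_ids: integer person identities for queries and gallery
    """
    sims = l2_normalize(text_emb) @ l2_normalize(image_emb).T
    # Indices of the k most similar gallery images for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (image_ids[topk] == text_ids[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data: two queries, three gallery images, hypothetical values.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
image_emb = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]])
text_ids = np.array([0, 1])
image_ids = np.array([0, 1, 2])
rank1 = rank_k_accuracy(text_emb, image_emb, text_ids, image_ids, k=1)
```

In this toy case each query's nearest image carries the matching identity. How well the two modalities are aligned in the common space determines how often that holds on real data.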
Abstract
The core problem of text-based person retrieval is how to bridge the heterogeneous gap between multi-modal data. Many previous approaches learn a latent common manifold mapping paradigm in a cross-modal distribution consensus prediction (CDCP) manner. When features from the distribution of one modality are mapped into the common manifold, the feature distribution of the opposite modality is completely invisible. That is to say, how to achieve a cross-modal distribution consensus, so as to embed and align the multi-modal features in a constructed cross-modal common manifold, depends entirely on the experience of the model itself rather than on the actual situation. With such methods, the multi-modal data inevitably cannot be well aligned in the common manifold, which ultimately leads to sub-optimal retrieval performance. To overcome this CDCP…
Taxonomy
Methods: ALIGN
