Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation
Chao Yuan, Yujian Zhao, Haoxuan Xu, Guanglin Niu

TL;DR
This paper introduces a multi-view semantic reformulation framework driven by LLMs to improve robustness in text-to-image person retrieval by addressing expression drift and enhancing cross-modal consistency.
Contribution
It proposes a novel multi-view reformulation and feature compensation approach that boosts retrieval accuracy without additional training.
Findings
Achieves state-of-the-art results on three text-to-image person retrieval datasets.
Enhances cross-modal consistency through multi-view semantic reformulation.
Improves robustness without additional training.
Abstract
In text-to-image person retrieval tasks, the diversity of natural language expressions and the implicitness of visual semantics often lead to the problem of Expression Drift, where semantically equivalent texts exhibit significant feature discrepancies in the embedding space due to phrasing variations, thereby degrading the robustness of image-text alignment. This paper proposes a semantic compensation framework (MVR) driven by Large Language Models (LLMs), which enhances cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The core methodology comprises three components: Multi-View Reformulation (MVR): A dual-branch prompting strategy combines key feature guidance (extracting visually critical components via feature similarity) and diversity-aware rewriting to generate semantically equivalent yet distributionally diverse textual…
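The abstract is cut off before the feature compensation step is described, so the following is only an illustrative sketch and not the authors' method. It assumes, purely for illustration, that compensation can be approximated by averaging the L2-normalized embeddings of several semantically equivalent rewrites. The toy vectors and the names `l2_normalize`, `cosine`, and `compensate` are hypothetical stand-ins for the outputs of a real text encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def compensate(view_embeddings):
    """Assumed fusion rule: mean of unit-normalized views, renormalized."""
    if not view_embeddings:
        raise ValueError("need at least one embedding")
    normalized = [l2_normalize(v) for v in view_embeddings]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


# Hypothetical embeddings of three rewrites of the same person description.
views = [[1.0, 0.2, 0.0], [0.9, 0.0, 0.3], [1.0, -0.1, 0.1]]
fused = compensate(views)

# Worst-case agreement between any two rewrites (a proxy for expression drift).
worst_pair = min(
    cosine(views[i], views[j])
    for i in range(len(views))
    for j in range(i + 1, len(views))
)
# Worst-case agreement between the fused feature and any single rewrite.
worst_fused = min(cosine(fused, v) for v in views)
print(f"worst pairwise: {worst_pair:.4f}, worst fused-to-view: {worst_fused:.4f}")
```

In this toy case the fused feature sits closer to every rewrite than the two most divergent rewrites sit to each other, which is the intuition behind using multiple views to dampen phrasing-induced variation. Whether the paper fuses features this way cannot be determined from the truncated abstract.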
