GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval
Hao Zou, Runqing Zhang, Xue Zhou, Jianxiao Zou

TL;DR
The paper introduces GEA, an approach to text-to-image person retrieval that uses diffusion-generated images to improve cross-modal alignment and narrow the modality gap between text and images, with reported gains over existing methods on three public datasets.
Contribution
GEA is the first to incorporate diffusion-generated images and a generative fusion module for enhanced cross-modal alignment in text-to-image person retrieval (TIPR).
Findings
GEA outperforms existing methods on three public datasets.
Diffusion-generated images enrich the semantic representation of the text query.
The approach reduces the modality gap and overfitting.
Abstract
Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, textual queries sometimes cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose Generation-Enhanced Alignment (GEA), which approaches the problem from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and…
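As a rough illustration of the retrieval task the abstract describes, the sketch below ranks a gallery of person images against one text query by cosine similarity in a shared embedding space. It is a generic baseline step, not GEA's TGTE or fusion modules; the function name `rank_gallery` and the toy 3-dimensional embeddings are assumptions made only for this example, and a real system would obtain the embeddings from trained text and image encoders.

```python
import numpy as np


def rank_gallery(text_emb, image_embs):
    """Rank gallery images by cosine similarity to a text query.

    text_emb:   shape (d,)   embedding of the textual description (assumed given)
    image_embs: shape (n, d) embeddings of the n gallery images (assumed given)
    Returns (order, sims): gallery indices from best to worst match, and the
    cosine similarity of each gallery image to the query.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = g @ t
    # Negate so that argsort returns indices in descending similarity order.
    order = np.argsort(-sims)
    return order, sims


# Toy embeddings (hypothetical values, chosen only to make the ranking obvious).
text_emb = np.array([1.0, 0.0, 0.0])
image_embs = np.array([
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # closest to the query
    [0.5, 0.5, 0.0],  # in between
])

order, sims = rank_gallery(text_emb, image_embs)
print(order.tolist())
```

When the text embedding captures the image content poorly, the correct gallery image can fall down this ranking, which is the alignment problem the abstract sets out to address.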
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Face recognition and analysis
