GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval

Hao Zou; Runqing Zhang; Xue Zhou; Jianxiao Zou

arXiv:2511.10154·cs.CV·November 14, 2025

Open Access

TL;DR

The paper introduces GEA, an approach to text-to-image person retrieval that uses diffusion-generated images to improve cross-modal alignment and narrow the modality gap between text and image, and reports better retrieval accuracy than existing methods on three public datasets.

Contribution

GEA is the first to incorporate diffusion-generated images and a generative fusion module for enhanced cross-modal alignment in TIPR.

Findings

01

GEA outperforms existing methods on three public datasets.

02

Generated images significantly improve semantic representation.

03

The approach effectively reduces the modality gap and overfitting.

Abstract

Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, textual queries sometimes cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose Generation-Enhanced Alignment (GEA), which approaches the problem from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and…
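
To make the retrieval setting in the abstract concrete, here is a minimal sketch of TIPR as nearest-neighbor ranking over embeddings, with an optional query-enhancement step. It is not the authors' TGTE module or any part of GEA as published: the function names (`enhance_query`, `rank_gallery`), the mixing weight `alpha`, and the simple weighted-average fusion are all hypothetical, and the embeddings are toy 2-D vectors standing in for the outputs of real text and image encoders. The only idea it illustrates is that an embedding of an image generated from the text could be blended into the text query before ranking the gallery.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def enhance_query(text_emb, generated_emb, alpha=0.5):
    """Hypothetical fusion (an assumption, not the paper's method):
    a convex combination of the unit-length text embedding and the
    unit-length embedding of an image generated from that text."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    t = l2_normalize(text_emb)
    g = l2_normalize(generated_emb)
    mixed = [(1.0 - alpha) * x + alpha * y for x, y in zip(t, g)]
    return l2_normalize(mixed)


def rank_gallery(query_emb, gallery):
    """Return gallery identifiers sorted from most to least similar to the query."""
    scores = {pid: cosine(query_emb, emb) for pid, emb in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy gallery: identifiers and vectors are made up for illustration.
    gallery = {"person_a": [1.0, 0.0], "person_b": [0.0, 1.0]}
    text_emb = [0.6, 0.8]        # stand-in for a text encoder output
    generated_emb = [1.0, 0.0]   # stand-in for an encoded generated image

    print(rank_gallery(text_emb, gallery))
    print(rank_gallery(enhance_query(text_emb, generated_emb), gallery))
```

In this toy setup the text-only query ranks `person_b` first, while the blended query ranks `person_a` first, which shows only that an intermediate visual embedding can shift the ranking. Whether the shift helps depends on the quality of the generated image and on how the fusion is learned, which the paper handles with trained modules rather than a fixed average.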

Taxonomy

Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Face recognition and analysis