Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval
Delong Liu, Haiwen Li, Zhicheng Zhao, Yuan Dong

TL;DR
This paper introduces a framework for Text-to-Image Person Retrieval (TIPR) that strengthens cross-modal alignment by fine-tuning CLIP, adding a text-guided image restoration auxiliary task, and applying pruning-based data augmentation, which improves retrieval accuracy.
Contribution
The paper proposes a new TIPR framework with a text-guided image restoration task and pruning-based data augmentation, advancing fine-grained cross-modal alignment and discriminability.
Findings
Outperforms state-of-the-art methods on three benchmark datasets.
Effective alignment of local textual and visual features.
Improved discriminability for minor differences in person images.
Abstract
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between visual and textual modalities. The prevailing methods map texts and images into a unified embedding space for matching, but the intricate semantic correspondences between texts and images are still not effectively constructed. To address this issue, we propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts. Specifically, by fine-tuning the Contrastive Language-Image Pre-training (CLIP) model, a visual-textual dual encoder is first constructed to preliminarily align the image and text features. Second, a Text-guided Image Restoration (TIR) auxiliary task is proposed to map abstract textual…
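The matching step that the abstract attributes to prevailing methods (embedding texts and images in one space and comparing them there) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cosine_similarity` and `rank_gallery` and the three-dimensional toy vectors are assumptions for illustration, standing in for embeddings that a dual encoder such as a fine-tuned CLIP model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(text_embedding, image_embeddings):
    """Return gallery indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (hypothetical values, not produced by any real encoder).
text_query = [0.9, 0.1, 0.0]
gallery = [
    [0.0, 1.0, 0.0],  # image 0: nearly orthogonal to the query
    [1.0, 0.0, 0.0],  # image 1: closest to the query
    [0.5, 0.5, 0.0],  # image 2: partial match
]

ranking = rank_gallery(text_query, gallery)
print(ranking)
```

In a retrieval setting the top-ranked indices would be returned as the candidate person images for the description. A global score of this kind compares whole embeddings only, which is the limitation the abstract points to when it argues for finer-grained alignment.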
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Focus · Triplet Loss · Contrastive Language-Image Pre-training
