Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval

Delong Liu; Haiwen Li; Zhicheng Zhao; Yuan Dong

arXiv:2307.09059 · cs.CL · January 20, 2025 · 1 citation

TL;DR

This paper introduces a novel framework for Text-to-Image Person Retrieval that enhances cross-modal alignment through fine-tuning CLIP, a text-guided image restoration task, and data augmentation, leading to improved retrieval accuracy.

Contribution

The paper proposes a new Text-to-Image Person Retrieval (TIPR) framework with a text-guided image restoration task and pruning-based data augmentation, advancing fine-grained cross-modal alignment and discriminability.

Findings

01 Outperforms state-of-the-art on three benchmark datasets.

02 Effective alignment of local textual and visual features.

03 Improved discriminability for minor differences in person images.

Abstract

The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between visual and textual modalities. The prevailing methods map texts and images into a unified embedding space for matching, but the intricate semantic correspondences between texts and images are still not effectively constructed. To address this issue, we propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts. Specifically, a visual-textual dual encoder is first constructed by fine-tuning the Contrastive Language-Image Pre-training (CLIP) model, to preliminarily align the image and text features. Second, a Text-guided Image Restoration (TIR) auxiliary task is proposed to map abstract textual…
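The matching step the abstract describes, mapping texts and images into one embedding space and aligning them contrastively, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and the real framework uses fine-tuned CLIP encoders rather than hand-written vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(text_embs, image_embs):
    """Cosine similarity between every text (rows) and every image (columns)."""
    texts = [l2_normalize(v) for v in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, g)) for g in images] for t in texts]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """CLIP-style loss: average of text-to-image and image-to-text terms.

    Assumes text_embs[i] and image_embs[i] describe the same person.
    The temperature value is an illustrative assumption.
    """
    sim = similarity_matrix(text_embs, image_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def retrieve(text_emb, image_embs, top_k=1):
    """Return indices of the top_k gallery images most similar to the query text."""
    sims = similarity_matrix([text_emb], image_embs)[0]
    order = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
    return order[:top_k]


# Toy embeddings (hypothetical): text i matches image i.
toy_texts = [[1.0, 0.0], [0.0, 1.0]]
toy_images = [[0.9, 0.1], [0.2, 0.8]]

aligned_loss = symmetric_contrastive_loss(toy_texts, toy_images)
mismatched_loss = symmetric_contrastive_loss(toy_texts, toy_images[::-1])
ranking = retrieve(toy_texts[0], toy_images, top_k=2)
```

In this toy setup the loss is lower when each text is paired with its own image than when the pairs are swapped, which is the property that contrastive fine-tuning exploits. At query time, retrieval reduces to ranking gallery images by cosine similarity to the text embedding.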

Code & Models

Repositories

delong-liu-bupt/sen (PyTorch, official)

Taxonomy

Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Focus · Triplet Loss · Contrastive Language-Image Pre-training