Text-based Person Search without Parallel Image-Text Data
Yang Bai, Jingyao Wang, Min Cao, Chen Chen, Ziqiang Cao, Liqiang Nie, and Min Zhang

TL;DR
This paper introduces a novel two-stage framework for text-based person search that does not require parallel image-text data, utilizing generated descriptions and confidence-based training to achieve competitive results.
Contribution
First exploration of TBPS without parallel image-text data, employing a generation-then-retrieval approach with fine-grained captioning and confidence scoring.
Findings
Achieves promising performance on multiple TBPS benchmarks.
Effectively generates detailed image descriptions without parallel data.
Improves training reliability through a confidence score-based training scheme.
Abstract
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description. Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect. In this paper, we make the first attempt to explore TBPS without parallel image-text data (μ-TBPS), in which only non-parallel images and texts, or even image-only data, can be adopted. Towards this end, we propose a two-stage framework, generation-then-retrieval (GTR), to first generate the corresponding pseudo text for each image and then perform the retrieval in a supervised manner. In the generation stage, we propose a fine-grained image captioning strategy to obtain an enriched description of the person image, which firstly utilizes a set of instruction prompts to activate the off-the-shelf pretrained…
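The abstract is truncated before it explains how the confidence scores are computed or applied, so the following is only an illustrative sketch of the general idea of down-weighting unreliable pseudo texts during retrieval training. It assumes a standard image-to-text contrastive (softmax cross-entropy) loss in which each image's generated description sits on the diagonal of a similarity matrix, and it assumes a per-pair confidence in [0, 1] is already available. The function name, the `temperature` parameter, and the weighting rule are all hypothetical and are not taken from the paper.

```python
import math


def confidence_weighted_contrastive_loss(sim, confidence, temperature=0.07):
    """Image-to-text contrastive loss with per-pair confidence weights.

    sim[i][j]     : similarity between image i and pseudo text j
                    (the pseudo text generated for image i is at j == i).
    confidence[i] : assumed reliability of pseudo text i, in [0, 1].
    This is an assumed formulation, not the authors' training objective.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-likelihood of the matched (diagonal) pseudo text.
        nll = log_z - logits[i]
        # Low-confidence pseudo texts contribute less to the loss.
        total += confidence[i] * nll
    weight_sum = sum(confidence)
    return total / weight_sum if weight_sum > 0 else 0.0


# Toy batch: image 0 matches its pseudo text well, image 1 does not.
toy_sim = [[1.0, 0.0],
           [1.0, 0.0]]
loss_all = confidence_weighted_contrastive_loss(toy_sim, [1.0, 1.0], temperature=1.0)
loss_clean = confidence_weighted_contrastive_loss(toy_sim, [1.0, 0.0], temperature=1.0)
```

In this toy batch, giving the poorly matched second pair a confidence of zero removes its contribution, so `loss_clean` is lower than `loss_all`. Normalizing by the sum of the weights rather than the batch size keeps the loss scale comparable when many pairs are down-weighted.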
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
