CONQUER: Context-Aware Representation with Query Enhancement for Text-Based Person Search
Zequn Xie

TL;DR
CONQUER is a two-stage framework for text-based person search. It improves cross-modal alignment during training and refines user queries at inference, which raises retrieval accuracy, especially for ambiguous or incomplete queries.
Contribution
The paper introduces CONQUER, a framework that combines multi-granularity encoding, optimal transport-based matching, and inference-time query refinement to advance the state of the art in text-based person search.
Findings
Outperforms existing methods on CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets.
Improves Rank-1 accuracy and mAP in cross-domain and incomplete-query scenarios.
Provides a practical, plug-and-play query enhancement module without retraining the backbone.
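For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how Rank-1 accuracy and mAP are commonly computed for text-to-image retrieval. It is not taken from the paper: the function name `rank1_and_map` and its arguments are hypothetical, and it assumes a precomputed query-by-gallery similarity matrix plus integer identity labels for each query and gallery image.

```python
from typing import Sequence, Tuple


def rank1_and_map(
    similarity: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
) -> Tuple[float, float]:
    """Return (Rank-1 accuracy, mAP) for a query-by-gallery similarity matrix.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    hits = 0
    ap_sum = 0.0
    for q, scores in enumerate(similarity):
        # Rank gallery images by descending similarity to this text query.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        matches = [gallery_ids[j] == query_ids[q] for j in order]

        # Rank-1: the top-ranked image shows the queried person.
        if matches[0]:
            hits += 1

        # Average precision: mean of precision@k over every rank k
        # at which a correct image appears.
        found = 0
        precisions = []
        for k, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / k)
        ap_sum += sum(precisions) / len(precisions) if precisions else 0.0

    n = len(similarity)
    return hits / n, ap_sum / n
```

Rank-1 only rewards a correct top result, while mAP also credits correct images further down the ranking, so the two can move independently when a query is vague.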
Abstract
Text-Based Person Search (TBPS) aims to retrieve pedestrian images from large galleries using natural language descriptions. This task, essential for public safety applications, is hindered by cross-modal discrepancies and ambiguous user queries. We introduce CONQUER, a two-stage framework designed to address these challenges by enhancing cross-modal alignment during training and adaptively refining queries at inference. During training, CONQUER employs multi-granularity encoding, complementary pair mining, and context-guided optimal matching based on Optimal Transport to learn robust embeddings. At inference, a plug-and-play query enhancement module refines vague or incomplete queries via anchor selection and attribute-driven enrichment, without requiring retraining of the backbone. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that CONQUER consistently…
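The abstract does not spell out how the Optimal Transport matching is formulated, so the following is only a generic sketch of entropy-regularized OT (Sinkhorn iterations) between text-token and image-patch embeddings. Everything in it is an assumption made for illustration: the function names `cosine`, `sinkhorn`, and `ot_similarity`, the uniform marginals, the cosine-distance cost, and the values of `eps` and `n_iters`. It should not be read as CONQUER's context-guided matching.

```python
import math
from typing import List, Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def sinkhorn(cost: List[List[float]], eps: float = 0.1, n_iters: int = 200) -> List[List[float]]:
    """Entropy-regularized transport plan with uniform marginals (assumed)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass assigned to each text token
    b = [1.0 / m] * m  # mass assigned to each image patch
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_similarity(text_tokens: List[List[float]], image_patches: List[List[float]]) -> float:
    """Plan-weighted sum of token-patch cosine similarities."""
    sims = [[cosine(t, p) for p in image_patches] for t in text_tokens]
    cost = [[1.0 - s for s in row] for row in sims]  # cosine distance as cost
    plan = sinkhorn(cost)
    return sum(plan[i][j] * sims[i][j] for i in range(len(sims)) for j in range(len(sims[0])))
```

The transport plan acts as a soft assignment: each token spreads its mass over the patches it resembles most, so the final score depends on fine-grained correspondences rather than on a single pooled embedding per modality.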
Taxonomy
Topics: Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications · Advanced Neural Network Applications
