Improving Text-based Person Search via Part-level Cross-modal Correspondence
Jicheol Park, Boseung Jeong, Dongwon Kim, and Suha Kwak

TL;DR
This paper presents an encoder-decoder model, trained with a commonality-based margin ranking loss, that improves text-based person search by aligning image and text features from coarse to fine levels without alignment supervision. It reports state-of-the-art results on three public benchmarks.
Contribution
It introduces a coarse-to-fine embedding approach with a new ranking loss that captures part-level correspondence without requiring part annotations.
Findings
Achieves top performance on three public benchmarks.
Effectively aligns text and image features at multiple semantic levels.
Demonstrates the effectiveness of part-level correspondence in person search.
Abstract
Text-based person search is the task of finding person images that are the most relevant to the natural language text description given as query. The main challenge of this task is a large gap between the target images and text queries, which makes it difficult to establish correspondence and distinguish subtle differences across people. To address this challenge, we introduce an efficient encoder-decoder model that extracts coarse-to-fine embedding vectors which are semantically aligned across the two modalities without supervision for the alignment. There is another challenge of learning to capture fine-grained information with only person IDs as supervision, where similar body parts of different individuals are considered different due to the lack of part-level supervision. To tackle this, we propose a novel ranking loss, dubbed commonality-based margin ranking loss, which quantifies…
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Speech and dialogue systems
