SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking
Yingjia Xu, Jinlin Wu, Daming Gao, Zhen Chen, Yang Yang, Min Cao, Mang Ye, and Zhen Lei

TL;DR
SA-Person introduces a scene-aware re-ranking framework for text-based person retrieval, leveraging global scene context and a new large-scale dataset to improve accuracy in complex real-world scenarios.
Contribution
It proposes a novel scene-aware retrieval paradigm, introduces the ScenePerson-13W dataset, and develops a two-stage framework with a scene-aware re-ranking module.
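The two-stage idea can be illustrated with a minimal sketch. The summary does not describe the internals of the re-ranking module, so everything below is an assumption made for illustration: the function names, the use of cosine similarity, the linear fusion weight `alpha`, and the candidate cutoff `top_k` are hypothetical and are not taken from the paper. Stage one shortlists gallery images by appearance similarity alone; stage two re-scores only that shortlist by mixing in a scene similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(text_person, text_scene, gallery, top_k=3, alpha=0.5):
    """Hypothetical two-stage retrieval with scene-aware re-ranking.

    gallery: list of (image_id, person_vec, scene_vec) tuples.
    alpha:   assumed fusion weight for the scene score (not from the paper).
    """
    # Stage 1: rank the whole gallery by appearance similarity only.
    scored = [(cosine(text_person, p), image_id, s) for image_id, p, s in gallery]
    scored.sort(key=lambda item: item[0], reverse=True)
    candidates = scored[:top_k]

    # Stage 2: re-rank the shortlist by fusing appearance and scene similarity.
    reranked = [
        ((1 - alpha) * appearance + alpha * cosine(text_scene, s), image_id)
        for appearance, image_id, s in candidates
    ]
    reranked.sort(key=lambda item: item[0], reverse=True)
    return [image_id for _, image_id in reranked]
```

With `alpha=0` the function reduces to plain appearance-based ranking of the shortlist; raising `alpha` lets a candidate whose scene matches the description overtake one that matches on appearance alone.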
Findings
Significant improvement over existing methods on ScenePerson-13W
Effective integration of scene context enhances retrieval accuracy
Public release of dataset and code to support future research
Abstract
Text-based person retrieval aims to identify a target individual from an image gallery using a natural language description. Existing methods primarily focus on appearance-driven cross-modal retrieval, yet face significant challenges due to the visual complexity of scenes and the inherent ambiguity of textual descriptions. Contextual information, such as landmarks and relational cues, offers valuable complementary insights for retrieval, but remains underexploited in current approaches. Motivated by this limitation, we propose a novel paradigm: scene-aware text-based person retrieval, which explicitly integrates both individual appearance and global scene context to improve retrieval accuracy. To support this, we first introduce ScenePerson-13W, a large-scale benchmark dataset comprising over 100,000 real-world scenes with rich annotations…
