SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking

Yingjia Xu, Jinlin Wu, Daming Gao, Zhen Chen, Yang Yang, Min Cao, Mang Ye, and Zhen Lei

arXiv:2505.24466·cs.CV·November 25, 2025

TL;DR

SA-Person introduces a scene-aware re-ranking framework for text-based person retrieval, leveraging global scene context and a new large-scale dataset to improve accuracy in complex real-world scenarios.

Contribution

It proposes a novel scene-aware retrieval paradigm, introduces the ScenePerson-13W dataset, and develops a two-stage framework with a scene-aware re-ranking module.
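As a rough illustration of what a two-stage retrieve-then-re-rank pipeline can look like, the sketch below first ranks gallery items by appearance similarity to the text query, then re-scores only the top candidates with an added scene-similarity term. This is not the authors' implementation: the function name `two_stage_retrieve`, the precomputed embeddings, cosine similarity, and the linear fusion weight `alpha` are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def two_stage_retrieve(text_emb, person_embs, scene_embs, top_k=3, alpha=0.5):
    """Hypothetical two-stage retrieval with a scene-aware re-ranking step.

    text_emb:    (d,)   embedding of the text query
    person_embs: (n, d) appearance embeddings of the gallery (assumed given)
    scene_embs:  (n, d) scene-context embeddings of the gallery (assumed given)
    alpha:       weight of the scene score in the fused score (assumption)
    """
    q = l2_normalize(np.asarray(text_emb, dtype=float))
    person_embs = np.asarray(person_embs, dtype=float)
    scene_embs = np.asarray(scene_embs, dtype=float)

    # Stage 1: coarse ranking by appearance similarity only.
    appearance = l2_normalize(person_embs) @ q
    k = min(top_k, len(appearance))
    candidates = np.argsort(-appearance, kind="stable")[:k]

    # Stage 2: re-rank the shortlist using appearance plus scene similarity.
    scene = l2_normalize(scene_embs[candidates]) @ q
    fused = (1.0 - alpha) * appearance[candidates] + alpha * scene
    order = np.argsort(-fused, kind="stable")
    return candidates[order], fused[order]
```

With `alpha=0` the second stage reduces to the appearance-only ranking, which makes it easy to check how much the scene term changes the order of the shortlist.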

Findings

01. Significant improvement over existing methods on ScenePerson-13W

02. Effective integration of scene context enhances retrieval accuracy

03. Public release of dataset and code to support future research

Abstract

Text-based person retrieval aims to identify a target individual from an image gallery using a natural language description. Existing methods primarily focus on appearance-driven cross-modal retrieval, yet face significant challenges due to the visual complexity of scenes and the inherent ambiguity of textual descriptions. Contextual information, such as landmarks and relational cues, offers valuable complementary signals for retrieval but remains underexploited in current approaches. Motivated by this limitation, we propose a novel paradigm: scene-aware text-based person retrieval, which explicitly integrates both individual appearance and global scene context to improve retrieval accuracy. To support this, we first introduce ScenePerson-13W, a large-scale benchmark dataset comprising over 100,000 real-world scenes with rich annotations…
