Unified Multi-Dataset Training for TBPS

Nilanjana Chatterjee; Sidharatha Garg; A V Subramanyam; Brejesh Lall

arXiv:2601.14978·cs.CV·January 22, 2026

Open Access

TL;DR

This paper introduces Scale-TBPS, a unified training approach for Text-Based Person Search (TBPS) that trains a single model across multiple datasets. It addresses two limitations of earlier work: reliance on dataset-specific fine-tuning and vulnerability to noisy image-text pairs. The resulting unified model outperforms dataset-specific models.

Contribution

The paper proposes a noise-aware dataset curation strategy and a scalable discriminative identity learning framework, which together enable unified TBPS training across multiple datasets.

Findings

01 Unified model outperforms dataset-specific models

02 Effective handling of noisy image-text pairs

03 Scales well with many person identities

Abstract

Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Face recognition and analysis