SCMM: Calibrating Cross-modal Representations for Text-Based Person Search

Jing Liu; Donglai Wei; Yang Liu; Sipeng Zhang; Tong Yang; Wei Zhou; Weiping Ding; Victor C. M. Leung

arXiv:2304.02278 · cs.CV · December 9, 2025 · 6 cites

TL;DR

SCMM introduces a unified framework with novel calibration and masked modeling techniques to improve cross-modal representations for text-based person search, achieving state-of-the-art results.

Contribution

The paper proposes SCMM, a new method combining sew calibration and masked caption modeling to enhance fine-grained cross-modal alignment in person search.

Findings

01. Achieves state-of-the-art Rank-1 accuracy on CUHK-PEDES (73.81%)

02. Demonstrates effectiveness of each component through ablation studies

03. Balances representation quality and computational efficiency
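
For readers unfamiliar with the Rank-1 figure quoted in the findings above, the following is a minimal sketch of how Rank-k retrieval accuracy is conventionally computed: a query counts as a hit if any of its k most similar gallery images shares the query's identity. The function name `rank_k_accuracy`, the toy similarity matrix, and the identity labels are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def rank_k_accuracy(sim, query_ids, gallery_ids, k=1):
    """Fraction of text queries whose top-k gallery images contain a correct identity.

    sim[i][j] is the similarity between text query i and gallery image j.
    All names and data here are hypothetical, for illustration only.
    """
    hits = 0
    for row, qid in zip(sim, query_ids):
        # Gallery indices sorted from most to least similar for this query.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(gallery_ids[j] == qid for j in order[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy example: 3 text queries scored against 3 gallery images.
sim = [
    [0.9, 0.1, 0.3],  # query 0 (identity 7): best match is gallery 0 (identity 7)
    [0.2, 0.4, 0.8],  # query 1 (identity 8): best match is gallery 2 (identity 8)
    [0.5, 0.2, 0.6],  # query 2 (identity 9): best match is gallery 2 (identity 8)
]
query_ids = [7, 8, 9]
gallery_ids = [7, 9, 8]

rank1 = rank_k_accuracy(sim, query_ids, gallery_ids, k=1)
rank3 = rank_k_accuracy(sim, query_ids, gallery_ids, k=3)
```

In this toy setup two of the three queries retrieve the right identity first, so `rank1` is 2/3, while `rank3` is 1.0 because every correct identity appears somewhere in the top three.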

Abstract

Text-Based Person Search (TBPS) aims to retrieve target person images from a large-scale gallery using natural language descriptions, posing fundamental challenges in cross-modal representation learning. Existing methods often struggle to bridge the semantic gap between heterogeneous modalities while capturing fine-grained correspondences essential for discriminating visually similar individuals. To address these challenges, we propose Sew Calibration and Masked Modeling (SCMM), a unified framework that calibrates cross-modal representations through complementary learning mechanisms. Notably, SCMM introduces two novel components: a sew calibration loss that dynamically aligns image-text features using quality-guided adaptive margins based on textual information density, and a masked caption modeling loss that establishes fine-grained cross-modal correspondences through transformer-based…
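
The abstract describes the sew calibration loss only at a high level, so the following is a rough sketch of the general idea of an adaptive margin, not the paper's formulation. It assumes a hinge-style ranking loss over cosine similarities and, purely as a stand-in for "textual information density", uses normalized caption length. The functions `adaptive_margin` and `margin_ranking_loss` and the values `max_tokens=77` and `base_margin=0.2` are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adaptive_margin(num_tokens, max_tokens=77, base_margin=0.2):
    """Hypothetical margin that grows with caption length.

    ASSUMPTION: caption length is used here as a crude proxy for textual
    information density; the paper's actual quality measure is not given
    in the abstract.
    """
    density = min(num_tokens, max_tokens) / max_tokens
    return base_margin * density


def margin_ranking_loss(text_vec, pos_img, neg_img, num_tokens):
    """Hinge loss asking the matched image to beat the mismatched one by the margin.

    ASSUMPTION: a generic triplet-style form, not the sew calibration loss itself.
    """
    margin = adaptive_margin(num_tokens)
    return max(0.0, margin - cosine(text_vec, pos_img) + cosine(text_vec, neg_img))


# A well-aligned pair incurs no loss; a swapped pair is penalized.
loss_aligned = margin_ranking_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], num_tokens=77)
loss_swapped = margin_ranking_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], num_tokens=77)
```

Under these assumptions, a longer caption demands a larger similarity gap between the matched and mismatched images, while a short, uninformative caption is held to a looser requirement.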

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications