SCMM: Calibrating Cross-modal Representations for Text-Based Person Search
Jing Liu, Donglai Wei, Yang Liu, Sipeng Zhang, Tong Yang, Wei Zhou, Weiping Ding, Victor C. M. Leung

TL;DR
SCMM introduces a unified framework with novel calibration and masked modeling techniques to improve cross-modal representations for text-based person search, achieving state-of-the-art results.
Contribution
The paper proposes SCMM, a new method combining sew calibration and masked caption modeling to enhance fine-grained cross-modal alignment in person search.
Findings
Achieves state-of-the-art Rank-1 accuracy on CUHK-PEDES (73.81%)
Demonstrates effectiveness of each component through ablation studies
Balances representation quality and computational efficiency
Abstract
Text-Based Person Search (TBPS) aims to retrieve target person images from a large-scale gallery using natural language descriptions, posing fundamental challenges in cross-modal representation learning. Existing methods often struggle to bridge the semantic gap between heterogeneous modalities while capturing fine-grained correspondences essential for discriminating visually similar individuals. To address these challenges, we propose Sew Calibration and Masked Modeling (SCMM), a unified framework that calibrates cross-modal representations through complementary learning mechanisms. Notably, SCMM introduces two novel components: a sew calibration loss that dynamically aligns image-text features using quality-guided adaptive margins based on textual information density, and a masked caption modeling loss that establishes fine-grained cross-modal correspondences through transformer-based…
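The abstract describes the sew calibration loss only at a high level (image-text alignment with adaptive margins driven by textual information density) and does not give its formula. To make the idea of a per-caption margin concrete, here is a minimal illustrative sketch in NumPy. It is not the authors' loss: the function name `adaptive_margin_loss`, the use of token count as a proxy for information density, the linear margin schedule, and the `base_margin` / `max_margin` values are all assumptions chosen for illustration.

```python
import numpy as np

def adaptive_margin_loss(img, txt, token_counts, base_margin=0.1, max_margin=0.3):
    """Hinge-style image-text matching loss with a per-caption margin.

    Hypothetical sketch: the margin grows linearly with caption length,
    used here as a stand-in for "textual information density".
    Row i of `img` and row i of `txt` are assumed to be a matched pair.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T  # sim[i, j] = cos(image i, caption j)

    # Scale token counts to [0, 1], then map into [base_margin, max_margin].
    counts = np.asarray(token_counts, dtype=float)
    span = counts.max() - counts.min()
    density = (counts - counts.min()) / span if span > 0 else np.zeros_like(counts)
    margin = base_margin + (max_margin - base_margin) * density  # shape (N,)

    pos = np.diag(sim)  # similarities of the matched pairs

    # Text-to-image: caption j should rank its own image above the others
    # by at least margin[j].
    t2i = np.maximum(0.0, margin[None, :] + sim - pos[None, :])
    # Image-to-text: image i should rank its own caption above the others
    # by at least margin[i] (the margin of its paired caption).
    i2t = np.maximum(0.0, margin[:, None] + sim - pos[:, None])

    # Average the hinge terms over mismatched pairs only.
    off_diag = ~np.eye(len(sim), dtype=bool)
    return (t2i[off_diag].mean() + i2t[off_diag].mean()) / 2
```

In this sketch a longer caption demands a wider gap between its matched image and every other image, which is one plausible reading of "quality-guided adaptive margins"; the paper itself should be consulted for the actual definition.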
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
