Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification
Suncheng Xiang, Jingsheng Gao, Mengyuan Guan, Jiacheng Ruan, Chengfeng Zhou, Ting Liu, Dahong Qian, Yuzhuo Fu

TL;DR
This paper introduces MMET, a multi-modal transformer with dynamic masking, to improve the robustness and generalization of visual-semantic embeddings for person re-identification across different domains.
Contribution
The paper proposes a novel multi-modal transformer architecture with a dynamic masking strategy to enhance generalizable visual-semantic embedding learning for person Re-ID.
Findings
Outperforms previous methods on benchmark datasets
Demonstrates robustness in cross-domain scenarios
Achieves state-of-the-art results in generalizable person Re-ID
Abstract
Generalizable person re-identification (Re-ID) is a very active research topic in machine learning and computer vision, and it plays a significant role in realistic scenarios because of its applications in public security and video surveillance. However, previous methods mainly focus on visual representation learning while neglecting the potential of semantic features during training, which easily leads to poor generalization when a model is adapted to a new domain. In this paper, we propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning on visual, textual and visual-textual tasks respectively. To further enhance robust feature learning in the context of the transformer, a dynamic masking mechanism called the Masked Multimodal Modeling strategy (MMM) is introduced to mask both the image patches and the text tokens, which…
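The abstract is cut off before it explains how MMM selects what to mask, so the following is only a generic sketch of masking both modalities, not the paper's procedure. It assumes uniform random selection with fixed ratios; the function names, the `[MASK]` placeholder, the zero-vector patch replacement, and the ratio values are all hypothetical choices made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the paper's mask token is not stated here


def mask_sequence(items, mask_ratio, mask_value, rng):
    """Replace a random subset of `items` with `mask_value`.

    Returns a masked copy and the sorted indices that were masked.
    """
    n_mask = int(round(len(items) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(items)), n_mask))
    masked = list(items)  # copy so the caller's sequence is left untouched
    for i in masked_idx:
        masked[i] = mask_value
    return masked, masked_idx


def mask_multimodal(patches, tokens, patch_ratio=0.5, token_ratio=0.15, seed=0):
    """Mask image patches and text tokens independently.

    Both ratios are assumptions for illustration, not values from the paper.
    """
    rng = random.Random(seed)
    zero_patch = [0.0] * len(patches[0])  # masked patches become zero vectors
    masked_patches, patch_idx = mask_sequence(patches, patch_ratio, zero_patch, rng)
    masked_tokens, token_idx = mask_sequence(tokens, token_ratio, MASK_TOKEN, rng)
    return masked_patches, patch_idx, masked_tokens, token_idx
```

In a training loop, a model would typically be asked to reconstruct or predict the entries at the returned indices; whether MMET does this, and how its masking varies during training, is not recoverable from the truncated abstract.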
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Absolute Position Encodings · Residual Connection · Softmax
