Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification

Suncheng Xiang; Jingsheng Gao; Mengyuan Guan; Jiacheng Ruan; Chengfeng Zhou; Ting Liu; Dahong Qian; Yuzhuo Fu

arXiv:2304.09498·cs.CV·April 20, 2023·5 cites

Open Access · 1 Repo

TL;DR

This paper introduces MMET, a multi-modal transformer with dynamic masking, to improve the robustness and generalization of visual-semantic embeddings for person re-identification across different domains.

Contribution

The paper proposes a novel multi-modal transformer architecture with a dynamic masking strategy to enhance generalizable visual-semantic embedding learning for person Re-ID.

Findings

01 Outperforms previous methods on benchmark datasets

02 Demonstrates robustness in cross-domain scenarios

03 Achieves state-of-the-art results in generalizable person Re-ID

Abstract

Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision, and it plays a significant role in realistic scenarios owing to its applications in public security and video surveillance. However, previous methods mainly focus on visual representation learning while neglecting the potential of semantic features during training, which easily leads to poor generalization when a model is adapted to a new domain. In this paper, we propose a Multi-Modal Equivalent Transformer, called MMET, for more robust visual-semantic embedding learning on visual, textual, and visual-textual tasks, respectively. To further enhance robust feature learning in the transformer setting, a dynamic masking mechanism called the Masked Multimodal Modeling strategy (MMM) is introduced to mask both the image patches and the text tokens, which…
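The masking idea mentioned in the abstract (hiding both image patches and text tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `[MASK]` placeholder, the zero-vector patch mask, and the choice to interpret "dynamic" as resampling the mask ratio on every call are all assumptions made for illustration only.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder for a masked text token


def mask_sequence(items, ratio, make_mask, rng):
    """Replace a random subset of `items` with a mask value.

    Returns a masked copy and the sorted list of masked indices.
    `make_mask` is a zero-argument callable producing a fresh mask value.
    """
    n = len(items)
    k = min(n, max(0, round(n * ratio)))  # number of positions to mask
    masked_idx = sorted(rng.sample(range(n), k))
    out = list(items)
    for i in masked_idx:
        out[i] = make_mask()
    return out, masked_idx


def mask_multimodal(patches, tokens, ratio_range=(0.1, 0.5), seed=None):
    """Mask image patches and text tokens with independently sampled ratios.

    Assumption: "dynamic" is modeled as drawing a new ratio per call;
    the paper's actual schedule may differ.
    """
    rng = random.Random(seed)
    patch_ratio = rng.uniform(*ratio_range)
    token_ratio = rng.uniform(*ratio_range)
    dim = len(patches[0]) if patches else 0
    masked_patches, patch_idx = mask_sequence(
        patches, patch_ratio, lambda: [0.0] * dim, rng
    )
    masked_tokens, token_idx = mask_sequence(
        tokens, token_ratio, lambda: MASK_TOKEN, rng
    )
    return masked_patches, patch_idx, masked_tokens, token_idx
```

In a training loop, a function of this kind would be called once per batch so that the masked positions (and, under the assumption above, the ratios) change from step to step; the model would then be asked to reconstruct or otherwise account for the hidden patches and tokens.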

Code & Models

Repositories

jeremyxsc/mmet · PyTorch · Official

Taxonomy

Topics: Video Surveillance and Tracking Methods · Gait Recognition and Analysis · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Absolute Position Encodings · Residual Connection · Softmax