PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification
Bin Hu, Xinggang Wang, Wenyu Liu

TL;DR
PersonViT introduces a large-scale self-supervised Vision Transformer approach for person re-identification, effectively combining masked image modeling and contrastive learning to extract detailed local and global features, achieving state-of-the-art results.
Contribution
The paper presents a novel self-supervised ViT-based method, PersonViT, that effectively captures local and global features for person ReID without requiring extensive annotations.
Findings
Achieves state-of-the-art performance on multiple benchmark datasets.
Effectively combines masked image modeling with contrastive learning.
Demonstrates strong generalization and scalability in person ReID tasks.
Abstract
Person Re-Identification (ReID) aims to retrieve relevant individuals in non-overlapping camera images and has a wide range of applications in the field of public safety. In recent years, with the development of Vision Transformer (ViT) and self-supervised learning techniques, the performance of person ReID based on self-supervised pre-training has been greatly improved. Person ReID requires extracting highly discriminative local fine-grained features of the human body, while traditional ViT is good at extracting context-related global features, making it difficult to focus on local human body features. To this end, this article introduces the recently emerged Masked Image Modeling (MIM) self-supervised learning method into person ReID, and effectively extracts high-quality global and local features through large-scale unsupervised pre-training by combining masked image modeling and…
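The abstract describes combining masked image modeling with contrastive learning but, as quoted here, does not give the loss formulation. As a rough illustration only, the sketch below pairs a generic masked-patch reconstruction loss with a generic InfoNCE contrastive loss in NumPy. The function names, the mean-squared-error reconstruction target, the InfoNCE form, the temperature, and the `mim_weight` parameter are all assumptions made for illustration; they are not taken from the paper and may differ from what PersonViT actually uses.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * num_patches) patches set to True (masked)."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error over masked patches only (assumes at least one masked patch)."""
    return float(((pred[mask] - target[mask]) ** 2).mean())

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE: row i of z1 should match row i of z2 against all other rows."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def combined_loss(pred, target, mask, z1, z2, mim_weight=1.0):
    """Hypothetical weighted sum of the contrastive (global) and masked (local) terms."""
    return info_nce_loss(z1, z2) + mim_weight * masked_reconstruction_loss(pred, target, mask)

# Usage with random stand-in data: 128 patches of dimension 16, batch of 8 embeddings.
rng = np.random.default_rng(0)
mask = random_patch_mask(128, 0.5, rng)
target = rng.normal(size=(128, 16))
pred = target + 0.1 * rng.normal(size=(128, 16))
z1 = rng.normal(size=(8, 32))
z2 = z1 + 0.05 * rng.normal(size=(8, 32))
loss = combined_loss(pred, target, mask, z1, z2)
```

In this sketch the contrastive term acts on one embedding per image (a global signal), while the reconstruction term acts on individual masked patches (a local signal), which mirrors the global/local split the abstract emphasizes.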
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face Recognition and Analysis
Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Vision Transformer
