PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification

Bin Hu; Xinggang Wang; Wenyu Liu

arXiv:2408.05398·cs.CV·August 21, 2024

TL;DR

PersonViT introduces a large-scale self-supervised Vision Transformer approach for person re-identification, effectively combining masked image modeling and contrastive learning to extract detailed local and global features, achieving state-of-the-art results.

Contribution

The paper presents a novel self-supervised ViT-based method, PersonViT, that effectively captures local and global features for person ReID without requiring extensive annotations.

Findings

1. Achieves state-of-the-art performance on multiple benchmark datasets.
2. Effectively combines masked image modeling with contrastive learning.
3. Demonstrates strong generalization and scalability in person ReID tasks.

Abstract

Person Re-Identification (ReID) aims to retrieve relevant individuals in non-overlapping camera images and has a wide range of applications in the field of public safety. In recent years, with the development of Vision Transformer (ViT) and self-supervised learning techniques, the performance of person ReID based on self-supervised pre-training has been greatly improved. Person ReID requires extracting highly discriminative local fine-grained features of the human body, while traditional ViT is good at extracting context-related global features, making it difficult to focus on local human body features. To this end, this article introduces the recently emerged Masked Image Modeling (MIM) self-supervised learning method into person ReID, and effectively extracts high-quality global and local features through large-scale unsupervised pre-training by combining masked image modeling and…
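The abstract describes pre-training that combines masked image modeling with a contrastive objective. The sketch below illustrates that general idea only: it uses a generic pixel-regression MIM loss and an InfoNCE contrastive loss, which are assumptions chosen for illustration and may differ from the losses the paper actually uses. All function names, the `mask_ratio`, `temperature`, and `mim_weight` parameters, and the array shapes are hypothetical.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask with round(mask_ratio * num_patches) True entries."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error over masked patches only (assumes at least one is masked)."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))


def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE loss: row i of z1 is the positive for row i of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def combined_loss(pred, target, mask, z1, z2, mim_weight=1.0):
    """Weighted sum of the contrastive (global) and MIM (local) terms."""
    return info_nce_loss(z1, z2) + mim_weight * masked_reconstruction_loss(pred, target, mask)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical sizes: 16 patches per image, 8 values per patch, batch of 4 embeddings.
    mask = random_patch_mask(num_patches=16, mask_ratio=0.5, rng=rng)
    target = rng.normal(size=(16, 8))
    pred = target + 0.1 * rng.normal(size=(16, 8))
    z1 = rng.normal(size=(4, 32))
    z2 = z1 + 0.05 * rng.normal(size=(4, 32))
    print(combined_loss(pred, target, mask, z1, z2))
```

In this sketch the contrastive term acts on whole-image embeddings, while the reconstruction term is computed only on masked patches, which is one common way to pair a global objective with a local one.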

Code & Models

Repositories

hustvl/personvit (PyTorch, official)

Models

simoswish/Person_Search_PRW (Hugging Face model)

Taxonomy

Topics: Video Surveillance and Tracking Methods · Face recognition and analysis

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Vision Transformer