Spatiotemporal Transformer for Video-based Person Re-identification
Tianyu Zhang, Longhui Wei, Lingxi Xie, Zijie Zhuang, Yongfei Zhang, Bo Li, Qi Tian

TL;DR
This paper introduces a perception-constrained Spatiotemporal Transformer for video-based person re-identification, leveraging pre-training on synthesized data to improve cross-domain accuracy and address overfitting issues.
Contribution
The paper proposes a pipeline that pre-trains the model on synthesized video data and then transfers it to downstream domains using a perception-constrained Spatiotemporal Transformer (STT) module and a Global Transformer (GT) module.
Findings
Achieves significant accuracy improvements on MARS, DukeMTMC-VideoReID, and LS-VID datasets.
Effectively reduces overfitting in Transformer models for structured visual data.
Enhances cross-domain person re-identification performance.
Abstract
Recently, the Transformer module has been transplanted from natural language processing to computer vision. This paper applies the Transformer to video-based person re-identification, where the key issue is to extract the discriminative information from a tracklet. We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting, arguably due to a large number of attention parameters and insufficient training data. To solve this problem, we propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains with the perception-constrained Spatiotemporal Transformer (STT) module and Global Transformer (GT) module. The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks, MARS, DukeMTMC-VideoReID,…
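To make the idea of extracting one descriptor from a tracklet more concrete, here is a minimal sketch in plain Python. It is not the authors' STT or GT module: it has no learned projections, no perception constraints, and no pre-training. The split into a spatial step (attention across patches within a frame) followed by a temporal step (attention across frames) is an assumption made for illustration, and the names `self_attention`, `mean_pool`, and `tracklet_descriptor` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention.

    Each token serves as its own query, key, and value (an illustrative
    simplification; a real Transformer uses learned projections).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    d = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]


def tracklet_descriptor(tracklet):
    """tracklet: list of frames; each frame is a list of patch vectors."""
    # Spatial step (assumed): attend across patches within each frame.
    frame_vecs = [mean_pool(self_attention(frame)) for frame in tracklet]
    # Temporal step (assumed): attend across the per-frame vectors.
    return mean_pool(self_attention(frame_vecs))


# Toy tracklet: 3 frames, 2 patches per frame, 2-dimensional features.
tracklet = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
descriptor = tracklet_descriptor(tracklet)
print(len(descriptor))
```

Because every output here is a weighted average of the inputs, the descriptor stays within the range of the input features. A trained model would replace the identity query/key/value with learned linear layers, which is where the large number of attention parameters mentioned in the abstract comes from.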
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Gait Recognition and Analysis
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Attention Is All You Need · Dropout · Layer Normalization · Residual Connection · Label Smoothing
