Spatiotemporal Transformer for Video-based Person Re-identification

Tianyu Zhang; Longhui Wei; Lingxi Xie; Zijie Zhuang; Yongfei Zhang; Bo Li; Qi Tian

arXiv:2103.16469·cs.CV·March 31, 2021·30 cites

TL;DR

This paper introduces a perception-constrained Spatiotemporal Transformer for video-based person re-identification, leveraging pre-training on synthesized data to improve cross-domain accuracy and address overfitting issues.

Contribution

It proposes a novel pipeline with pre-training on synthetic data and a specialized Transformer architecture for better feature extraction in video re-identification.

Findings

1. Achieves significant accuracy improvements on MARS, DukeMTMC-VideoReID, and LS-VID datasets.

2. Effectively reduces overfitting in Transformer models for structured visual data.

3. Enhances cross-domain person re-identification performance.

Abstract

Recently, the Transformer module has been transplanted from natural language processing to computer vision. This paper applies the Transformer to video-based person re-identification, where the key issue is to extract the discriminative information from a tracklet. We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting, arguably due to a large number of attention parameters and insufficient training data. To solve this problem, we propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains with the perception-constrained Spatiotemporal Transformer (STT) module and Global Transformer (GT) module. The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks, MARS, DukeMTMC-VideoReID,…
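To make the "vanilla Transformer" baseline in the abstract concrete, here is a minimal sketch of scaled dot-product self-attention applied to every patch token of a tracklet at once. It is not the paper's STT or GT module. The function names (`self_attention`, `tracklet_descriptor`), the identity query/key/value projections, the mean pooling, and the toy feature values are all assumptions made for illustration. A real model would use learned projection matrices, which is where the large attention parameter count mentioned above comes from.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Assumption: no learned weights, so queries, keys, and values are the
    tokens themselves. Each output is a softmax-weighted mix of all tokens.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out


def tracklet_descriptor(tracklet):
    """Flatten frames into one token sequence, attend, then mean-pool.

    tracklet: list of frames, each frame a list of patch feature vectors.
    Mean pooling is a hypothetical choice, not taken from the paper.
    """
    tokens = [patch for frame in tracklet for patch in frame]
    attended = self_attention(tokens)
    d = len(attended[0])
    return [sum(t[j] for t in attended) / len(attended) for j in range(d)]


# Toy tracklet: 2 frames, 2 patches per frame, 3-dimensional features.
tracklet = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
]
descriptor = tracklet_descriptor(tracklet)
print(len(descriptor))
```

Because every token attends to every other token across all frames, the attention matrix grows quadratically with tracklet length. A module that restricts attention would address that cost, though the abstract does not spell out how the perception constraint works.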

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Gait Recognition and Analysis

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Attention Is All You Need · Dropout · Layer Normalization · Residual Connection · Label Smoothing