Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification
Ziyi Tang, Ruimao Zhang, Zhanglin Peng, Jinrui Chen, Liang Lin

TL;DR
This paper introduces a multi-stage Transformer architecture whose specialized modules extract complementary spatial and temporal features, which the authors report improves video person re-identification accuracy.
Contribution
The paper proposes MSTAT, a multi-stage Transformer with proxy embedding modules and temporal patch shuffling, enhancing local attribute and global identity feature extraction.
Findings
Achieves state-of-the-art accuracy on standard benchmarks.
Effectively extracts discriminative features from videos.
Improves robustness through temporal patch shuffling.
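The summary names temporal patch shuffling but does not spell out how it works. The sketch below is one plausible reading, offered as an assumption rather than the authors' implementation: for each spatial patch position, the patches are permuted across frames, so spatial layout is kept while temporal order is perturbed. The function name `temporal_patch_shuffle` and the list-of-lists clip layout are hypothetical, chosen only for illustration.

```python
import random


def temporal_patch_shuffle(clip, rng=None):
    """Permute patches across frames, independently per spatial position.

    ASSUMPTION: this interprets "temporal patch shuffling" as shuffling along
    the time axis at a fixed spatial location; the summary above does not
    confirm the exact procedure used in MSTAT.

    clip: list of T frames, each a list of N patch tokens (any objects).
    Returns a new clip of the same shape; the input is not modified.
    """
    rng = rng or random.Random()
    num_frames = len(clip)
    num_patches = len(clip[0])
    shuffled = [list(frame) for frame in clip]  # shallow copy of each frame
    for n in range(num_patches):
        # Draw a fresh frame order for this spatial position.
        order = list(range(num_frames))
        rng.shuffle(order)
        for t, src in enumerate(order):
            shuffled[t][n] = clip[src][n]
    return shuffled


# Toy clip: 3 frames x 4 patches, each patch tagged with (frame, position).
clip = [[(t, n) for n in range(4)] for t in range(3)]
out = temporal_patch_shuffle(clip, random.Random(0))
```

In this reading, every patch stays at its original spatial position and only the frame it appears in changes, which is why such an augmentation could discourage a model from relying on a fixed temporal ordering.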
Abstract
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all the stages for the final…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Gait Recognition and Analysis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Dropout
