Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification

Ziyi Tang; Ruimao Zhang; Zhanglin Peng; Jinrui Chen; Liang Lin

arXiv:2301.00531 · cs.CV · January 3, 2023

TL;DR

This paper introduces a novel multi-stage Transformer architecture with specialized modules for extracting comprehensive spatial and temporal features, significantly improving video person re-identification accuracy.

Contribution

The paper proposes MSTAT, a multi-stage Transformer with proxy embedding modules and temporal patch shuffling, enhancing local attribute and global identity feature extraction.

Findings

01 Achieves state-of-the-art accuracy on standard benchmarks.

02 Effectively extracts discriminative features from videos.

03 Improves robustness through temporal patch shuffling.
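
The summary above does not say how temporal patch shuffling is implemented. One plausible reading is a training-time augmentation that reorders image patches along the time axis while leaving their spatial positions fixed. The NumPy sketch below illustrates that reading only; the function name `temporal_patch_shuffle` and the parameters `patch_size` and `shuffle_ratio` are hypothetical and are not taken from the paper.

```python
import numpy as np


def temporal_patch_shuffle(clip, patch_size, shuffle_ratio=0.5, rng=None):
    """Permute patches across frames at randomly chosen spatial positions.

    Illustrative assumption, not the authors' implementation.
    clip: array of shape (T, H, W, C); H and W must be divisible by patch_size.
    shuffle_ratio: fraction of spatial patch positions whose frames are permuted.
    """
    rng = np.random.default_rng() if rng is None else rng
    t, h, w, _ = clip.shape
    if h % patch_size or w % patch_size:
        raise ValueError("H and W must be divisible by patch_size")

    grid_h, grid_w = h // patch_size, w // patch_size
    num_positions = grid_h * grid_w
    num_selected = int(round(shuffle_ratio * num_positions))
    selected = rng.choice(num_positions, size=num_selected, replace=False)

    out = clip.copy()
    for pos in selected:
        row, col = divmod(int(pos), grid_w)
        ys = slice(row * patch_size, (row + 1) * patch_size)
        xs = slice(col * patch_size, (col + 1) * patch_size)
        # Each selected position gets its own random frame order.
        order = rng.permutation(t)
        out[:, ys, xs, :] = clip[order, ys, xs, :]
    return out


# Usage: a toy clip of 4 frames, 8x8 pixels, 3 channels, split into 4x4 patches.
clip = np.arange(4 * 8 * 8 * 3).reshape(4, 8, 8, 3)
shuffled = temporal_patch_shuffle(
    clip, patch_size=4, shuffle_ratio=1.0, rng=np.random.default_rng(0)
)
print(shuffled.shape)  # (4, 8, 8, 3)
```

In this sketch every pixel keeps its spatial location and only the frame it comes from changes, so the appearance content of the clip is preserved while the temporal order is perturbed locally.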

Abstract

In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, both of which are critical for person re-identification. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address this issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all the stages for the final…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Gait Recognition and Analysis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Dropout