Not 3D Re-ID: a Simple Single Stream 2D Convolution for Robust Video Re-identification
Toby P. Breckon, Aishah Alsehaim

TL;DR
This paper demonstrates that a simple single stream 2D convolutional network using the ResNet50-IBN architecture can outperform complex 3D CNN-based methods in video person re-identification, achieving state-of-the-art results.
Contribution
The authors propose a straightforward 2D convolution approach with temporal attention that surpasses complex architectures for video Re-ID, emphasizing simplicity and efficiency.
Findings
Achieves 89.62% rank-1 accuracy on the MARS dataset.
Outperforms existing methods on the PRID2011 and iLIDS-VID datasets.
Uses transfer learning and simple averaging for video feature extraction.
Abstract
Video-based person re-identification has received increasing attention recently, as it plays an important role within surveillance video analysis. Video-based Re-ID is an expansion of earlier image-based re-identification methods by learning features from a video via multiple image frames for each person. Most contemporary video Re-ID methods utilise complex CNN-based network architectures using 3D convolution or multi-branch networks to extract spatial-temporal video features. By contrast, in this paper, we illustrate superior performance from a simple single stream 2D convolution network leveraging the ResNet50-IBN architecture to extract frame-level features followed by temporal attention for clip-level features. These clip-level features can be generalised to extract video-level features by averaging without any significant additional cost. Our approach uses best video Re-ID practice…
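The pooling pipeline described in the abstract (frame-level features, then temporal attention for a clip-level feature, then averaging for a video-level feature) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the form of the attention, so the dot-product scoring against a learned vector, and the names `temporal_attention_pool`, `video_feature`, and `attention_weights`, are assumptions made for this example. Frame features are taken as already extracted by a 2D backbone such as ResNet50-IBN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(frame_features, attention_weights):
    """Collapse T frame-level vectors into one clip-level vector.

    frame_features: list of T feature vectors, each of length D
        (assumed to come from a 2D CNN backbone applied per frame).
    attention_weights: hypothetical learned vector of length D used to
        score each frame; the paper's actual scoring may differ.
    """
    # One scalar score per frame (dot product with the learned vector).
    scores = [
        sum(f * w for f, w in zip(frame, attention_weights))
        for frame in frame_features
    ]
    # Normalise the scores across time so they sum to 1.
    alphas = softmax(scores)
    dim = len(frame_features[0])
    # Attention-weighted sum of the frame features.
    return [
        sum(a * frame[d] for a, frame in zip(alphas, frame_features))
        for d in range(dim)
    ]


def video_feature(clip_features):
    """Average clip-level vectors into a single video-level vector."""
    count = len(clip_features)
    dim = len(clip_features[0])
    return [sum(clip[d] for clip in clip_features) / count for d in range(dim)]


# Toy usage: two clips of two frames each, with 2-dimensional features.
weights = [0.5, -0.5]
clip_a = temporal_attention_pool([[1.0, 2.0], [3.0, 4.0]], weights)
clip_b = temporal_attention_pool([[0.0, 1.0], [2.0, 1.0]], weights)
video = video_feature([clip_a, clip_b])
```

Averaging at the video level adds no learned parameters, which is consistent with the abstract's remark that video-level features come without significant additional cost.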
Taxonomy
Methods: Convolution · 3D Convolution
