TL;DR
This paper introduces PiT, a transformer-based model for video-based pedestrian re-identification that uses multi-direction and multi-scale pyramids to improve fine-grained feature extraction, achieving state-of-the-art results on MARS and iLIDS-VID.
Contribution
It proposes a novel multi-direction and multi-scale pyramid structure within transformers to better capture fine-grained, part-informed features for pedestrian retrieval.
Findings
Achieves state-of-the-art performance on MARS and iLIDS-VID benchmarks.
Demonstrates the effectiveness of multi-directional and multi-scale pyramids through ablation studies.
Outperforms existing methods in video-based pedestrian re-identification.
Abstract
In video surveillance, pedestrian retrieval (also called person re-identification) is a critical task. It aims to retrieve the pedestrian of interest across non-overlapping cameras. Recently, transformer-based models have achieved significant progress on this task. However, these models still ignore fine-grained, part-informed information. This paper proposes a multi-direction and multi-scale Pyramid in Transformer (PiT) to solve this problem. In a transformer-based architecture, each pedestrian image is split into many patches. These patches are then fed to transformer layers to obtain the feature representation of the image. To explore fine-grained information, this paper proposes applying vertical division and horizontal division to these patches to generate human parts in different directions. These parts provide more fine-grained information. To fuse multi-scale…
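The division of patches into directional parts described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `split_into_parts`, the row-major patch order, and the equal-sized parts are all assumptions made for illustration. The axis names `"rows"` and `"cols"` are used on purpose, so the sketch does not take a position on which split the paper calls vertical and which horizontal.

```python
import numpy as np


def split_into_parts(tokens, grid_h, grid_w, num_parts, axis):
    """Group patch tokens into equal-sized parts along one grid axis.

    Hypothetical helper, not taken from the paper.
    tokens: array of shape (grid_h * grid_w, dim), assumed row-major.
    axis:   "rows" cuts the grid into bands of patch rows,
            "cols" cuts it into bands of patch columns.
    Returns a list of num_parts arrays, each of shape (n_tokens_in_part, dim).
    """
    if axis not in ("rows", "cols"):
        raise ValueError("axis must be 'rows' or 'cols'")
    if tokens.shape[0] != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")

    dim = tokens.shape[1]
    # Restore the 2-D layout of the patches so they can be cut spatially.
    grid = tokens.reshape(grid_h, grid_w, dim)

    np_axis = 0 if axis == "rows" else 1
    if grid.shape[np_axis] % num_parts != 0:
        raise ValueError("grid size must be divisible by num_parts")

    parts = np.split(grid, num_parts, axis=np_axis)
    # Flatten each part back into a token sequence.
    return [part.reshape(-1, dim) for part in parts]


# Toy example: a 4 x 2 patch grid with 3-dimensional tokens.
tokens = np.arange(4 * 2 * 3, dtype=np.float32).reshape(8, 3)
row_parts = split_into_parts(tokens, grid_h=4, grid_w=2, num_parts=2, axis="rows")
col_parts = split_into_parts(tokens, grid_h=4, grid_w=2, num_parts=2, axis="cols")
```

Each returned part is a shorter token sequence that could be passed to further layers. Calling the helper with several values of `num_parts` yields coarser or finer parts, which is one plausible reading of "multi-scale" here; the excerpt is truncated before it explains how PiT fuses them.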
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Multi-Head Attention · Dense Connections · Byte Pair Encoding · Dropout · Label Smoothing · Position-Wise Feed-Forward Layer
