Cascade Transformers for End-to-End Person Search
Rui Yu, Dawei Du, Rodney LaLonde, Daniel Davila, Christopher Funk, Anthony Hoogs, Brian Clipp

TL;DR
This paper introduces COAT, a three-stage cascade transformer for end-to-end person search. The first stage detects people, and the later stages progressively refine the features used for both detection and re-identification, which makes the model more robust to occlusions and to pose and scale variations.
Contribution
The paper presents a novel cascade occluded attention transformer that refines person search through multiple stages, incorporating occluded attention to improve robustness against occlusions and pose variations.
Findings
Achieves state-of-the-art performance on benchmark datasets.
Effectively handles occlusions and pose variations.
Demonstrates the benefit of multi-stage refinement in person search.
Abstract
The goal of person search is to localize a target person from a gallery set of scene images, which is extremely challenging due to large scale variations, pose/viewpoint changes, and occlusions. In this paper, we propose the Cascade Occluded Attention Transformer (COAT) for end-to-end person search. Our three-stage cascade design focuses on detecting people in the first stage, while later stages simultaneously and progressively refine the representation for person detection and re-identification. At each stage the occluded attention transformer applies tighter intersection over union thresholds, forcing the network to learn coarse-to-fine pose/scale invariant features. Meanwhile, we calculate each detection's occluded attention to differentiate a person's tokens from other people or the background. In this way, we simulate the effect of other objects occluding a person of interest at…
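The coarse-to-fine effect of tighter intersection over union (IoU) thresholds can be illustrated with a minimal sketch. It shows only how the set of proposals counted as positive shrinks as the threshold rises from stage to stage. The threshold values (0.5, 0.6, 0.7), the function names, and the boxes are assumptions chosen for illustration and are not taken from the abstract. In a real cascade each stage would also refine the boxes it passes on, which this sketch omits.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positives_per_stage(
    proposals: Sequence[Box],
    ground_truth: Box,
    thresholds: Sequence[float] = (0.5, 0.6, 0.7),  # assumed values
) -> List[List[bool]]:
    """For each stage, mark which proposals count as positives.

    A proposal is positive at a stage when its IoU with the ground-truth
    box meets that stage's threshold, so later stages keep fewer,
    better-aligned proposals.
    """
    return [[iou(p, ground_truth) >= t for p in proposals] for t in thresholds]


gt = (0.0, 0.0, 10.0, 10.0)
proposals = [
    (0.0, 0.0, 10.0, 10.0),    # exact match, IoU 1.0
    (0.0, 0.0, 10.0, 6.5),     # IoU 0.65
    (0.0, 0.0, 10.0, 5.5),     # IoU 0.55
    (20.0, 20.0, 30.0, 30.0),  # no overlap, IoU 0.0
]
stages = positives_per_stage(proposals, gt)
for stage_index, mask in enumerate(stages, start=1):
    print(f"stage {stage_index}: {sum(mask)} positive proposal(s)")
```

With these example boxes, the count of positive proposals drops at each stage, so the later stages are trained only on boxes that align more closely with the person.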
Taxonomy
Topics: Video Surveillance and Tracking Methods · Face recognition and analysis · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dense Connections · Dropout
