TrackFormer: Multi-Object Tracking with Transformers
Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer

TL;DR
TrackFormer is a novel end-to-end Transformer-based approach for multi-object tracking that uses attention mechanisms to associate objects across frames without additional motion or appearance modeling.
Contribution
It introduces a new tracking-by-attention paradigm with a set prediction framework for MOT, eliminating the need for graph optimization or explicit motion modeling.
Findings
Achieves state-of-the-art results on MOT17 and MOT20 datasets.
Effectively maintains object identities over sequences.
Simplifies multi-object tracking with a unified Transformer architecture.
Abstract
The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object queries and autoregressively follows existing tracks in space and time with the conceptually new and identity preserving track queries. Both query types benefit from self- and encoder-decoder attention on global frame-level features, thereby omitting any additional graph optimization or modeling of motion and/or appearance. TrackFormer introduces a new…
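The query bookkeeping described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Transformer decoder is replaced by a hypothetical `toy_decoder` that matches 1-D positions by distance instead of using attention, and the names `Query`, `step`, `track`, `new_thresh`, and `keep_thresh` are assumptions introduced only for this example. What the sketch does show is the frame-to-frame loop: static object queries may start new tracks, while track queries carry an identity forward and are dropped once their score falls below a threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Embedding = Tuple[float, ...]


@dataclass
class Query:
    embedding: Embedding
    track_id: Optional[int] = None  # None marks a static object query


# Hypothetical decoder interface: one (score, output embedding) per query.
Decoder = Callable[[List[float], List[Query]], List[Tuple[float, Embedding]]]


def step(frame, track_queries, object_queries, decoder, next_id,
         new_thresh=0.5, keep_thresh=0.5):
    """Process one frame; return next frame's track queries and the next free id."""
    queries = track_queries + object_queries
    outputs = decoder(frame, queries)
    survivors = []
    for query, (score, embedding) in zip(queries, outputs):
        if query.track_id is not None:
            # Existing track: keep its identity while the score stays high.
            if score >= keep_thresh:
                survivors.append(Query(embedding, query.track_id))
        elif score >= new_thresh:
            # Object query with a confident detection starts a new track.
            survivors.append(Query(embedding, next_id))
            next_id += 1
    return survivors, next_id


def toy_decoder(frame, queries):
    """Stand-in for the Transformer decoder (assumption, no attention involved).

    Track queries claim the nearest unclaimed position within distance 1.0;
    object queries then claim whatever positions are left over.
    """
    claimed = set()
    outputs = [(0.0, q.embedding) for q in queries]
    for i, q in enumerate(queries):
        if q.track_id is None:
            continue
        near = [j for j, pos in enumerate(frame)
                if j not in claimed and abs(pos - q.embedding[0]) <= 1.0]
        if near:
            best = min(near, key=lambda j: abs(frame[j] - q.embedding[0]))
            claimed.add(best)
            outputs[i] = (1.0, (frame[best],))
    free = [j for j in range(len(frame)) if j not in claimed]
    for i, q in enumerate(queries):
        if q.track_id is None and free:
            outputs[i] = (1.0, (frame[free.pop(0)],))
    return outputs


def track(frames, num_object_queries=3, decoder=toy_decoder):
    """Run the frame-to-frame loop; return (track_id, position) pairs per frame."""
    object_queries = [Query((0.0,)) for _ in range(num_object_queries)]
    track_queries, next_id, history = [], 0, []
    for frame in frames:
        track_queries, next_id = step(
            frame, track_queries, object_queries, decoder, next_id)
        history.append([(q.track_id, q.embedding[0]) for q in track_queries])
    return history


if __name__ == "__main__":
    # Two objects drift right; the second one disappears in the last frame.
    for frame_result in track([[0.0, 5.0], [0.5, 5.5], [1.0]]):
        print(frame_result)
```

In this toy run, identities 0 and 1 are created in the first frame, both are followed into the second frame, and identity 1 is dropped in the third frame because nothing lies near its last position. In the actual model the scores and output embeddings would come from the decoder's self- and encoder-decoder attention over frame-level features, which this sketch does not attempt to reproduce.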
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Data Stream Mining Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Convolution · Softmax · Dropout · Byte Pair Encoding · Dense Connections · Label Smoothing · Attention Is All You Need
