TrackFormer: Multi-Object Tracking with Transformers

Tim Meinhardt; Alexander Kirillov; Laura Leal-Taixé; Christoph Feichtenhofer

arXiv:2101.02702 · cs.CV · May 2, 2022 · 6 cites

TL;DR

TrackFormer is a novel end-to-end Transformer-based approach for multi-object tracking that uses attention mechanisms to associate objects across frames without additional motion or appearance modeling.

Contribution

It introduces a new tracking-by-attention paradigm with a set prediction framework for MOT, eliminating the need for graph optimization or explicit motion modeling.

Findings

01 Achieves state-of-the-art results on the MOT17 and MOT20 datasets.

02 Maintains object identities across video sequences through identity-preserving track queries.

03 Simplifies multi-object tracking into a single unified Transformer architecture.

Abstract

The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object queries and autoregressively follows existing tracks in space and time with the conceptually new and identity preserving track queries. Both query types benefit from self- and encoder-decoder attention on global frame-level features, thereby omitting any additional graph optimization or modeling of motion and/or appearance. TrackFormer introduces a new…
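The query bookkeeping described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Query`, `step`, `toy_decoder`, and both score thresholds are hypothetical, and the decoder is replaced by a toy lookup function that stands in for the real Transformer decoder. It shows only how static object queries could start new tracks while track queries carry an identity from frame to frame.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Embedding = Tuple[float, ...]


@dataclass
class Query:
    embedding: Embedding
    track_id: Optional[int] = None  # None marks a static object query


def toy_decoder(frame: Dict[Embedding, float],
                queries: Sequence[Query]) -> List[Tuple[Embedding, float]]:
    """Hypothetical stand-in for the Transformer decoder.

    `frame` maps an output embedding to a detection score. Track queries keep
    their embedding; object queries are shifted by 10. An object query that
    would duplicate an active track gets score 0, loosely imitating the
    duplicate suppression a real model would have to learn.
    """
    tracked = {q.embedding for q in queries if q.track_id is not None}
    outputs = []
    for q in queries:
        if q.track_id is not None:
            out = q.embedding
            score = frame.get(out, 0.0)
        else:
            out = tuple(x + 10.0 for x in q.embedding)
            score = 0.0 if out in tracked else frame.get(out, 0.0)
        outputs.append((out, score))
    return outputs


def step(frame: Dict[Embedding, float],
         object_queries: Sequence[Query],
         track_queries: Sequence[Query],
         next_id: Iterator[int],
         keep_thresh: float = 0.5,   # assumed threshold, not from the paper
         init_thresh: float = 0.5) -> List[Query]:
    """Process one frame and return the track queries for the next frame."""
    queries = list(track_queries) + list(object_queries)
    outputs = toy_decoder(frame, queries)
    next_tracks = []
    for q, (emb, score) in zip(queries, outputs):
        if q.track_id is not None:
            # Existing track: keep its identity while the score stays high.
            if score >= keep_thresh:
                next_tracks.append(Query(emb, q.track_id))
        elif score >= init_thresh:
            # Object query fired: initialize a new track with a fresh identity.
            next_tracks.append(Query(emb, next(next_id)))
    return next_tracks


def demo() -> List[List[int]]:
    object_queries = [Query((0.0,)), Query((1.0,)), Query((2.0,))]
    frames = [
        {(10.0,): 0.9, (11.0,): 0.8},               # two objects appear
        {(10.0,): 0.9, (11.0,): 0.2, (12.0,): 0.7}, # one fades, one appears
    ]
    next_id = count()
    tracks: List[Query] = []
    ids_per_frame = []
    for frame in frames:
        tracks = step(frame, object_queries, tracks, next_id)
        ids_per_frame.append([t.track_id for t in tracks])
    return ids_per_frame


if __name__ == "__main__":
    print(demo())
```

In this toy run the first frame starts two tracks; in the second frame one track is dropped because its score falls below the threshold, and a newly visible object receives a fresh identity. The output embedding of each surviving track becomes its track query for the next frame, which is the autoregressive part of the scheme.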

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Data Stream Mining Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Convolution · Softmax · Dropout · Byte Pair Encoding · Dense Connections · Label Smoothing · Attention Is All You Need