Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers
Tianyu Zhu, Markus Hiller, Mahsa Ehsanpour, Rongkai Ma, Tom Drummond, Ian Reid, Hamid Rezatofighi

TL;DR
This paper introduces MO3TR, an end-to-end Transformer-based multi-object tracking framework that effectively handles occlusions and long-term temporal dependencies without explicit data association, improving tracking accuracy.
Contribution
The paper presents MO3TR, a novel Transformer-based framework that encodes long-term temporal information and jointly estimates object states without explicit data association modules.
Findings
Achieves state-of-the-art or comparable results on multiple MOT benchmarks.
Effectively handles occlusions through long-term temporal embeddings.
Eliminates the need for explicit data association in multi-object tracking.
Abstract
Tracking a time-varying indefinite number of objects in a video sequence over time remains a challenge despite recent advances in the field. Most existing approaches are not able to properly handle multi-object tracking challenges such as occlusion, in part because they ignore long-term temporal information. To address these shortcomings, we present MO3TR: a truly end-to-end Transformer-based online multi-object tracking (MOT) framework that learns to handle occlusions, track initiation and termination without the need for an explicit data association module or any heuristics. MO3TR encodes object interactions into long-term temporal embeddings using a combination of spatial and temporal Transformers, and recursively uses the information jointly with the input data to estimate the states of all tracked objects over time. The spatial attention mechanism enables our framework to learn…
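The combination of temporal and spatial attention described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, and the helper names (`attend`, `temporal_embedding`, `spatial_mixing`) and the feature values are hypothetical. It omits the detection input, the recursive update, and track initiation and termination. Each track's history is first summarised into one embedding (temporal step), and the per-track embeddings then attend to one another (spatial step).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def temporal_embedding(history):
    """Summarise one track's history; the latest state acts as the query."""
    return attend(history[-1], history, history)


def spatial_mixing(embeddings):
    """Let every track embedding attend to all track embeddings in the frame."""
    return [attend(e, embeddings, embeddings) for e in embeddings]


# Toy data (made up): two tracks, three past time steps each, 2-D features.
histories = [
    [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]],
    [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
]
track_embeddings = [temporal_embedding(h) for h in histories]
mixed = spatial_mixing(track_embeddings)
```

Because attention outputs are convex combinations of the value vectors, each temporal embedding stays within the range spanned by that track's history. A real tracker would add learned query, key, and value projections and feed the result into a state estimator.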
Taxonomy
Topics: Video Surveillance and Tracking Methods · Image Enhancement Techniques · Air Quality Monitoring and Forecasting
