Future Object Detection with Spatiotemporal Transformers
Adam Tonderski, Joakim Johnander, Christoffer Petersson, and Kalle Åström

TL;DR
This paper introduces Future Object Detection, the task of predicting the bounding boxes of all visible objects in a future video frame. It tackles the task with an extended detection transformer that captures scene dynamics and ego-motion, and training requires only standard annotations on single (future) frames.
Contribution
The paper proposes an end-to-end detection transformer for future object detection. The model extends existing detection transformers with spatiotemporal processing and ego-motion integration, which improves prediction accuracy.
Findings
Achieves prediction accuracy comparable to an oracle for prediction horizons of up to 100 ms.
Outperforms baseline methods for longer prediction horizons.
Visualizations suggest emergent tracking within the model.
Abstract
We propose the task Future Object Detection, in which the goal is to predict the bounding boxes for all visible objects in a future video frame. While this task involves recognizing temporal and kinematic patterns, in addition to the semantic and geometric ones, it only requires annotations in the standard form for individual, single (future) frames, in contrast to expensive full sequence annotations. We propose to tackle this task with an end-to-end method, in which a detection transformer is trained to directly output the future objects. In order to make accurate predictions about the future, it is necessary to capture the dynamics in the scene, both object motion and the movement of the ego-camera. To this end, we extend existing detection transformers in two ways. First, we experiment with three different mechanisms that enable the network to spatiotemporally process multiple…
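The annotation requirement described in the abstract can be made concrete with a minimal sketch. It pairs a window of unlabeled past frames with the box labels of one annotated future frame. All names here (`FutureDetectionSample`, `build_sample`, `num_inputs`, `horizon_frames`) are hypothetical and do not come from the paper, and the prediction horizon is expressed in frames rather than milliseconds as a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class FutureDetectionSample:
    """Hypothetical container for one training example."""
    input_frame_indices: List[int]  # unlabeled past frames given to the model
    target_frame_index: int         # the single annotated future frame
    target_boxes: List[Box]         # standard per-frame detection labels


def build_sample(
    annotated_index: int,
    boxes: Sequence[Box],
    num_inputs: int,
    horizon_frames: int,
) -> FutureDetectionSample:
    """Pair a window of past frames with the labels of one future frame."""
    if num_inputs < 1 or horizon_frames < 1:
        raise ValueError("num_inputs and horizon_frames must be at least 1")
    # The last input frame lies `horizon_frames` before the annotated frame.
    last_input = annotated_index - horizon_frames
    first_input = last_input - num_inputs + 1
    if first_input < 0:
        raise ValueError("not enough past frames before the annotated frame")
    return FutureDetectionSample(
        input_frame_indices=list(range(first_input, last_input + 1)),
        target_frame_index=annotated_index,
        target_boxes=list(boxes),
    )


sample = build_sample(
    annotated_index=10,
    boxes=[(0.0, 0.0, 5.0, 5.0)],
    num_inputs=3,
    horizon_frames=2,
)
```

Only frame 10 carries labels in this example; the three input frames need no annotation, which is the property the abstract contrasts with full sequence annotations.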
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
