ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer
Eslam Mohamed, Ahmad El-Sallab

TL;DR
ST-DETR introduces a spatio-temporal transformer architecture for object detection in video sequences, leveraging full attention mechanisms and a novel temporal positional embedding to improve moving object detection accuracy.
Contribution
The paper presents a novel spatio-temporal transformer architecture with a new temporal positional embedding for enhanced object detection in videos.
Findings
Achieves a 5% mAP improvement on the KITTI MOD dataset.
Effectively models object traces over space and time.
Demonstrates the importance of temporal features in detection.
Abstract
We propose ST-DETR, a Spatio-Temporal Transformer-based architecture for object detection from a sequence of temporal frames. We treat the temporal frames as sequences in both space and time and employ full attention mechanisms to exploit feature correlations over both dimensions. This treatment lets us handle a frame sequence as traces of temporal object features over every location in space. We explore two possible approaches: early aggregation of spatial features over the temporal dimension, and late temporal aggregation of object-query spatial features. Moreover, we propose a novel Temporal Positional Embedding technique to encode the time sequence information. To evaluate our approach, we choose the Moving Object Detection (MOD) task, since it is a perfect candidate to showcase the importance of the temporal dimension. Results show a significant 5%…
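As a rough illustration of treating frames as one sequence over space and time, the sketch below flattens per-frame feature maps into a single token list and adds a frame-index embedding to every token. The abstract does not describe the form of the proposed Temporal Positional Embedding, so the standard sinusoidal encoding used here is only an assumed stand-in, and the names `sinusoidal_embedding` and `flatten_space_time` are hypothetical, not taken from the paper.

```python
import math


def sinusoidal_embedding(position, dim):
    """Encode one integer position into `dim` values with the standard
    sinusoidal scheme (an assumption; the paper's embedding may differ)."""
    emb = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        emb.append(math.sin(position * freq))
        if i + 1 < dim:
            emb.append(math.cos(position * freq))
    return emb


def flatten_space_time(features, dim):
    """Flatten features[t][y][x] (each a list of `dim` floats, standing in
    for a hypothetical backbone output) into T*H*W tokens, adding the
    embedding of the frame index t to every token of frame t."""
    tokens = []
    for t, frame in enumerate(features):
        t_emb = sinusoidal_embedding(t, dim)
        for row in frame:
            for feat in row:
                tokens.append([f + e for f, e in zip(feat, t_emb)])
    return tokens


# Toy input: 3 frames of 2x2 locations with 4-dimensional zero features.
T, H, W, D = 3, 2, 2, 4
features = [[[[0.0] * D for _ in range(W)] for _ in range(H)] for _ in range(T)]
tokens = flatten_space_time(features, D)
print(len(tokens), len(tokens[0]))
```

A full attention layer applied to `tokens` could then relate any location in any frame to any other, which is one way to read the "object traces over space and time" idea. A spatial positional embedding would normally be added as well and is omitted here for brevity.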
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
