ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer

Eslam Mohamed; Ahmad El-Sallab

arXiv:2107.05887 · cs.CV · July 27, 2021 · 1 cites

TL;DR

ST-DETR introduces a spatio-temporal transformer architecture for object detection in video sequences, leveraging full attention mechanisms and a novel temporal positional embedding to improve moving object detection accuracy.

Contribution

The paper presents a novel spatio-temporal transformer architecture with a new temporal positional embedding for enhanced object detection in videos.

Findings

1. Achieved a 5% mAP improvement on the KITTI MOD dataset.

2. Effectively models object traces over space and time.

3. Demonstrates the importance of temporal features in detection.

Abstract

We propose ST-DETR, a Spatio-Temporal Transformer-based architecture for object detection from a sequence of temporal frames. We treat the temporal frames as sequences in both space and time and employ full attention mechanisms to exploit feature correlations over both dimensions. This treatment lets us handle a frame sequence as traces of temporal object features at every spatial location. We explore two possible approaches: early aggregation of spatial features over the temporal dimension, and late temporal aggregation of object-query spatial features. Moreover, we propose a novel Temporal Positional Embedding technique to encode the time sequence information. To evaluate our approach, we choose the Moving Object Detection (MOD) task, since it is a perfect candidate to showcase the importance of the temporal dimension. Results show a significant 5%…
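
The idea of treating frames as one sequence over space and time can be illustrated with a toy sketch. The abstract does not specify how the Temporal Positional Embedding is built, so the sketch below assumes a standard sinusoidal encoding of the frame index as a stand-in, and it uses single-head self-attention with identity projections. The names `sinusoidal_embedding`, `flatten_with_time`, and `self_attention` are hypothetical, and spatial positional embeddings are omitted for brevity. It is not the authors' architecture.

```python
import math


def sinusoidal_embedding(position, dim):
    """Sinusoidal encoding of an integer position (assumed stand-in,
    not the paper's Temporal Positional Embedding)."""
    emb = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        emb.append(math.sin(position * freq))
        emb.append(math.cos(position * freq))
    return emb[:dim]


def flatten_with_time(frames):
    """Flatten T frames of N location features into T*N tokens,
    adding the frame-index embedding to every token of that frame."""
    tokens = []
    for t, frame in enumerate(frames):
        time_emb = sinusoidal_embedding(t, len(frame[0]))
        for feat in frame:
            tokens.append([f + e for f, e in zip(feat, time_emb)])
    return tokens


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity
    query/key/value projections (a simplification for illustration)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, tokens))
                    for j in range(dim)])
    return out


# Toy input: 3 frames, 2 spatial locations each, 4-dimensional features.
frames = [[[0.1 * (t + 1)] * 4 for _ in range(2)] for t in range(3)]
tokens = flatten_with_time(frames)   # 3 * 2 = 6 space-time tokens
attended = self_attention(tokens)    # every token attends to all 6 tokens
```

Because every token attends to tokens from all frames, each output mixes features of the same and other locations across time, which is the "trace" intuition in this simplified setting.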

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications