Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

Yifan Zhang; Zhiyu Zhu; Junhui Hou; Dapeng Wu

arXiv:2307.00347 · cs.CV · August 21, 2025

TL;DR

STEMD introduces a novel framework that enhances multi-frame 3D object detection by integrating spatial-temporal graph attention, previous frame outputs, and IoU regularization to improve accuracy and handle complex scenarios.

Contribution

The paper proposes a new end-to-end framework that combines graph attention, temporal information, and query regularization for improved multi-frame 3D detection.

Findings

01 Effective modeling of object interactions with graph attention.

02 Improved detection accuracy in challenging scenarios.

03 Minor computational overhead compared to existing methods.

Abstract

The Detection Transformer (DETR) has revolutionized the design of CNN-based object detection systems, showcasing impressive performance. However, its potential in the domain of multi-frame 3D object detection remains largely unexplored. In this paper, we present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection by addressing three key aspects specifically tailored for this task. First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network, which represents queries as nodes in a graph and enables effective modeling of object interactions within a social context. To solve the problem of missing hard cases in the proposed output of the encoder in the current frame, we incorporate the output of the previous frame to initialize the query input of the…
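The idea of representing queries as nodes in a graph can be illustrated with a minimal sketch. This is not the STEMD implementation: the function names (`build_edges`, `graph_attention`), the distance-based edge rule, the `radius` parameter, and the absence of learned projections are all assumptions made for illustration, and only the spatial side of the interaction is shown. Each query carries a box center and a feature vector; nodes closer than `radius` are linked, and each node then aggregates its neighbors' features with softmax-normalized dot-product attention.

```python
import math


def build_edges(centers, radius):
    """Link each query node to every node (itself included) within `radius`.

    The distance threshold is a hypothetical edge rule chosen for this sketch.
    """
    neighbors = []
    for cx, cy in centers:
        row = [j for j, (ox, oy) in enumerate(centers)
               if math.hypot(cx - ox, cy - oy) <= radius]
        neighbors.append(row)
    return neighbors


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_attention(features, neighbors):
    """One attention pass restricted to graph edges.

    Scores are scaled dot products between raw features; a real model
    would apply learned query/key/value projections first.
    """
    dim = len(features[0])
    updated = []
    for i, nbrs in enumerate(neighbors):
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(dim)
            for j in nbrs
        ]
        weights = softmax(scores)
        updated.append([
            sum(w * features[j][d] for w, j in zip(weights, nbrs))
            for d in range(dim)
        ])
    return updated


# Three object queries: two nearby objects and one far away (toy values).
centers = [(0.0, 0.0), (1.0, 0.0), (50.0, 50.0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

neighbors = build_edges(centers, radius=5.0)
updated = graph_attention(features, neighbors)
```

In this toy setup the two nearby queries exchange information while the distant one only attends to itself, which is the practical difference from dense self-attention over all queries. Extending the edge set to link queries across frames would be one way to cover the temporal side, though how STEMD does this is not described in the text above.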

Code & Models

Repositories

eaphan/stemd (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Residual Connection