DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos

Rajeev Yasarla; Shizhong Han; Hong Cai; Fatih Porikli

arXiv:2506.10242 · cs.CV · June 13, 2025

Open Access

TL;DR

DySS introduces a novel approach combining state-space learning and dynamic query updates to improve 3D object detection efficiency and accuracy from multi-camera videos in autonomous driving.

Contribution

The paper proposes a method that uses state-space models and dynamic queries to improve detection accuracy and inference speed in BEV 3D object detection.

Findings

01. Achieves 65.31 NDS and 57.4 mAP on the nuScenes test split.
02. Runs in real time at 33 FPS.
03. Outperforms previous state-of-the-art methods.

Abstract

Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. To encourage the model to capture the underlying motion and correspondence information, we train the SSM with two auxiliary tasks: future prediction and masked reconstruction. The state of the SSM then provides an informative yet efficient summarization of the scene.…
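
The idea of an SSM processing features sequentially over time steps can be illustrated with a generic linear recurrence. The sketch below is not the DySS architecture: it assumes a diagonal, element-wise update h_t = decay * h_{t-1} + gain * x_t, and the names ssm_scan, decay, and gain, as well as the feature values, are hypothetical and chosen only for illustration. It shows how a fixed-size state can summarize an arbitrary number of frames.

```python
from typing import List, Sequence


def ssm_scan(frames: Sequence[Sequence[float]],
             decay: Sequence[float],
             gain: Sequence[float]) -> List[float]:
    """Run a diagonal linear state-space recurrence over per-frame features.

    Element-wise update (an assumed, simplified form, not the paper's model):
        h_t = decay * h_{t-1} + gain * x_t

    Returns the final state, a fixed-size summary of all frames seen so far.
    """
    state = [0.0] * len(decay)
    for x in frames:
        if len(x) != len(state):
            raise ValueError("feature size must match state size")
        # Blend the previous state with the current frame's features.
        state = [a * h + b * xi for a, h, b, xi in zip(decay, state, gain, x)]
    return state


# Three time steps of 2-dimensional features (illustrative values only).
frames = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0]]
summary = ssm_scan(frames, decay=[0.5, 0.5], gain=[1.0, 1.0])
print(summary)
```

In this toy form the state has the same size no matter how many frames are scanned, so each additional frame adds a constant amount of work. Older frames are progressively down-weighted by the decay factor.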

Taxonomy

Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training