Sparse4D v3: Advancing End-to-End 3D Detection and Tracking
Xuewu Lin, Zixiang Pei, Tianwei Lin, Lichao Huang, Zhizhong Su

TL;DR
This paper enhances the Sparse4D framework for autonomous driving by introducing auxiliary training tasks and structural improvements, significantly boosting 3D detection and tracking performance on the nuScenes benchmark.
Contribution
The paper proposes two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and a decoupled attention structure, and extends the Sparse4D detector into a tracker by assigning instance IDs during inference.
Findings
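With a ResNet50 backbone, improved mAP, NDS, and AMOTA by 3.0%, 2.2%, and 7.6%, reaching 46.9%, 56.1%, and 49.0%, respectively
Extended the detector into a tracker by assigning instance IDs during inference
Validated the proposed improvements through extensive experiments on the nuScenes benchmark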
Abstract
In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance ID during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we witnessed enhancements of 3.0%, 2.2%, and 7.6% in mAP, NDS, and AMOTA, achieving 46.9%, 56.1%, and 49.0%, respectively. Our best model achieved 71.9%…
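The abstract only says that the tracker "assigns instance ID during inference" and does not give the rule here. The Python sketch below is one plausible reading, not the authors' implementation: it assumes each instance is propagated from frame to frame by the detector, that an instance receives a fresh ID the first time its confidence passes a threshold, and that it keeps that ID afterwards. The names `Instance` and `IdAssigner` and the threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Instance:
    """One detector instance (query) for the current frame. Hypothetical structure."""
    confidence: float
    track_id: Optional[int] = None  # None until an ID has been assigned


class IdAssigner:
    """Assigns persistent IDs to instances at inference time (illustrative sketch)."""

    def __init__(self, threshold: float = 0.5) -> None:
        # The threshold value is an assumption, not a number from the paper.
        self.threshold = threshold
        self._next_id = count()  # yields 0, 1, 2, ...

    def update(self, instances: List[Instance]) -> List[Instance]:
        """Return the instances reported as tracks for this frame."""
        tracked = []
        for inst in instances:
            if inst.confidence < self.threshold:
                continue  # not confident enough to report in this frame
            if inst.track_id is None:
                # First time this instance passes the threshold: give it a new ID.
                inst.track_id = next(self._next_id)
            tracked.append(inst)
        return tracked


# Usage: the same Instance objects stand in for instances propagated across frames.
assigner = IdAssigner(threshold=0.5)
car, clutter = Instance(confidence=0.9), Instance(confidence=0.1)
frame1 = assigner.update([car, clutter])  # only `car` is reported, with track_id 0

car.confidence = 0.7                      # same object seen again in the next frame
pedestrian = Instance(confidence=0.8)     # newly appearing object
frame2 = assigner.update([car, pedestrian])  # `car` keeps ID 0, `pedestrian` gets ID 1
```

Because identity is carried by the propagated instance itself, this sketch needs no separate association step between frames, which is presumably the advantage of query-based algorithms that the abstract alludes to.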
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
