Sparse4D v2: Recurrent Temporal Fusion with Sparse Model
Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, Zhizhong Su

TL;DR
Sparse4D v2 introduces a recurrent temporal fusion method for sparse perception that reduces the computational complexity of temporal fusion from O(T) to O(1) and integrates long-term information, achieving state-of-the-art results in 3D detection on nuScenes.
Contribution
The paper presents an improved version of Sparse4D whose recurrent temporal fusion module reduces complexity and enables long-term feature integration for better perception performance.
Findings
Reduces temporal fusion complexity from O(T) to O(1).
Achieves state-of-the-art results on nuScenes 3D detection.
Improves inference speed and memory efficiency.
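The O(T) to O(1) reduction listed above can be illustrated with a toy cost count. The following is a minimal sketch, not the paper's implementation: scalar values stand in for sparse instance features, plain averaging and a fixed blend weight (`keep`) stand in for the actual fusion, and the function names `multi_frame_step`, `recurrent_step`, and `per_step_cost` are hypothetical. It only shows why re-sampling every past frame costs work proportional to the number of frames, while carrying a fused state forward costs constant work per frame.

```python
from typing import List, Optional, Tuple


def multi_frame_step(window: List[float]) -> Tuple[float, int]:
    """Fuse by revisiting every frame in the window: cost grows with its length."""
    fused = sum(window) / len(window)  # assumption: averaging stands in for fusion
    return fused, len(window)          # one sampling operation per frame visited


def recurrent_step(
    state: Optional[float], frame: float, keep: float = 0.5
) -> Tuple[float, int]:
    """Fuse the current frame with the state carried over from the previous step."""
    if state is None:                  # first frame: nothing to carry over yet
        return frame, 1
    # assumption: a fixed blend weight stands in for the learned fusion
    return keep * state + (1.0 - keep) * frame, 1


def per_step_cost(frames: List[float]) -> Tuple[List[int], List[int]]:
    """Return the per-step operation counts of both strategies."""
    multi_costs: List[int] = []
    recurrent_costs: List[int] = []
    state: Optional[float] = None
    for t, frame in enumerate(frames):
        _, cost = multi_frame_step(frames[: t + 1])
        multi_costs.append(cost)
        state, cost = recurrent_step(state, frame)
        recurrent_costs.append(cost)
    return multi_costs, recurrent_costs


multi, recurrent = per_step_cost([1.0, 2.0, 3.0, 4.0])
print(multi)      # [1, 2, 3, 4]
print(recurrent)  # [1, 1, 1, 1]
```

In this toy setting the recurrent variant also needs to store only the single carried state rather than the whole frame history, which mirrors the memory benefit claimed in the findings.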
Abstract
Sparse algorithms offer great flexibility for multi-view temporal perception tasks. In this paper, we present an enhanced version of Sparse4D, in which we improve the temporal fusion module by implementing a recursive form of multi-frame feature sampling. By effectively decoupling image features and structured anchor features, Sparse4D enables a highly efficient transformation of temporal features, thereby facilitating temporal fusion solely through the frame-by-frame transmission of sparse features. The recurrent temporal fusion approach provides two main benefits. Firstly, it reduces the computational complexity of temporal fusion from O(T) to O(1), resulting in significant improvements in inference speed and memory usage. Secondly, it enables the fusion of long-term information, leading to more pronounced performance improvements due to temporal fusion. Our proposed approach,…
Taxonomy
Topics: Advanced Image Fusion Techniques · Photoacoustic and Ultrasonic Imaging · Image Enhancement Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
