Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning
Danish Nazir, Antoine Hanna-Asaad, Lucas Görnhardt, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt

TL;DR
This paper introduces a dynamic token selection and fine-tuning method for multi-view 3D object detection using ViT backbones, significantly reducing computation and parameters while improving accuracy.
Contribution
It proposes a dynamic layer-wise token selection mechanism and a parameter-efficient fine-tuning strategy, enhancing efficiency and performance over existing methods like ToC3D.
Findings
Reduces GFLOPs by 48% to 55%
Decreases inference latency by 9% to 25%
Improves detection accuracy and the nuScenes detection score (NDS)
Abstract
Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, which makes them computationally expensive. To address this problem, the current state-of-the-art (SOTA) method for efficient multi-view ViT-based 3D object detection, ToC3D, employs ego-motion-based relevant token selection. However, it has two key limitations: (1) The fixed layer-individual token selection ratios limit computational efficiency during both training and inference. (2) Full end-to-end retraining of the ViT backbone is required for the multi-view 3D object detection method. In this work, we propose an image token compensator combined with a token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike ToC3D, our approach enables dynamic layer-wise token selection within the ViT…
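The contrast between fixed layer-individual selection ratios and dynamic layer-wise selection can be illustrated with a minimal sketch. This is not the paper's method: the function names, the score values, and the threshold rule are all hypothetical, chosen only to show how a fixed ratio always keeps the same number of tokens per layer while a dynamic rule lets the count vary with the input.

```python
from typing import List, Sequence


def select_fixed_ratio(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Fixed-ratio baseline: keep the top `keep_ratio` fraction of tokens."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def select_dynamic(scores: Sequence[float], threshold: float) -> List[int]:
    """Hypothetical dynamic rule: keep every token scoring at least `threshold`."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    # Fall back to the single best token so a layer never gets an empty set.
    return kept or [max(range(len(scores)), key=lambda i: scores[i])]


# Made-up per-token importance scores for two layers (illustration only).
layer_scores = [
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],  # early layer: several relevant tokens
    [0.9, 0.1, 0.2, 0.2, 0.1, 0.3],  # deeper layer: one relevant token
]

for layer, scores in enumerate(layer_scores):
    fixed = select_fixed_ratio(scores, keep_ratio=0.5)
    dynamic = select_dynamic(scores, threshold=0.5)
    print(f"layer {layer}: fixed keeps {fixed}, dynamic keeps {dynamic}")
```

In this toy setup the fixed ratio keeps three tokens at both layers, whereas the threshold rule keeps three tokens at the first layer and only one at the second, which is where the compute savings of an input-dependent selection would come from.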
