Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

Danish Nazir; Antoine Hanna-Asaad; Lucas Görnhardt; Jan Piewek; Thorsten Bagdonat; Tim Fingscheidt

arXiv:2604.13586·cs.CV·April 16, 2026


TL;DR

This paper introduces a dynamic token selection and fine-tuning method for multi-view 3D object detection with ViT backbones. It reduces GFLOPs by 48% to 55%, inference latency by 9% to 25%, and the parameter count, while improving detection accuracy.

Contribution

The paper proposes a dynamic layer-wise token selection mechanism and a parameter-efficient fine-tuning strategy, improving both efficiency and detection performance over existing methods such as ToC3D.

Findings

01 Reduces GFLOPs by 48% to 55%

02 Decreases inference latency by 9% to 25%

03 Improves detection accuracy and the nuScenes detection score (NDS)

Abstract

Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, which makes them computationally expensive. To address this problem, ToC3D, the current state-of-the-art (SOTA) method for efficient multi-view ViT-based 3D object detection, employs ego-motion-based relevant token selection. However, it has two key limitations: (1) The fixed layer-individual token selection ratios limit computational efficiency during both training and inference. (2) Full end-to-end retraining of the ViT backbone is required for the multi-view 3D object detection method. In this work, we propose an image token compensator combined with token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike ToC3D, our approach enables dynamic layer-wise token selection within the ViT…
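The difference between a fixed per-layer selection ratio and a dynamic one can be illustrated with a minimal sketch. This is not the paper's mechanism, which the truncated abstract does not specify. It assumes each token already has a relevance score, and it uses a simple score threshold as a stand-in for dynamic selection. The function names `select_fixed_ratio` and `select_dynamic`, the `threshold` parameter, and the choice to simply drop unselected tokens are all hypothetical.

```python
from typing import List, Sequence


def select_fixed_ratio(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Keep a fixed fraction of tokens, regardless of how the scores are distributed.

    Illustrative only: mimics a fixed layer-individual selection ratio.
    """
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in their original order.
    return sorted(ranked[:n_keep])


def select_dynamic(scores: Sequence[float], threshold: float) -> List[int]:
    """Keep every token whose score reaches the threshold.

    Hypothetical stand-in for dynamic selection: the number of kept tokens
    now depends on the input instead of a preset ratio.
    """
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:
        # Always keep at least the single highest-scoring token.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    return kept
```

With scores `[0.9, 0.1, 0.05, 0.2, 0.8, 0.1]`, the fixed rule at `keep_ratio=0.5` keeps three tokens even though only two score highly, while the threshold rule at `0.5` keeps just those two. With scores that are mostly high, the threshold rule keeps more tokens than the fixed rule would. That input-dependent budget is the kind of saving a fixed ratio cannot provide.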
