Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Mingqian Ji, Shanshan Zhang, and Jian Yang

arXiv:2604.14563·cs.CV·April 17, 2026

TL;DR

SEPatch3D introduces a dynamic patch size adjustment framework for ViT-based 3D object detection, improving inference speed and efficiency while maintaining detection accuracy.

Contribution

The paper proposes a method that combines adaptive patch sizing, informative patch selection, and feature enhancement to address the limitations of existing token compression strategies.

Findings

01 Achieves up to 57% faster inference than StreamPETR.

02 Provides 20% higher efficiency compared to ToC3D-faster.

03 Maintains comparable detection accuracy on nuScenes and Argoverse 2.

Abstract

Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated…
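
The idea of choosing a patch size per scene can be illustrated with a small sketch. This is not the paper's SPSS module: the abstract is truncated and does not state the actual selection criterion, so the distance threshold, the patch sizes (16 and 32), the input resolution, and the function names below are all hypothetical. The sketch only shows why a larger patch size reduces the number of tokens a ViT has to process.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patches a ViT sees for one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch_size) * (width // patch_size)


def select_patch_size(nearest_object_m, near_threshold_m=20.0, small=16, large=32):
    """Toy rule: small patches when an object is near, large patches otherwise.

    nearest_object_m is None when no object is expected in the scene.
    The threshold and both patch sizes are hypothetical, not from the paper.
    """
    if nearest_object_m is not None and nearest_object_m < near_threshold_m:
        return small
    return large


if __name__ == "__main__":
    H, W = 320, 800  # hypothetical input resolution
    for distance in (5.0, 60.0, None):
        p = select_patch_size(distance)
        print(distance, p, num_tokens(H, W, p))
```

Under these assumptions, doubling the patch size cuts the token count by a factor of four, which is where the latency saving would come from. It is also why coarse patches risk losing fine-grained detail, the trade-off the abstract says the framework is designed to manage.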

Code & Models

Repositories

Mingqj/SEPatch3D
github
