TL;DR
Efficient3D is a unified visual token pruning framework for 3D multimodal large language models (MLLMs). It reduces inference cost while maintaining competitive accuracy across five 3D vision-language benchmarks.
Contribution
The paper proposes a Debiased Visual Token Importance Estimator (DVTIE) and an Adaptive Token Rebalancing (ATR) strategy for efficient 3D MLLMs.
Findings
Achieves a +2.57% CIDEr improvement on Scan2Cap.
Reduces inference overhead while maintaining accuracy.
Demonstrates effectiveness across five 3D vision-language benchmarks.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource-constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength…
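To make the idea of attention-based visual token pruning concrete, here is a minimal sketch of the generic technique: aggregate per-layer attention mass into one importance score per visual token, then keep the top-scoring fraction. This is not the paper's DVTIE or ATR. The function names, the `layer_weights` parameter (a hypothetical stand-in for letting shallow layers contribute to the aggregate), and the fixed `keep_ratio` (ATR is described as adjusting pruning strength dynamically) are all assumptions made for illustration.

```python
from typing import List, Sequence


def aggregate_importance(
    layer_attn: Sequence[Sequence[float]],
    layer_weights: Sequence[float],
) -> List[float]:
    """Weighted average of per-layer attention mass for each visual token.

    layer_attn[l][i] is the (assumed, precomputed) attention mass that
    visual token i receives at layer l. layer_weights is hypothetical.
    """
    if len(layer_attn) != len(layer_weights):
        raise ValueError("one weight per layer is required")
    total = sum(layer_weights)
    scores = [0.0] * len(layer_attn[0])
    for attn, weight in zip(layer_attn, layer_weights):
        for i, mass in enumerate(attn):
            scores[i] += (weight / total) * mass
    return scores


def prune_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order."""
    keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: 3 layers, 4 visual tokens, equal layer weights.
layer_attn = [
    [0.1, 0.6, 0.2, 0.1],  # shallow layer
    [0.4, 0.1, 0.4, 0.1],
    [0.5, 0.1, 0.3, 0.1],  # deep layer
]
scores = aggregate_importance(layer_attn, [1.0, 1.0, 1.0])
print(prune_tokens(scores, keep_ratio=0.5))  # prints [0, 2]
```

Changing `layer_weights` shifts which tokens survive. With all weight on the shallow layer in this toy input, tokens 1 and 2 are kept instead, which illustrates why the choice of how to aggregate across layers matters for pruning.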
