Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding
Wencan Huang, Daizong Liu, Wei Hu

TL;DR
Fast3D is a plug-and-play visual token pruning framework for 3D multi-modal large language models (MLLMs) that reduces computational cost while maintaining scene understanding performance.
Contribution
The paper presents a plug-and-play token pruning method tailored for 3D MLLMs that combines global attention prediction with sample-adaptive pruning and leaves the model's parameters unchanged.
Findings
Effective token pruning at high ratios across benchmarks
Significant reduction in computational complexity
Maintained scene understanding accuracy
Abstract
While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning…
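The general idea of attention-guided token pruning can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `prune_tokens` and `adaptive_prune`, the `keep_ratio` and `mass` parameters, and the cumulative-attention rule used for per-sample adaptation are all assumptions made for illustration. The sketch also assumes that a per-token importance score (for example, a predicted global attention value for each object token) is already available.

```python
import math


def prune_tokens(tokens, attention, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens ranked by attention score.

    Hypothetical helper: `attention` holds one importance score per token.
    """
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank indices by descending score, then restore the original token order.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


def adaptive_prune(tokens, attention, mass=0.9):
    """Keep the smallest set of top-scoring tokens covering `mass` of the total score.

    Assumed heuristic for illustration: the number of kept tokens varies per
    sample depending on how concentrated the attention scores are.
    """
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention scores must sum to a positive value")
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    kept, accumulated = [], 0.0
    for i in ranked:
        kept.append(i)
        accumulated += attention[i]
        if accumulated >= mass * total:
            break
    return [tokens[i] for i in sorted(kept)]


if __name__ == "__main__":
    # Toy scene with four object tokens and made-up attention scores.
    objects = ["chair", "table", "lamp", "sofa"]
    scores = [0.5, 0.25, 0.125, 0.125]
    print(prune_tokens(objects, scores, keep_ratio=0.5))
    print(adaptive_prune(objects, scores, mass=0.7))
```

With a fixed ratio, every scene keeps the same fraction of tokens. The cumulative-score variant keeps fewer tokens when a handful of objects dominate the scores and more when the scores are spread out, which is one simple way to make the pruning budget depend on the sample.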
