Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding

Wencan Huang; Daizong Liu; Wei Hu

arXiv:2507.09334·cs.CV·July 15, 2025

TL;DR

Fast3D is a plug-and-play visual token pruning framework for 3D multi-modal large language models (MLLMs). It significantly reduces computational cost while maintaining scene understanding performance.

Contribution

The paper presents a plug-and-play token pruning method tailored to 3D MLLMs. It combines global attention prediction with sample-adaptive pruning and leaves the parameters of the target model unchanged.
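
To make the idea of attention-guided, sample-adaptive pruning concrete, here is a minimal Python sketch. It is not the Fast3D implementation: this page does not describe how the paper predicts attention or chooses its per-sample budget, so the function names (`adaptive_keep_count`, `prune_tokens`), the `mass` threshold, and the "cover a fraction of total attention" rule are all illustrative assumptions. The sketch only shows the general pattern of scoring object tokens, picking a per-sample keep count, and dropping the rest without touching any model weights.

```python
from typing import List, Sequence


def adaptive_keep_count(scores: Sequence[float], mass: float = 0.85) -> int:
    """Hypothetical budget rule: the smallest k whose top-k scores
    cover at least `mass` of the total score for this sample."""
    total = sum(scores)
    if total <= 0:
        return len(scores)  # no usable signal: keep every token
    covered = 0.0
    for k, s in enumerate(sorted(scores, reverse=True), start=1):
        covered += s
        if covered / total >= mass:
            return k
    return len(scores)


def prune_tokens(tokens: Sequence, scores: Sequence[float], keep: int) -> List:
    """Keep the `keep` highest-scoring tokens, preserving original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore scene order for the language model
    return [tokens[i] for i in kept]


# Toy scene: object tokens with made-up importance scores (assumed values).
objects = ["chair", "table", "lamp", "sofa", "wall"]
attention = [50, 5, 30, 10, 5]

keep = adaptive_keep_count(attention)           # 3 tokens cover 90% of the mass
pruned = prune_tokens(objects, attention, keep)  # ['chair', 'lamp', 'sofa']
```

Under this assumed rule, a sample whose scores concentrate on a few objects keeps few tokens, while a sample with flat scores keeps most of them, which is the sense in which the budget adapts per sample.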

Findings

01 Effective token pruning at high ratios across benchmarks

02 Significant reduction in computational complexity

03 Maintained scene understanding accuracy

Abstract

While 3D Multi-modal Large Language Models (MLLMs) demonstrate remarkable scene understanding capabilities, their practical deployment faces critical challenges due to computational inefficiency. The key bottleneck stems from processing excessive object-centric visual tokens required for comprehensive 3D scene representation. Although visual token pruning has shown promise in accelerating 2D MLLMs, its applicability to 3D domains remains largely unexplored due to fundamental disparities in token structures. In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning…
