AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning

Kai Zhang; Xingyu Chen; Xiaofeng Zhang

arXiv:2505.12782 · cs.GR · May 20, 2025

Open Access

TL;DR

AdaToken-3D introduces a dynamic spatial token pruning method for 3D large multimodal models that cuts FLOPs by 63% and speeds up inference by 21% while preserving accuracy.

Contribution

This work presents an adaptive spatial token pruning framework for 3D multimodal models, together with a systematic analysis of their redundancy patterns, to improve inference efficiency.

Findings

01 Achieves 21% faster inference

02 Reduces FLOPs by 63% without accuracy loss

03 Over 60% of spatial tokens contribute minimally

Abstract

Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning, demonstrating remarkable capabilities in 3D scene understanding. However, current 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies: excessive computational overhead and redundant information flows. Unlike 2D VLMs processing single images, 3D LMMs exhibit inherent architectural redundancy due to the heterogeneous mechanisms between spatial tokens and visual tokens. To address this challenge, we propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens through spatial contribution analysis. Our method automatically tailors pruning strategies to different 3D LMM architectures by quantifying token-level information flows via attention pattern mining. Extensive experiments on LLaVA-3D (a 7B…
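The abstract above is truncated and does not spell out the scoring rule, so the following is only a minimal sketch of the general idea of attention-based token pruning, not the authors' implementation. It assumes a head-averaged attention matrix is available, scores each spatial token by the mean attention it receives, and keeps the highest-scoring fraction. The function names, the `keep_ratio` parameter, and its default of 0.4 (loosely motivated by the finding that over 60% of spatial tokens contribute minimally) are all hypothetical.

```python
from typing import List, Sequence


def spatial_contribution(attn: Sequence[Sequence[float]],
                         spatial_idx: Sequence[int]) -> List[float]:
    """Mean attention each spatial token receives across all query rows.

    attn[q][k] is assumed to be the head-averaged attention weight
    from query token q to key token k (an assumption of this sketch).
    """
    n_queries = len(attn)
    return [sum(row[k] for row in attn) / n_queries for k in spatial_idx]


def prune_spatial_tokens(attn: Sequence[Sequence[float]],
                         spatial_idx: Sequence[int],
                         keep_ratio: float = 0.4) -> List[int]:
    """Return the spatial token indices to keep, in their original order."""
    scores = spatial_contribution(attn, spatial_idx)
    # Always keep at least one token so downstream layers have some input.
    n_keep = max(1, round(keep_ratio * len(spatial_idx)))
    # Rank positions by score, highest first; the stable sort keeps the
    # earlier token when two scores tie.
    ranked = sorted(range(len(spatial_idx)), key=lambda i: -scores[i])
    kept_positions = sorted(ranked[:n_keep])
    return [spatial_idx[i] for i in kept_positions]


# Toy example: 2 query tokens attending over 5 keys; keys 0-3 are spatial.
attn = [
    [0.50, 0.05, 0.30, 0.05, 0.10],
    [0.40, 0.10, 0.35, 0.05, 0.10],
]
kept = prune_spatial_tokens(attn, spatial_idx=[0, 1, 2, 3], keep_ratio=0.5)
print(kept)  # [0, 2]
```

In this toy case, tokens 0 and 2 receive most of the attention mass and survive pruning. A fixed `keep_ratio` is the simplest choice; the abstract describes the pruning strategy as tailored automatically to each architecture, which this sketch does not attempt to reproduce.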

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Focus · Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings