Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Han Li, Zehao Huang, Jiahui Fu, Naiyan Wang, Si Liu

TL;DR
This paper introduces Geo3DPruner, a geometry-guided method for pruning visual tokens in 3D spatial videos that retains over 90% of original performance while removing 90% of the tokens.
Contribution
The paper proposes a novel geometry-aware pruning framework that models cross-frame relevance and performs a two-stage token selection to improve efficiency in 3D scene understanding.
Findings
Retains over 90% of original performance after pruning 90% of tokens.
Outperforms existing pruning methods on multiple 3D scene benchmarks.
Significantly reduces computational load in 3D video-language models.
Abstract
Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then…
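To make the 90% pruning budget concrete, here is a minimal sketch of generic score-based token pruning. It is not Geo3DPruner: it ignores geometry, cross-frame relevance, and spatial diversity, and simply keeps the highest-scoring tokens. The function name `prune_tokens`, the `keep_ratio` parameter, the random scores, and the frame and token counts are all hypothetical, chosen only for illustration.

```python
import math
import random


def prune_tokens(scores, keep_ratio=0.1):
    """Return sorted indices of the highest-scoring tokens.

    scores: one relevance score per visual token (hypothetical input).
    keep_ratio: fraction of tokens to keep; 0.1 corresponds to pruning 90%.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank token indices by score, highest first, and keep the top n_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:n_keep])


# Toy example: 8 frames x 196 tokens per frame (both numbers are made up).
random.seed(0)
num_frames, tokens_per_frame = 8, 196
scores = [random.random() for _ in range(num_frames * tokens_per_frame)]
kept = prune_tokens(scores, keep_ratio=0.1)
print(len(scores), "->", len(kept), "tokens")
```

A purely score-based selection like this can keep many near-duplicate tokens from overlapping views, which is the kind of inter-frame redundancy the abstract says a geometry-aware approach is meant to remove.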
