HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models
Liheng Zhang, Jin Wang, Hui Li, Bingfeng Zhang, Weifeng Liu

TL;DR
This paper introduces HCC-3D, a hierarchical compression method that reduces 3D token processing by about 98% in vision-language models, significantly improving efficiency while maintaining high performance.
Contribution
HCC-3D is a novel hierarchical compression framework that effectively reduces 3D tokens in vision-language models with minimal information loss.
Findings
Achieves approximately 98% token reduction.
Outperforms previous methods in efficiency and accuracy.
Maintains critical structural and detail information.
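To make the headline figure concrete, the short sketch below works through the arithmetic of a token reduction and its rough effect on self-attention cost. The token counts (500 3D tokens compressed to 10, plus 40 text tokens) are hypothetical values chosen only so the reduction comes out at 98%; they are not taken from the paper, and the quadratic cost model is a simplifying assumption about standard self-attention, not a measurement of HCC-3D.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Return the fraction of tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    if not 0 <= compressed_tokens <= original_tokens:
        raise ValueError("compressed_tokens must be in [0, original_tokens]")
    return 1.0 - compressed_tokens / original_tokens


def attention_cost_ratio(original_tokens: int,
                         compressed_tokens: int,
                         text_tokens: int) -> float:
    """Ratio of pairwise attention interactions after vs. before compression.

    Assumes self-attention cost grows with the square of the total
    sequence length (3D tokens + text tokens). This is a simplification.
    """
    before = (original_tokens + text_tokens) ** 2
    after = (compressed_tokens + text_tokens) ** 2
    return after / before


if __name__ == "__main__":
    # Hypothetical counts, not figures reported by the paper.
    n_3d, n_compressed, n_text = 500, 10, 40
    print(f"token reduction: {token_reduction(n_3d, n_compressed):.2%}")
    print(f"attention cost ratio: "
          f"{attention_cost_ratio(n_3d, n_compressed, n_text):.4f}")
```

Under these assumed counts, the fixed text tokens set a floor on sequence length, so the actual saving depends on how long the prompt is relative to the 3D tokens.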
Abstract
3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework incurs a high computational cost that limits its application; we identify the bottleneck as the processing of all 3D tokens in the Large Language Model (LLM). This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D), which efficiently compresses 3D tokens while retaining critical details. Specifically, we first propose a global structure compression (GSC), in which we design…
Taxonomy
Topics: 3D Shape Modeling and Analysis · Multimodal Machine Learning Applications · Advanced Neural Network Applications
