Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, Kurt Keutzer, Wenzhao Zheng

TL;DR
Proxy3D introduces compact 3D proxy representations for vision-language models, enhancing spatial reasoning and efficiency in 3D understanding from video inputs.
Contribution
The paper proposes a semantic clustering-based 3D proxy representation and a multi-stage training process to improve spatial reasoning in VLMs.
Findings
Achieves state-of-the-art results in 3D visual question answering.
Demonstrates improved spatial consistency over existing methods.
Efficiently utilizes video sequences for 3D scene understanding.
Abstract
Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan…
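The clustering step described in the abstract can be illustrated with a small sketch. The abstract does not say which clustering algorithm Proxy3D uses, so the code below substitutes plain k-means over concatenated semantic features and 3D positions as a stand-in. The function name `cluster_proxies`, its parameters (`num_proxies`, `geo_weight`, `num_iters`), and the random inputs are all hypothetical and are not taken from the paper.

```python
import numpy as np


def cluster_proxies(points, feats, num_proxies=8, num_iters=10, geo_weight=1.0, seed=0):
    """Group per-point scene features into a small set of 3D proxies.

    Assumption: plain k-means on [semantic features, weighted xyz] stands in
    for the paper's semantic-aware clustering, which the abstract does not detail.
    points: (N, 3) array of 3D positions; feats: (N, D) array of semantic features.
    """
    rng = np.random.default_rng(seed)
    # L2-normalise features so feature magnitude does not dominate the distance.
    sem = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    joint = np.concatenate([sem, geo_weight * points], axis=1)

    # Initialise cluster centres from distinct random points.
    centres = joint[rng.choice(len(joint), size=num_proxies, replace=False)]
    for _ in range(num_iters):
        # Squared Euclidean distance from every point to every centre.
        dists = ((joint[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_proxies):
            members = assign == k
            if members.any():
                centres[k] = joint[members].mean(axis=0)

    # Final assignment against the final centres; empty clusters are dropped
    # and the remaining cluster ids are renumbered 0..M-1.
    dists = ((joint[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    keep, assign = np.unique(dists.argmin(axis=1), return_inverse=True)

    # Each proxy is the mean position and mean feature of its member points.
    proxy_xyz = np.stack([points[assign == k].mean(axis=0) for k in range(len(keep))])
    proxy_feat = np.stack([feats[assign == k].mean(axis=0) for k in range(len(keep))])
    return proxy_xyz, proxy_feat, assign


# Toy usage with random data standing in for encoder outputs.
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(500, 3))
feats = rng.normal(size=(500, 16))
proxy_xyz, proxy_feat, assign = cluster_proxies(points, feats, num_proxies=8)
print(proxy_xyz.shape, proxy_feat.shape)
```

In this sketch the hypothetical `geo_weight` parameter trades off semantic similarity against spatial proximity: larger values make proxies more spatially compact, smaller values let semantically similar but distant points share a proxy.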
