PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan

TL;DR
PointAlign introduces a feature-level alignment regularization for 3D vision-language models, enhancing geometric-semantic preservation and improving classification and captioning performance with minimal additional computation.
Contribution
It proposes a novel feature-level supervision method that explicitly aligns intermediate 3D point cloud tokens with visual tokens, addressing geometric degradation in 3D VLMs.
Findings
Achieves a 2.08 percentage point improvement in classification accuracy.
Gains 7.50 percentage points on open-vocabulary Objaverse classification.
Improves 3D object captioning by 4.88 percentage points.
Abstract
The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only…
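As a rough illustration of the consistency loss described in the abstract, the sketch below penalizes the distance between each intermediate point cloud token and its corresponding visual input token, then adds that penalty to the next-token prediction loss. The abstract does not give the exact form of the loss, so everything specific here is an assumption: the use of cosine distance, the weighting factor `weight`, the function names, and the premise that both token sets already share the same count and dimension (a real model would likely need a projection layer).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    eps = 1e-8  # guards against division by zero for all-zero vectors
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def consistency_loss(intermediate_tokens, input_tokens):
    """Mean cosine distance over corresponding token pairs.

    ASSUMPTION: cosine distance is a stand-in; the source only says
    "a consistency loss" without specifying its form.
    """
    if not intermediate_tokens:
        raise ValueError("token lists must be non-empty")
    if len(intermediate_tokens) != len(input_tokens):
        raise ValueError("token lists must have the same length")
    distances = [
        1.0 - cosine_similarity(h, z)
        for h, z in zip(intermediate_tokens, input_tokens)
    ]
    return sum(distances) / len(distances)


def total_loss(next_token_loss, alignment_loss, weight=0.1):
    """Language-modeling loss plus the weighted alignment regularizer.

    ASSUMPTION: `weight` and its default value are hypothetical.
    """
    return next_token_loss + weight * alignment_loss


# Toy usage: two 2-D tokens; the second intermediate token has drifted.
input_tokens = [[1.0, 0.0], [0.0, 1.0]]
intermediate_tokens = [[1.0, 0.0], [1.0, 0.0]]
alignment = consistency_loss(intermediate_tokens, input_tokens)
print(total_loss(next_token_loss=2.0, alignment_loss=alignment))
```

In this toy case the first token pair is perfectly aligned and the second is orthogonal, so the regularizer only penalizes the token whose representation has moved away from its input.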
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · 3D Shape Modeling and Analysis
