Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
Jingkun Chen, Ruoshi Xu, Mingqi Gao, Shengda Luo, Jungong Han

TL;DR
This paper introduces a reinforcement learning framework for Point-Vision-Language Models (Point-VLMs) that improves 3D structural accuracy and physical consistency by splitting supervision into field-specific signals and enforcing geometric constraints.
Contribution
It proposes Geometric Reward Credit Assignment and a Reprojection-Consistency term to enhance 3D spatial reasoning and physical plausibility in point-based vision-language models.
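To make the idea of a reprojection-consistency check concrete, here is a minimal sketch. It does not reproduce the paper's term, whose exact form is not given in this summary. It assumes a pinhole camera, an axis-aligned 3D box expressed in the camera frame with positive depth, and a score defined as the 2D IoU between the projected box and a 2D box; the names `project_box` and `iou_2d` and the intrinsics values are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple

Box2D = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def project_box(box_min: Sequence[float], box_max: Sequence[float],
                fx: float, fy: float, cx: float, cy: float) -> Box2D:
    """Project the 8 corners of an axis-aligned 3D box (camera frame, z > 0)
    with an assumed pinhole model and return the enclosing 2D rectangle."""
    us, vs = [], []
    for x, y, z in product(*zip(box_min, box_max)):  # all 8 corners
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    return (min(us), min(vs), max(us), max(vs))


def iou_2d(a: Box2D, b: Box2D) -> float:
    """Intersection over union of two axis-aligned 2D rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: a 2 x 2 x 2 box centred 3 units in front of the camera.
projected = project_box((-1.0, -1.0, 2.0), (1.0, 1.0, 4.0),
                        fx=100.0, fy=100.0, cx=100.0, cy=100.0)
full_agreement = iou_2d(projected, (50.0, 50.0, 150.0, 150.0))
half_agreement = iou_2d(projected, (50.0, 50.0, 100.0, 150.0))
```

Under these assumptions, a predicted 3D box whose projection matches the observed 2D box scores near 1, while a box that contradicts the 2D evidence scores lower, which is the kind of signal a consistency term could reward.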
Findings
Boosted 3D KPA from 0.64 to 0.93
Increased 3D bounding box IoU to 0.686
Raised reprojection consistency to 0.852
Abstract
Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a…
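The contrast between a broadcast sequence-level reward and field-specific routing, as described in the abstract, can be illustrated with a small sketch. This is an assumption-laden toy, not the authors' implementation: the field names `keypoints` and `bbox_3d`, the span layout, and the reward values are hypothetical, and tokens outside any field span are simply given zero signal here.

```python
from typing import Dict, List, Tuple


def broadcast_advantages(num_tokens: int, sequence_reward: float) -> List[float]:
    """Baseline: every token receives the same sequence-level signal."""
    return [sequence_reward] * num_tokens


def routed_advantages(num_tokens: int,
                      field_spans: Dict[str, Tuple[int, int]],
                      field_rewards: Dict[str, float]) -> List[float]:
    """Route each field's reward only to the tokens in [start, end) that
    produced that field; all other tokens get 0.0 (an assumption)."""
    advantages = [0.0] * num_tokens
    for field, (start, end) in field_spans.items():
        for t in range(start, end):
            advantages[t] = field_rewards[field]
    return advantages


# Hypothetical 10-token response with two structured fields.
spans = {"keypoints": (2, 5), "bbox_3d": (6, 9)}
rewards = {"keypoints": 0.9, "bbox_3d": -0.4}

broadcast = broadcast_advantages(10, sum(rewards.values()) / len(rewards))
routed = routed_advantages(10, spans, rewards)
```

In the broadcast case the good keypoint prediction and the poor box prediction are averaged into one number shared by all tokens; in the routed case each span sees only the reward it is responsible for, which is the intuition behind credit assignment at the field level.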
