3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
Ting Huang, Zeyu Zhang, Hao Tang

TL;DR
3D-R1 is a new foundation model that significantly improves reasoning and generalization in 3D vision-language tasks by using synthetic data, reinforcement learning, and adaptive view selection.
Contribution
The paper introduces 3D-R1, a novel 3D VLM that incorporates synthetic dataset creation, RLHF training, and dynamic view selection to enhance reasoning capabilities.
Findings
Achieves 10% average improvement on 3D scene benchmarks.
Effectively enhances reasoning and generalization in 3D scene understanding.
Demonstrates robustness across various 3D VLM tasks.
Abstract
Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic…
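As a rough illustration of the GRPO step mentioned in the abstract, the sketch below combines several reward components into one scalar per sampled response and then standardizes each reward against its own sampling group. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the equal default weights, the example scores, and the use of a population standard deviation are all hypothetical. Only the `perception` and `semantic` component names come from the abstract, whose reward list is cut off above.

```python
from statistics import mean, pstdev


def total_reward(components, weights=None):
    """Weighted sum of named reward components for one sampled response."""
    weights = weights or {}  # assumption: any unlisted component gets weight 1.0
    return sum(weights.get(name, 1.0) * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for three responses sampled for the same prompt.
group = [
    {"perception": 0.2, "semantic": 0.5},
    {"perception": 0.6, "semantic": 0.7},
    {"perception": 0.9, "semantic": 0.8},
]
rewards = [total_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so the policy update depends on relative quality within the group rather than on a separately learned value model.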
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. 3D-R1 demonstrates clear performance improvements over prior SOTA methods across various demanding 3D reasoning benchmarks.
2. The paper provides detailed descriptions of the model architecture and experimental setup. The supplementary material, including the provided code, further enhances the potential for reproducibility.
1. The overall approach (CoT data collection followed by GRPO) appears somewhat conventional for enhancing VLM reasoning, and the data collection method is straightforward.
2. The "Experiments" section primarily focuses on setup and metrics, lacking in-depth summarization and analysis of the quantitative results. The rich ablation studies and qualitative results in the Appendix also suffer from a similar lack of interpretation.
3. The paper's layout is inefficient, as several important component
* The proposed method is conceptually simple and clearly motivated.
* The paper is well-written and easy to follow.
* Extensive experiments demonstrate strong performance, achieving state-of-the-art results on several standard 3D scene benchmarks.
* Lack of Related Work Discussion: The paper lacks a comprehensive related works section that situates this research within the broader context of 3D-VLM and reasoning-based approaches. A deeper comparison with existing 3D reasoning or view selection methods would strengthen the positioning of this work.
* Insufficient Methodological Details:
  * In Section 2.2 (CoT Data Engine), the authors mention using “a pre-trained 3D VLM that produces a concise textual summary of the scene.” Howev
1. The paper is clearly written and easy to follow.
2. The paper achieves performance improvements over current models.
3. The experiments show that the proposed modules, e.g., the RL rewards and view selection, effectively improve model performance.
1. Recent papers, such as VG-LLM [1] and 3DRS [2], have not been cited or compared.
2. In terms of technical novelty, the proposed GRPO algorithm is a straightforward extension of the original, but lacks clear innovation compared to recent variants like Visual-RFT [3].
3. While the paper introduces multiple encoders for visual feature extraction, it does not provide FLOPs or inference speed comparisons. Additionally, most prior methods employ only a single encoder.
[1] Learning from Videos fo
1. The integration of reinforcement learning (GRPO) into 3D vision-language training is essential. The use of multi-reward signals to align reasoning, perception, and semantic accuracy is a clear conceptual advancement.
2. The Scene-30K dataset, generated with Gemini 2.5 Pro and structured CoT reasoning, is a valuable resource for promoting step-by-step spatial reasoning in 3D.
3. The experiments are comprehensive and detailed across diverse tasks.
1. The discussion of related work is insufficient. Several recent and more advanced 3D multimodal LLMs, such as **Inst3D-LMM (CVPR 2025)** and **Video-3D LLM (CVPR 2025)**, are not discussed or compared. A more comprehensive review and comparison would strengthen the paper.
2. The writing quality, as well as the presentation and layout of figures and tables, still have considerable room for improvement.
3. The paper lacks ablation studies to validate the generalization ability and effectiveness of
Taxonomy
Topics: Advanced Neural Network Applications · Industrial Vision Systems and Defect Detection · Advanced Vision and Imaging
