Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan

TL;DR
Robin3D advances 3D large language models by utilizing a novel robust instruction generation approach, significantly improving their discriminative power and generalization across multiple benchmarks without task-specific fine-tuning.
Contribution
The paper introduces Robin3D, a new 3DLLM trained on large-scale, robust instruction-following data generated by a novel data engine, enhancing model understanding and performance.
Findings
Achieved 7.8% improvement in grounding task (Multi3DRefer)
Achieved 6.9% improvement in captioning task (Scan2Cap)
Outperforms previous methods across five benchmarks
Abstract
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, Robust Instruction Generation (RIG) engine. RIG generates two key instruction data: 1) the Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model's discriminative understanding. 2) the Diverse Instruction-following data, which contains various instruction styles to enhance model's generalization. As a result, we construct 1 million instruction-following data, consisting of 344K Adversarial…
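To illustrate what "mixed negative and positive samples" could look like in practice, here is a minimal sketch, not the authors' RIG engine. The function name, the `<OBJxxx>` identifier format, the refusal string, and the 50/50 mixing ratio are all assumptions made for illustration.

```python
import random


def build_adversarial_sample(scene_objects, distractor_pool, rng):
    """Build one grounding-style sample that may or may not have a valid answer.

    scene_objects: dict mapping object ID -> category present in the scene.
    distractor_pool: candidate categories; those absent from the scene
        are used for negative queries (hypothetical setup).
    """
    present = sorted(set(scene_objects.values()))
    absent = [c for c in distractor_pool if c not in present]
    # Choose a negative query (category not in the scene) about half the time,
    # or always when the scene has no objects to ask about.
    if absent and (not present or rng.random() < 0.5):
        category = rng.choice(absent)
    else:
        category = rng.choice(present)
    matches = sorted(oid for oid, cat in scene_objects.items() if cat == category)
    if matches:
        answer = ", ".join(f"<OBJ{oid:03d}>" for oid in matches)
    else:
        # Assumed refusal text: the model should learn to reject the query.
        answer = "No matching object."
    return {
        "instruction": f"Find every {category} in the scene.",
        "answer": answer,
        "is_negative": not matches,
    }


rng = random.Random(0)
scene = {1: "chair", 2: "table", 3: "chair"}
sample = build_adversarial_sample(scene, ["piano", "chair"], rng)
```

Training on both kinds of sample forces the model to check whether the queried object actually exists rather than always returning some object ID.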
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
• The paper introduces a large-scale 3D scene-instruction dataset that includes diverse instruction types, integrating varied instruction styles, existing benchmark instructions, and challenging adversarial instructions, enhancing the model’s robustness and generalization.
• It proposes novel architectures that effectively leverage both 2D and 3D object-centric features, enabling richer spatial understanding and stronger object-grounding capabilities in complex 3D environments.
• Relies on off-the-shelf 3D instance segmentation models trained on ScanNet with closed-set categories. I recommend that the authors consider Segment3D [1] for open-vocabulary, class-agnostic segmentation.
• Cropping instance-level point clouds and applying an object-level 3D point cloud CLIP (Uni3D) can limit the receptive field and be computationally heavy. I recommend that the authors try a scene-level CLIP (OpenScene [2], RegionPLC [3]) and then crop the output features.
• Table 1 reports only trad
1. Reasonable motivation - Expands existing ScanNet 3D text annotations through the data engine. 2. Strong experimental results - Demonstrates excellent performance. 3. Clear and complete paper writing.
I have some questions about this paper that need further discussion. Please see them below. If the authors can address my concerns, I am willing to raise my score.
1. The paper targets the challenging problem of 3D LLMs for grounding and captioning tasks. 2. To address the problem, the paper presents a robust instruction generation engine and 1M instruction-following samples. 3. The paper obtains promising experimental results on five 3D multimodal learning benchmarks.
1. Will the 1M 3D instruction dataset be released to the public? The main contribution of the paper lies in the dataset, so whether it will be released is important for evaluating the paper's contribution. 2. The dataset seems to be designed specifically for 3D indoor environments. How well do the dataset and the model generalize to outdoor environments, such as 3D street scenes? 3. Is it possible to provide an ablation study on different of training exam
1. The RIG engine's capability to generate adversarial and diverse instruction data significantly enhances the robustness and generalizability of 3DLLMs. The innovative proposal of adversarial data may help mitigate the hallucination tendencies of large models. The collection of diverse instructions, expanded by GPT to enrich the diversity of expressions, may alleviate the issue of rigid model outputs. 2. The integration of RAP and IFB modules improves the model's spatial understanding and objec
1. The modules lack novelty: RAP uses linear layers to separately connect the 3D features from the scene, the individual object 3D features, and the positional features, followed by concatenation. A possible baseline (Chat-Scene) employs the exact same encoders, using linear layers to connect 3D features and positional features, and then concatenating individual object 3D features. The only modification is the interchange of inputs to the linear layers. Sim
1. This paper constructs a large instruction-following fine-tuning dataset containing adversarial and diverse samples. 2. The zero-shot performance improvement of the trained Robin3D appears evident across various benchmarks and the ablation experiments clearly demonstrate the gains of different designs in the paper. 3. The writing of the article is fluent and easy to understand.
1. The related work section lacks clarity on the novelty and advantages of the RAP and IFB modules in comparison to existing studies. (1) Explain how object IDs are linked to object features in previous research and discuss the benefits of wrapping these features with identical ID tokens before and after them. (2) Describe how earlier studies extract and utilize 3D and 2D features, and highlight the advantages of introducing Mask3D information using RAP. 2. How will the relative proportions of
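Point (1) of the review above refers to wrapping each object's features with identical ID tokens before and after them. A minimal sketch of that sequence layout follows, assuming a hypothetical `<OBJxxx>` token format and treating features as opaque placeholders; it is not the paper's implementation.

```python
def wrap_with_id_tokens(object_features):
    """Place the same ID token on both sides of each object's feature.

    object_features: list of per-object feature placeholders (any type).
    Returns a flat sequence of the form
    [<OBJ000>, feat0, <OBJ000>, <OBJ001>, feat1, <OBJ001>, ...].
    """
    sequence = []
    for idx, feat in enumerate(object_features):
        token = f"<OBJ{idx:03d}>"  # hypothetical token naming scheme
        sequence.extend([token, feat, token])
    return sequence


layout = wrap_with_id_tokens(["feat_a", "feat_b"])
```

In this layout every feature is adjacent to its ID on both sides, which is the property the reviewer asks the authors to justify against prior ways of linking IDs to features.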
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
