Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring
Siyuan Wei, Chunjie Wang, Xiao Liu, Xiaosheng Yan, Zhishan Zhou, Rui Huang

TL;DR
Disc3D introduces an automated pipeline that creates high-quality, unambiguous 3D scene-dialogue datasets by combining rule-based methods with large language models, significantly reducing annotation costs.
Contribution
The paper presents a fully automated, scalable pipeline for generating high-quality 3D dialogue data that resolves viewpoint ambiguity and object referring ambiguity without human intervention.
Findings
Training with Disc3D improves benchmark performance
Produces over 2 million diverse 3D dialogue samples
Enhances 3D MLLMs across multiple tasks
Abstract
3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with…
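To make the notion of object referring ambiguity concrete, here is a minimal sketch of a check for whether a description is exclusive to its target, meaning it matches the target and none of the distractors. This is an illustration only, not the authors' pipeline: the attribute-dictionary scene representation, the attribute keys, and the function names `is_discriminative` and `minimal_discriminative_description` are all assumptions made for this example.

```python
from itertools import combinations


def matches(obj, description):
    """Return True if the object has every attribute value in the description."""
    return all(obj.get(key) == value for key, value in description.items())


def is_discriminative(description, target, scene):
    """A description is discriminative if it matches the target and nothing else."""
    matched = [obj for obj in scene if matches(obj, description)]
    return len(matched) == 1 and matched[0] is target


def minimal_discriminative_description(target, scene,
                                       keys=("category", "color", "material")):
    """Search attribute subsets, smallest first, for one that isolates the target.

    Returns None when no subset of the given keys separates the target
    from the distractors (the reference stays ambiguous).
    """
    for size in range(1, len(keys) + 1):
        for subset in combinations(keys, size):
            description = {k: target[k] for k in subset if k in target}
            if len(description) == size and is_discriminative(description, target, scene):
                return description
    return None


# Hypothetical toy scene: two chairs and a table (assumed data, not from the paper).
chair_a = {"category": "chair", "color": "red", "material": "wood"}
chair_b = {"category": "chair", "color": "blue", "material": "wood"}
table = {"category": "table", "color": "red", "material": "wood"}
scene = [chair_a, chair_b, table]

ambiguous = is_discriminative({"category": "chair"}, chair_a, scene)
chair_description = minimal_discriminative_description(chair_a, scene)
table_description = minimal_discriminative_description(table, scene)
```

In this toy scene, "the chair" alone matches two objects and is therefore non-exclusive, while adding the color isolates `chair_a`. A real pipeline would also need spatial relations and a fixed viewpoint, which this sketch does not model.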
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
