From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs

Ang Cao; Sergio Arnaud; Oleksandr Maksymets; Jianing Yang; Ayush Jain; Sriram Yenamandra; Ada Martin; Vincent-Pierre Berges; Paul McVay; Ruslan Partsey; Aravind Rajeswaran; Franziska Meier; Justin Johnson; Jeong Joon Park; Alexander Sax

arXiv:2502.20389·cs.CV·June 10, 2025

TL;DR

LIFT-GS introduces a differentiable rendering-based distillation method that leverages 2D foundation models to improve 3D vision-language grounding without requiring 3D annotations, achieving state-of-the-art results.

Contribution

The paper proposes a render-supervised distillation approach that bridges 3D and 2D models, so that 3D vision-language models can be trained effectively even when little annotated 3D data is available.

Findings

01. Achieves 25.7% mAP on open-vocabulary instance segmentation.

02. Demonstrates 10-30% improvements on referential grounding tasks.

03. Pretraining doubles effective dataset size, showing strong scaling.

Abstract

3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes, a six-order-of-magnitude gap that severely limits performance. We introduce LIFT-GS, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7% mAP on open-vocabulary instance segmentation (vs. 20.2% …
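
The render-supervised idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the Gaussians have already been projected to the image plane, uses isotropic 2D footprints, and replaces the 2D foundation-model output with a hand-written pseudo-label mask. All names (`render_mask`, `bce_loss`, the dictionary keys) are hypothetical. It only shows how per-Gaussian mask values could be alpha-composited into a 2D view and scored against a 2D mask, so that no 3D annotation enters the loss.

```python
import math

def render_mask(gaussians, width, height):
    """Alpha-composite per-Gaussian mask values into a 2D image.

    Each Gaussian is a dict with hypothetical keys:
      'xy'      : projected 2D center in pixels (camera projection assumed done)
      'depth'   : distance from the camera, used for front-to-back ordering
      'sigma'   : isotropic 2D footprint in pixels
      'opacity' : value in [0, 1]
      'mask'    : predicted language-conditioned mask probability in [0, 1]
    """
    ordered = sorted(gaussians, key=lambda g: g["depth"])
    image = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            transmittance = 1.0  # fraction of the ray not yet absorbed
            value = 0.0
            for g in ordered:
                dx, dy = x - g["xy"][0], y - g["xy"][1]
                falloff = math.exp(-(dx * dx + dy * dy) / (2.0 * g["sigma"] ** 2))
                alpha = g["opacity"] * falloff
                value += transmittance * alpha * g["mask"]
                transmittance *= 1.0 - alpha
            image[y][x] = value
    return image

def bce_loss(rendered, target, eps=1e-6):
    """Mean binary cross-entropy between a rendered mask and a 2D pseudo-label."""
    total, count = 0.0, 0
    for rendered_row, target_row in zip(rendered, target):
        for p, t in zip(rendered_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count

def make_scene(mask_near, mask_far):
    """Two toy Gaussians; only their predicted mask values differ between runs."""
    return [
        {"xy": (1.0, 1.0), "depth": 1.0, "sigma": 1.0, "opacity": 0.9, "mask": mask_near},
        {"xy": (4.0, 4.0), "depth": 2.0, "sigma": 1.0, "opacity": 0.9, "mask": mask_far},
    ]

WIDTH = HEIGHT = 6
# Stand-in for a 2D pseudo-label: the queried object covers the top-left 3x3 pixels.
target = [[1.0 if (x <= 2 and y <= 2) else 0.0 for x in range(WIDTH)] for y in range(HEIGHT)]

good_render = render_mask(make_scene(1.0, 0.0), WIDTH, HEIGHT)
bad_render = render_mask(make_scene(0.0, 1.0), WIDTH, HEIGHT)
loss_good = bce_loss(good_render, target)
loss_bad = bce_loss(bad_render, target)
print(f"loss (mask on correct Gaussian): {loss_good:.3f}")
print(f"loss (mask on wrong Gaussian):   {loss_bad:.3f}")
```

In a real training setup the compositing step would run inside a differentiable rasterizer with automatic differentiation, so the 2D loss could be backpropagated to the network that predicts the Gaussians; the pure-Python loops here only make the forward computation explicit.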

Videos

From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis