LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency

Zhongbin Guo; Jiahe Liu; Wenyu Gao; Yushan Li; Chengzhi Li; Ping Jian

arXiv:2512.01008 · cs.CV · December 2, 2025

TL;DR

LISA-3D adapts the instruction-following segmentation model LISA to produce masks that agree across viewpoints, then passes them to a frozen SAM-3D reconstructor. It improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting only 11.6M parameters.

Contribution

The paper presents LISA-3D, a two-stage method that adapts language-image segmentation models for 3D tasks using geometry-aware LoRA layers and a differentiable reprojection loss, enabling zero-shot 3D content creation.
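As a rough illustration of the low-rank adaptation idea mentioned above, the sketch below shows a plain LoRA linear layer in NumPy. It is not the paper's implementation: the source does not say what makes its LoRA layers geometry-aware, which layers are adapted, or what rank is used, so the class name `LoRALinear` and the parameters `rank`, `alpha`, and `seed` are assumptions chosen for illustration.

```python
import numpy as np


class LoRALinear:
    """Plain LoRA sketch: frozen weight W plus a trainable low-rank update B @ A.

    Hypothetical helper for illustration; not taken from the LISA-3D code.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                          # frozen pretrained weight, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, (rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))              # trainable up-projection, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, d_in); the low-rank path adds scale * x A^T B^T.
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        # Only A and B would be updated during adaptation.
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and the number of trainable values per layer is `rank * (d_in + d_out)` rather than `d_in * d_out`. This is the general mechanism that keeps an adaptation budget small.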

Findings

01. Up to +15.6 points improvement in language-to-3D accuracy over single-view baselines on ScanRefer and Nr3D.

02. Efficient adaptation: only 11.6M parameters are adapted.

03. Supports zero-shot deployment on unseen categories.

Abstract

Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reusing a frozen SAM-3D reconstructor. During training we exploit off-the-shelf RGB-D sequences and their camera poses to build a differentiable reprojection loss that enforces cross-view agreement without requiring any additional 3D-text supervision. The resulting masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which outputs Gaussian splats or textured meshes without retraining. Across ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting…
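The cross-view agreement idea in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' loss: it assumes a pinhole camera with a shared intrinsic matrix `K`, 4x4 camera-to-world poses, a mean squared error between masks, and nearest-neighbour sampling with no occlusion check. A training version would use bilinear sampling in an autodiff framework so that gradients reach the mask predictor. All function names (`backproject`, `warp_mask`, `reprojection_loss`, `make_rgba_prompt`) are hypothetical.

```python
import numpy as np


def backproject(depth, K, cam_to_world):
    """Lift every pixel of a depth map to world-space points, shape (H, W, 3)."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]                      # pixel rows (v) and columns (u)
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    return (pts_cam @ cam_to_world.T)[..., :3]


def warp_mask(mask_src, depth_tgt, K, pose_src, pose_tgt):
    """For each target pixel, look up the source-view mask at its reprojection."""
    H, W = depth_tgt.shape
    pts_world = backproject(depth_tgt, K, pose_tgt)
    pts_h = np.concatenate([pts_world, np.ones((H, W, 1))], axis=-1)
    pts_src = (pts_h @ np.linalg.inv(pose_src).T)[..., :3]  # world -> source camera
    z = pts_src[..., 2]
    safe_z = np.where(z > 1e-6, z, 1.0)            # avoid division by zero
    u = np.rint(K[0, 0] * pts_src[..., 0] / safe_z + K[0, 2]).astype(int)
    v = np.rint(K[1, 1] * pts_src[..., 1] / safe_z + K[1, 2]).astype(int)
    # Keep pixels with valid depth that land in front of, and inside, the source image.
    valid = (depth_tgt > 0) & (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    warped = np.zeros((H, W), dtype=float)
    warped[valid] = mask_src[v[valid], u[valid]]
    return warped, valid


def reprojection_loss(mask_src, mask_tgt, depth_tgt, K, pose_src, pose_tgt):
    """Mean squared disagreement between the target mask and the warped source mask."""
    warped, valid = warp_mask(mask_src, depth_tgt, K, pose_src, pose_tgt)
    if not valid.any():
        return 0.0
    return float(np.mean((warped[valid] - mask_tgt[valid]) ** 2))


def make_rgba_prompt(rgb, mask):
    """Append a soft mask in [0, 1] as an 8-bit alpha channel to a uint8 RGB image."""
    alpha = (np.clip(mask, 0.0, 1.0) * 255).astype(np.uint8)
    return np.dstack([rgb, alpha])
```

With identical poses and identical masks the loss is zero; it grows when the two views label the same 3D point differently, which is the signal that needs only depth and poses rather than 3D-text annotations. `make_rgba_prompt` mirrors the step where the mask is concatenated with the RGB image; the exact prompt format SAM-3D expects is not specified in the source.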

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis