Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement

Ashish Malik; Caleb Lowe; Aayam Shrestha; Stefan Lee; Fuxin Li; Alan Fern

arXiv:2603.23676·cs.AI·March 26, 2026

TL;DR

This paper introduces RAMP-3D, a mask-based reactive planner that uses 3D grounding to carry out long-horizon box rearrangement from natural-language goals and visual observations, outperforming 2D vision-language model baselines.

Contribution

The paper extends 3D grounding models and proposes RAMP-3D, a novel reactive planning approach using paired 3D masks for sequential decision-making in complex environments.

Findings

01. RAMP-3D achieves a 79.5% success rate on long-horizon tasks.

02. It significantly outperforms 2D vision-language model (VLM) baselines.

03. The results demonstrate the effectiveness of mask-based reactive policies for 3D planning.

Abstract

We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a "which-object" mask indicating what to pick and a…
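The reactive formulation in the abstract (observe, predict a pair of 3D masks, act, repeat) can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `MaskPair`, `reactive_rollout`, the callables `observe`, `predict`, and `execute`, and the use of `None` as a stop signal are all hypothetical names and conventions chosen for the sketch. The abstract is cut off before it describes the second mask, so the sketch treats it as an opaque `second_mask` and makes no claim about its meaning.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float, float]
Mask = List[bool]  # one flag per 3D point in the current observation


@dataclass
class MaskPair:
    which_object: Mask  # points belonging to the object to pick
    second_mask: Mask   # the paired mask; its meaning is truncated in the abstract


def reactive_rollout(
    observe: Callable[[], List[Point]],
    predict: Callable[[List[Point], str], Optional[MaskPair]],
    execute: Callable[[MaskPair], None],
    goal: str,
    max_steps: int = 50,
) -> int:
    """Run the observe -> predict -> execute loop; return the number of actions taken."""
    for step in range(max_steps):
        points = observe()            # fresh observation at every step (reactive)
        pair = predict(points, goal)  # no stored plan: decide from the current scene only
        if pair is None:              # assumed stop signal: nothing left to do
            return step
        if len(pair.which_object) != len(points) or len(pair.second_mask) != len(points):
            raise ValueError("each mask must have one entry per observed point")
        execute(pair)                 # hand the mask pair to a low-level controller
    return max_steps
```

The point of the design is that no multi-step plan is stored between iterations: each decision is recomputed from the latest observation, so the loop can recover if an earlier action did not have the intended effect.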

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis