Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement
Ashish Malik, Caleb Lowe, Aayam Shrestha, Stefan Lee, Fuxin Li, Alan Fern

TL;DR
This paper introduces RAMP-3D, a mask-based reactive planning system that uses 3D grounding to perform long-horizon box rearrangement tasks from natural language and visual observations, outperforming 2D vision-language model baselines.
Contribution
The paper extends 3D grounding models and proposes RAMP-3D, a novel reactive planning approach using paired 3D masks for sequential decision-making in complex environments.
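To make the idea of reactive, mask-based planning concrete, the following is a minimal sketch and not the authors' implementation. It assumes a mask is one boolean flag per 3D point, and that the planner re-observes the scene and predicts a fresh mask pair at every step rather than committing to a full action sequence. All names here (`MaskPair`, `reactive_rollout`, `demo`, and the toy scene) are hypothetical. The abstract names a "which-object" mask; the role of the second mask is not spelled out in the text available here, so it is left as an unspecified placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Mask = List[bool]  # assumption: one flag per 3D point in the current observation


@dataclass
class MaskPair:
    which_object: Mask  # points belonging to the object to act on
    second: Mask        # the paired mask; its meaning is left open in this sketch


def reactive_rollout(
    observe: Callable[[], List[int]],
    predict: Callable[[List[int]], Optional[MaskPair]],
    execute: Callable[[MaskPair], None],
    max_steps: int = 20,
) -> int:
    """Observe, predict one mask pair, act, and repeat until the predictor stops."""
    steps = 0
    for _ in range(max_steps):
        points = observe()        # fresh observation at every step
        pair = predict(points)    # one mask pair per step, no long plan
        if pair is None:          # predictor signals that nothing is left to do
            break
        execute(pair)
        steps += 1
    return steps


def demo() -> int:
    # Toy scene: each "point" carries the id of the box it belongs to.
    scene = [0, 0, 1, 1, 1, 2]
    pending = [0, 1, 2]  # boxes still to be handled (toy stand-in for a goal)

    def observe() -> List[int]:
        return list(scene)

    def predict(points: List[int]) -> Optional[MaskPair]:
        if not pending:
            return None
        target = pending[0]
        which = [p == target for p in points]
        placeholder = [False] * len(points)  # second mask left unspecified
        return MaskPair(which, placeholder)

    def execute(pair: MaskPair) -> None:
        pending.pop(0)  # toy effect: the selected box counts as handled

    return reactive_rollout(observe, predict, execute)
```

In this toy setup, `demo()` runs one step per pending box and stops once the predictor returns `None`. A real system would replace `predict` with a learned 3D grounding model and `execute` with a robot or simulator action.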
Findings
RAMP-3D achieves a 79.5% success rate on long-horizon tasks.
It significantly outperforms 2D vision-language model baselines.
The approach demonstrates the effectiveness of mask-based reactive policies for 3D planning.
Abstract
We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a "which-object" mask indicating what to pick and a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis
