Spatially Grounded Long-Horizon Task Planning in the Wild

Sehun Jung; HyunJee Song; Dong-Hee Kim; Reuben Tan; Jianfeng Gao; Yong Jae Lee; Donghyun Kim

arXiv:2603.13433·cs.RO·March 17, 2026

Open Access

TL;DR

This paper introduces GroundedPlanBench, a benchmark for evaluating spatially grounded long-horizon task planning in robotics, and proposes V2GP, a data generation framework that uses real robot videos to improve planning and grounding.

Contribution

The paper presents a novel benchmark for spatially grounded planning and a data generation framework that improves robot manipulation planning using real-world videos.

Findings

01 Spatially grounded planning is a major bottleneck for current VLMs.

02 V2GP improves both action planning and spatial grounding performance.

03 The approach is validated on the benchmark and in real-world robot experiments.

Abstract

Recent advances in robot manipulation increasingly leverage Vision-Language Models (VLMs) for high-level reasoning, such as decomposing task instructions into sequential action plans expressed in natural language that guide downstream low-level motor execution. However, current benchmarks do not assess whether these plans are spatially executable, particularly in specifying the exact spatial locations where the robot should interact to execute the plan, limiting evaluation of real-world manipulation capability. To bridge this gap, we define a novel task of grounded planning and introduce GroundedPlanBench, a newly curated benchmark for spatially grounded long-horizon action planning in the wild. GroundedPlanBench jointly evaluates hierarchical sub-action planning and spatial action grounding (where to act), enabling systematic assessment of whether generated sub-actions are spatially…
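
To make the idea of a spatially grounded plan concrete, the sketch below pairs each natural-language sub-action with a location in the image and checks whether that location falls inside a target region. Everything here is an assumption made for illustration: the `GroundedStep` type, the pixel-coordinate points, the box format, and the `grounding_accuracy` score are hypothetical and are not taken from GroundedPlanBench or V2GP, whose actual data format and metrics are not described in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedStep:
    """One sub-action plus an assumed (x, y) pixel location where to act."""
    action: str
    point: Tuple[float, float]


def point_in_box(point: Tuple[float, float], box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(plan: List[GroundedStep], targets: List[Box]) -> float:
    """Fraction of steps whose point lands inside the matching target box.

    This is an illustrative score, not the benchmark's metric.
    """
    if len(plan) != len(targets):
        raise ValueError("plan and targets must have the same length")
    if not plan:
        return 0.0
    hits = sum(point_in_box(s.point, b) for s, b in zip(plan, targets))
    return hits / len(plan)


# A two-step plan: the first point is inside its target, the second is not.
plan = [
    GroundedStep("grasp the mug", (120.0, 80.0)),
    GroundedStep("place the mug on the shelf", (300.0, 40.0)),
]
targets = [
    (100.0, 60.0, 140.0, 100.0),
    (350.0, 20.0, 400.0, 60.0),
]
score = grounding_accuracy(plan, targets)
print(score)  # 0.5
```

The point of the sketch is only that a plan can read sensibly as text ("place the mug on the shelf") and still fail a spatial check, which is the gap the abstract says current benchmarks do not assess.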

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics