C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models

Nayoung Oh; Dohyun Kim; Junhyeong Bang; Rohan Paul; Daehyung Park

arXiv:2511.15333·cs.RO·November 20, 2025

Open Access

TL;DR

C2F-Space introduces a coarse-to-fine framework for grounding spatial instructions: a vision-language model first estimates a coarse region, and superpixel-based refinement then aligns that region with the local environment. It outperforms five state-of-the-art baselines in success rate and IoU.

Contribution

The paper presents a novel two-step space-grounding framework that enhances vision-language models with superpixel-based refinement for precise spatial localization.

Findings

01. Outperforms five state-of-the-art baselines in success rate and IoU.

02. Effectiveness of each module confirmed through ablation studies.

03. Demonstrated applicability in robotic pick-and-place tasks.
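
The IoU metric named in the first finding is the standard intersection-over-union between a predicted region and a reference region. The sketch below is a minimal illustration on pixel sets, not the paper's evaluation code; the function name `iou`, the toy regions, and the convention of returning 0.0 for two empty regions are all assumptions made here. This page does not define the success-rate criterion, so it is not sketched.

```python
def iou(predicted, target):
    """Intersection over union of two pixel sets.

    Returns 0.0 when both sets are empty (an assumed convention).
    """
    union = predicted | target
    if not union:
        return 0.0
    return len(predicted & target) / len(union)


# Toy regions (hypothetical): two 4x4 blocks offset by two pixels in each axis.
predicted = {(r, c) for r in range(4) for c in range(4)}
target = {(r, c) for r in range(2, 6) for c in range(2, 6)}

# The blocks share a 2x2 patch (4 pixels); the union is 16 + 16 - 4 = 28 pixels.
score = iou(predicted, target)
print(f"IoU = {score:.3f}")
```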

Abstract

Space grounding refers to localizing a set of spatial references described in natural language instructions. Traditional methods often fail to account for complex reasoning -- such as distance, geometry, and inter-object relationships -- while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce fine-grained output regions. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that (i) estimates an approximate yet spatially consistent region using a VLM, then (ii) refines the region to align with the local environment through superpixelization. For the coarse estimation, we design a grid-based visual-grounding prompt with a propose-validate strategy, maximizing the VLM's spatial understanding and yielding physically and semantically valid canonical regions (i.e., ellipses). For the refinement, we locally…
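
The two stages described in the abstract can be pictured with a small sketch: a coarse elliptical region is rasterized, then snapped to superpixel boundaries. This is only an illustration under stated assumptions. The abstract is truncated before it specifies the refinement rule, so the overlap threshold `min_overlap`, the helper names `ellipse_mask` and `refine_with_superpixels`, and the toy 4x4 block "superpixels" are hypothetical stand-ins, and the ellipse parameters are hard-coded where the framework would obtain them from a VLM.

```python
from collections import defaultdict


def ellipse_mask(height, width, cx, cy, a, b):
    """Return the set of (row, col) pixels inside an axis-aligned ellipse."""
    return {
        (r, c)
        for r in range(height)
        for c in range(width)
        if ((c - cx) / a) ** 2 + ((r - cy) / b) ** 2 <= 1.0
    }


def refine_with_superpixels(coarse, labels, min_overlap=0.5):
    """Keep whole superpixels that the coarse region mostly covers.

    `labels` is a 2D list of integer superpixel ids (any segmentation method
    could produce it). `min_overlap` is an assumed threshold, not a value
    taken from the paper.
    """
    members = defaultdict(set)
    for r, row in enumerate(labels):
        for c, label in enumerate(row):
            members[label].add((r, c))

    refined = set()
    for pixels in members.values():
        if len(pixels & coarse) / len(pixels) >= min_overlap:
            refined |= pixels
    return refined


# Toy 8x8 scene split into four 4x4 blocks standing in for real superpixels.
labels = [[(r // 4) * 2 + (c // 4) for c in range(8)] for r in range(8)]

# Coarse region: in the framework this ellipse would come from the VLM stage.
coarse = ellipse_mask(8, 8, cx=2.0, cy=1.5, a=3.0, b=2.0)
refined = refine_with_superpixels(coarse, labels)

print(len(coarse), len(refined))
```

In this toy case the ellipse spills slightly across the boundary into the top-right block and misses the corners of the top-left block; the refinement drops the spill and fills the corners, so the final region follows the segment boundary instead of the ellipse outline.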

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization