C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models
Nayoung Oh, Dohyun Kim, Junhyeong Bang, Rohan Paul, Daehyung Park

TL;DR
C2F-Space is a coarse-to-fine framework for grounding spatial instructions: a vision-language model estimates a coarse region, and superpixel-based refinement aligns that region with the local environment, improving success rate and IoU over five state-of-the-art baselines.
Contribution
The paper presents a novel two-step space-grounding framework that enhances vision-language models with superpixel-based refinement for precise spatial localization.
Findings
Outperforms five state-of-the-art baselines in success rate and IoU.
Effectiveness of each module confirmed through ablation studies.
Demonstrated applicability in robotic pick-and-place tasks.
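For readers unfamiliar with the IoU metric mentioned in the findings, the following is a minimal sketch of intersection-over-union between a predicted region and a target region. The function name `mask_iou` and the toy masks are illustrative assumptions; the summary does not describe the paper's exact evaluation protocol.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equal-sized 2D lists of 0/1.

    Illustrative helper only; not taken from the paper's evaluation code.
    """
    inter = 0  # pixels set in both masks
    union = 0  # pixels set in at least one mask
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks have no union; report 0.0 by convention here.
    return inter / union if union else 0.0


# Toy example: two 2x2 blocks that overlap in a single column.
pred = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
target = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
iou = mask_iou(pred, target)  # 2 shared pixels out of 6 covered pixels
```

A higher IoU means the predicted region overlaps the target region more tightly, so the metric penalizes both regions that are too large and regions that are too small.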
Abstract
Space grounding refers to localizing a set of spatial references described in natural language instructions. Traditional methods often fail to account for complex reasoning -- such as distance, geometry, and inter-object relationships -- while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce fine-grained output regions. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that (i) estimates an approximate yet spatially consistent region using a VLM, then (ii) refines the region to align with the local environment through superpixelization. For the coarse estimation, we design a grid-based visual-grounding prompt with a propose-validate strategy, maximizing the VLM's spatial understanding and yielding physically and semantically valid canonical regions (i.e., ellipses). For the refinement, we locally…
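The two-step idea in the abstract can be illustrated with a small sketch: rasterize a coarse ellipse, then snap it to superpixel boundaries. Everything here is an assumption made for illustration. The ellipse parameters are hard-coded instead of coming from a VLM, the superpixels are toy 2x2 blocks instead of a real superpixel segmentation, and the overlap-threshold rule in `refine_with_superpixels` is a stand-in, because the abstract is cut off before it describes the actual refinement step. It should not be read as the authors' method.

```python
def ellipse_mask(height, width, cx, cy, rx, ry):
    """Rasterize an axis-aligned ellipse into a 2D list of 0/1 (hypothetical helper)."""
    return [
        [1 if ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0 else 0
         for x in range(width)]
        for y in range(height)
    ]


def refine_with_superpixels(coarse, labels, min_overlap=0.5):
    """Keep each superpixel whose fraction of pixels inside `coarse` is >= min_overlap.

    Assumed refinement rule for illustration; the paper's rule is not given here.
    """
    total = {}   # pixels per superpixel label
    inside = {}  # pixels per label that fall inside the coarse mask
    for coarse_row, label_row in zip(coarse, labels):
        for c, lab in zip(coarse_row, label_row):
            total[lab] = total.get(lab, 0) + 1
            inside[lab] = inside.get(lab, 0) + (1 if c else 0)
    keep = {lab for lab in total if inside[lab] / total[lab] >= min_overlap}
    return [[1 if lab in keep else 0 for lab in row] for row in labels]


HEIGHT, WIDTH = 4, 6

# Step (i), stand-in: a coarse ellipse with made-up parameters.
coarse = ellipse_mask(HEIGHT, WIDTH, cx=2.5, cy=1.5, rx=2.0, ry=1.6)

# Toy superpixels: the image is tiled into 2x2 blocks, one label per block.
labels = [[(y // 2) * (WIDTH // 2) + (x // 2) for x in range(WIDTH)]
          for y in range(HEIGHT)]

# Step (ii), stand-in: snap the coarse region to superpixel boundaries.
refined = refine_with_superpixels(coarse, labels, min_overlap=0.5)

for row in refined:
    print(row)
```

In this toy case the ellipse spills one pixel into each side block, and the refinement drops those blocks because only a quarter of their pixels are covered. The point of the sketch is the design choice: a coarse shape supplies the global placement, while superpixels supply boundaries that follow local image structure.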
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
