TL;DR
This paper introduces FSD (From Seeing to Doing), a vision-language model that improves robotic manipulation by reasoning about spatial relationships. It reports stronger zero-shot generalization, with 40.6% success in SimplerEnv and 72% success across 8 real-world tasks.
Contribution
FSD is a novel model that integrates spatial reasoning with vision-language understanding to improve robotic manipulation and generalization in unseen scenarios.
Findings
FSD achieves 40.6% success in SimplerEnv.
FSD attains 72% success across 8 real-world tasks.
FSD outperforms baseline methods by 30% in success rate.
Abstract
Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD's capabilities in both "seeing" and "doing," achieving outstanding…