From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Yifu Yuan; Haiqin Cui; Yibin Chen; Zibin Dong; Fei Ni; Longxin Kou; Jinyi Liu; Pengyi Li; Yan Zheng; Jianye Hao

arXiv:2505.08548 · cs.RO · April 7, 2026

TL;DR

This paper introduces FSD, a vision-language model that enhances robotic manipulation by reasoning about spatial relationships, leading to improved zero-shot generalization and success rates in various benchmarks and real-world tasks.

Contribution

FSD is a novel model that integrates spatial reasoning with vision-language understanding to improve robotic manipulation and generalization in unseen scenarios.

Findings

01 FSD achieves a 40.6% success rate in SimplerEnv.

02 FSD attains a 72% success rate across 8 real-world tasks.

03 FSD outperforms baseline methods by 30% in success rate.

Abstract

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD's capabilities in both "seeing" and "doing," achieving outstanding…

Videos

From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation (SlidesLive)