Scaffolding Dexterous Manipulation with Vision-Language Models

Vincent de Bakker; Joey Hejna; Tyler Ga Wei Lum; Onur Celik; Aleksandar Taranovic; Denis Blessing; Gerhard Neumann; Jeannette Bohg; Dorsa Sadigh

arXiv:2506.19212 · cs.RO · January 13, 2026

Open Access

TL;DR

This paper introduces a method that leverages vision-language models to generate task-relevant trajectories, enabling the training of dexterous robotic manipulation policies in simulation that transfer effectively to real-world robots without demonstrations.

Contribution

The novel approach uses off-the-shelf vision-language models to identify keypoints and synthesize trajectories, reducing the need for explicit reference trajectories and handcrafted rewards in dexterous manipulation.

Findings

01. Successfully trained policies in simulation for complex tasks.

02. Transferred policies to real robots without demonstrations.

03. Demonstrated robustness across multiple articulated object tasks.

Abstract

Dexterous robotic hands are essential for performing complex manipulation tasks, yet remain difficult to train due to the challenges of demonstration collection and high-dimensional control. While reinforcement learning (RL) can alleviate the data bottleneck by generating experience in simulation, it typically relies on carefully designed, task-specific reward functions, which hinder scalability and generalization. Thus, contemporary works in dexterous manipulation have often bootstrapped from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories, particularly for dexterous hands, remains a significant challenge. Yet, the precise details in explicit reference trajectories are often unnecessary, as RL ultimately refines the motion.…
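
To make the idea of a dense, task-agnostic reward derived from a reference trajectory concrete, here is a minimal sketch in Python. It is not the paper's reward: the exponential form, the `scale` parameter, the even spacing of waypoints over the episode, and the function names `keypoint_tracking_reward` and `waypoint_at` are all assumptions chosen for illustration. It shows only how a coarse sequence of target keypoints could be turned into a per-step reward signal.

```python
import math


def keypoint_tracking_reward(current, target, scale=10.0):
    """Dense reward in (0, 1]: exp(-scale * mean Euclidean distance).

    `current` and `target` are equal-length lists of (x, y, z) keypoints.
    The exponential form and `scale` are illustrative assumptions.
    """
    if len(current) != len(target) or not current:
        raise ValueError("keypoint lists must be non-empty and equal length")
    mean_dist = sum(math.dist(c, t) for c, t in zip(current, target)) / len(current)
    return math.exp(-scale * mean_dist)


def waypoint_at(trajectory, step, horizon):
    """Pick the reference waypoint for `step`, spreading waypoints evenly over the episode."""
    idx = min(step * len(trajectory) // horizon, len(trajectory) - 1)
    return trajectory[idx]


# Hypothetical 3-waypoint trajectory for a single object keypoint.
trajectory = [[(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)], [(0.2, 0.0, 0.1)]]
target = waypoint_at(trajectory, step=50, horizon=100)
reward = keypoint_tracking_reward([(0.1, 0.0, 0.0)], target)
```

Because the reward depends only on keypoint distances, the same function could in principle be reused across tasks by swapping in a different waypoint sequence, which is the sense in which such rewards are task-agnostic.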

Taxonomy

Topics: Robot Manipulation and Learning