Unfettered Forceful Skill Acquisition with Physical Reasoning and Coordinate Frame Labeling

William Xie; Max Conway; Yutong Zhang; Nikolaus Correll

arXiv:2505.09731·cs.RO·May 16, 2025

Open Access

TL;DR

This paper introduces a method that lets vision language models (VLMs) reason about physical forces in manipulation tasks by overlaying coordinate frame labels on robot camera images, achieving zero-shot generalization and continual reasoning across multiple robotic tasks.

Contribution

The work presents a new framework that uses visual coordinate frame labeling to allow VLMs to explicitly reason about forces, leading to improved zero-shot manipulation and failure recovery.

Findings

01. Achieved a 51% success rate across four manipulation tasks.

02. Enabled VLMs to recover from task failures without human supervision.

03. Demonstrated generalization across different robots and perspectives.

Abstract

Vision language models (VLMs) exhibit vast knowledge of the physical world, including intuition of physical and spatial properties, affordances, and motion. With fine-tuning, VLMs can also natively produce robot trajectories. We demonstrate that eliciting wrenches, not trajectories, allows VLMs to explicitly reason about forces and leads to zero-shot generalization in a series of manipulation tasks without pretraining. We achieve this by overlaying a consistent visual representation of relevant coordinate frames on robot-attached camera images to augment our query. First, we show how this addition enables a versatile motion control framework evaluated across four tasks (opening and closing a lid, pushing a cup or chair) spanning prismatic and rotational motion, an order of force and position magnitude, different camera perspectives, annotation schemes, and two robot platforms over 220…
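The abstract's core idea (label a coordinate frame on the camera image, then ask for a wrench expressed in that frame) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple pinhole camera model, and the names `Intrinsics`, `Wrench`, `project`, and `frame_axes_pixels` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters in pixels (assumed model, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


@dataclass(frozen=True)
class Wrench:
    """A force (N) and torque (N*m) expressed in a named coordinate frame."""
    force: Vec3
    torque: Vec3
    frame: str


def project(p: Vec3, k: Intrinsics) -> Pixel:
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


def frame_axes_pixels(
    origin: Vec3,
    rotation: Sequence[Sequence[float]],
    axis_len: float,
    k: Intrinsics,
) -> Dict[str, Tuple[Pixel, Pixel]]:
    """Return (start, tip) pixel pairs for the x, y, z axes of a frame.

    `rotation` is a 3x3 matrix whose columns are the frame's axes
    expressed in camera coordinates.
    """
    start = project(origin, k)
    segments: Dict[str, Tuple[Pixel, Pixel]] = {}
    for i, name in enumerate("xyz"):
        direction = (rotation[0][i], rotation[1][i], rotation[2][i])
        tip = (
            origin[0] + axis_len * direction[0],
            origin[1] + axis_len * direction[1],
            origin[2] + axis_len * direction[2],
        )
        segments[name] = (start, project(tip, k))
    return segments


# Example: a frame 1 m in front of the camera, aligned with the camera axes.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
camera = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
axes = frame_axes_pixels((0.0, 0.0, 1.0), identity, 0.1, camera)

# A wrench a model might be asked to return, expressed in the labeled frame
# (the values here are made up for illustration).
push = Wrench(force=(5.0, 0.0, 0.0), torque=(0.0, 0.0, 0.0), frame="labeled")
```

The pixel segments in `axes` are what a drawing routine would render as labeled arrows on the image; the `frame` field on `Wrench` records which labeled frame the force and torque components refer to, so that a controller can transform them before execution.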

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning