Training-Free Robot Pose Estimation using Off-the-Shelf Foundational Models
Laurence Liang

TL;DR
This paper explores using pre-existing vision-language models to estimate robot arm joint angles from images without additional training, providing a baseline and analyzing their limitations in real-world scenarios.
Contribution
It introduces a novel approach of leveraging off-the-shelf vision-language models for robot pose estimation, bypassing the need for training.
Findings
Evaluating frontier vision-language models on synthetic and real-world images establishes a performance baseline for joint angle estimation.
Neither test-time scaling nor parameter scaling alone leads to improved joint angle predictions.
Performance varies between synthetic and real-world data.
Abstract
Pose estimation of a robot arm from visual inputs is a challenging task. However, with the increasing adoption of robot arms for both industrial and residential use cases, reliable joint angle estimation can offer improved safety and performance guarantees, and can also serve as a verifier to further train robot policies. This paper introduces the use of frontier vision-language models (VLMs) as an "off-the-shelf" tool to estimate a robot arm's joint angles from a single target image. By evaluating frontier VLMs on both synthetic and real-world image-data pairs, this paper establishes a performance baseline attained by current VLMs. In addition, this paper presents empirical results suggesting that test-time scaling or parameter scaling alone does not lead to improved joint angle predictions.
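To make the idea of a joint angle baseline concrete, the sketch below shows one plausible way to score a prediction against ground truth: a mean absolute error over joints, with each difference wrapped so that 350 degrees and 10 degrees count as 20 degrees apart. This is an illustrative assumption only. The abstract does not state which error metric the paper uses, and the function names here are hypothetical.

```python
def wrapped_angle_error_deg(predicted, target):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    # Shift into [-180, 180) so that wrap-around at 360 degrees is handled.
    diff = (predicted - target + 180.0) % 360.0 - 180.0
    return abs(diff)


def mean_joint_angle_error_deg(predicted_angles, target_angles):
    """Mean wrapped absolute error across all joints of one arm pose.

    Assumed metric for illustration; not taken from the paper.
    """
    if len(predicted_angles) != len(target_angles):
        raise ValueError("predicted and target must have the same number of joints")
    if not predicted_angles:
        raise ValueError("at least one joint angle is required")
    errors = [
        wrapped_angle_error_deg(p, t)
        for p, t in zip(predicted_angles, target_angles)
    ]
    return sum(errors) / len(errors)
```

For example, mean_joint_angle_error_deg([350.0, 90.0], [10.0, 80.0]) averages a wrapped error of 20 degrees and an error of 10 degrees. Averaging such a score over a set of image-data pairs would give a single number for comparing models, under the assumptions stated above.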
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Hand Gesture Recognition Systems
