Affordance-Guided Reinforcement Learning via Visual Prompting
Olivia Y. Lee, Annie Xie, Kuan Fang, Karl Pertsch, Chelsea Finn

TL;DR
This paper introduces KAGI, a method that uses vision-language models to shape dense rewards for reinforcement learning, improving sample efficiency and robustness on robotic manipulation tasks specified in natural language.
Contribution
KAGI leverages zero-shot affordance reasoning from vision-language models to provide dense rewards, enhancing autonomous RL performance in real-world manipulation tasks.
Findings
KAGI improves sample efficiency in robotic learning tasks.
KAGI enables successful task completion with fewer demonstrations.
KAGI maintains robustness even with reduced pre-training data.
Abstract
Robots equipped with reinforcement learning (RL) have the potential to learn a wide range of skills solely from a reward signal. However, obtaining a robust and dense reward signal for general manipulation tasks remains a challenge. Existing learning-based approaches require significant data, such as human demonstrations of success and failure, to learn task-specific reward functions. Recently, large multi-modal foundation models have been increasingly adopted in robotics; these models can perform visual reasoning in physical contexts and generate coarse robot motions for manipulation tasks. Motivated by this range of capabilities, we present Keypoint-based Affordance Guidance for Improvements (KAGI), a method that leverages rewards shaped by vision-language models (VLMs) for autonomous RL. State-of-the-art VLMs have demonstrated impressive zero-shot reasoning about affordances…
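The abstract above does not give KAGI's reward formula. Purely as an illustration of the general idea of turning VLM-proposed keypoints into a dense reward, here is a hypothetical sketch. It assumes the VLM returns an ordered list of 3D waypoints and that the environment exposes the end-effector position and a sparse success flag. The function name `shaped_reward`, the waypoint format, the tolerance `tol`, and `success_bonus` are all assumptions, not the paper's actual design.

```python
import numpy as np


def shaped_reward(ee_pos, waypoints, next_idx, success, tol=0.02, success_bonus=10.0):
    """Hypothetical dense reward from VLM-proposed waypoints (not KAGI's exact reward).

    ee_pos:    (3,) end-effector position.
    waypoints: (K, 3) ordered keypoints, assumed to come from a VLM prompt.
    next_idx:  index of the first waypoint not yet reached.
    success:   sparse task-success flag from the environment.
    Returns (reward, updated next_idx).
    """
    waypoints = np.asarray(waypoints, dtype=float)
    ee_pos = np.asarray(ee_pos, dtype=float)
    k = len(waypoints)
    if k == 0:
        raise ValueError("need at least one waypoint")

    # Advance past every waypoint the end effector is already within `tol` of.
    while next_idx < k and np.linalg.norm(ee_pos - waypoints[next_idx]) < tol:
        next_idx += 1

    # Distance to the current target waypoint (zero once all are reached).
    dist = np.linalg.norm(ee_pos - waypoints[next_idx]) if next_idx < k else 0.0

    # Stage progress in [0, 1] minus distance gives a dense signal that
    # increases as the arm moves along the proposed keypoint path.
    reward = next_idx / k - dist
    if success:
        reward += success_bonus  # keep the sparse task reward dominant
    return float(reward), next_idx


# Usage: two waypoints; the arm starts on the first and later reaches the second.
path = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
r, idx = shaped_reward([0.0, 0.0, 0.0], path, 0, success=False)
r2, idx2 = shaped_reward([1.0, 0.0, 0.0], path, idx, success=True)
```

The progress term rewards completing earlier stages, and the distance term gives a gradient toward the next keypoint between stages. The sparse success bonus is kept large so that the shaped terms guide exploration without replacing the task objective.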
Taxonomy
Topics: Reinforcement Learning in Robotics
