Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, Laura Sevilla-Lara

TL;DR
This paper presents a system that learns precise object affordances from egocentric videos to improve robotic manipulation, addressing data scarcity, generalization, and real-world deployment challenges.
Contribution
It introduces a novel affordance learning framework that uses egocentric videos and geometric info to enhance generalization and deployability in robotic tasks.
Findings
Outperforms state-of-the-art by 13.8% in mIoU
Achieves 77.1% success rate in robotic grasping
Effective on seen, unseen, and cluttered scenes
Abstract
Affordance, defined as the potential actions that an object offers, is crucial for embodied AI agents. For example, such knowledge directs an agent to grasp a knife by the handle for cutting or by the blade for safe handover. While existing approaches have made notable progress, affordance research still faces three key challenges: data scarcity, poor generalization, and real-world deployment. Specifically, there is a lack of large-scale affordance datasets with precise segmentation maps, existing models struggle to generalize across different domains or novel object and affordance classes, and little work demonstrates deployability in real-world scenarios. In this work, we address these issues by proposing a complete affordance learning system that (1) takes in egocentric videos and outputs precise affordance annotations without human labeling, (2) leverages geometric information and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRobot Manipulation and Learning · Robotic Mechanisms and Dynamics · Reinforcement Learning in Robotics
MethodsLinear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax
