Affordance Grounding from Demonstration Video to Target Image
Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou

TL;DR
This paper introduces Afformer, a transformer-based model that, together with the self-supervised pre-training technique MaskAHand, improves the grounding of human hand affordances from demonstration videos to target images, a capability important for robotic and AR applications.
Contribution
The paper presents a novel transformer-based model (Afformer) and a self-supervised pre-training method (MaskAHand) that together improve affordance grounding from videos to images, addressing the two challenges the authors identify: fine-grained prediction and scarce training data.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Improves performance by 37% on the OPRA dataset.
Demonstrates effective generalization across video-image discrepancies.
Abstract
Humans excel at learning from expert demonstrations and solving their own problems. To equip intelligent robots and assistants, such as AR glasses, with this ability, it is essential to ground human hand interactions (i.e., affordances) from demonstration videos and apply them to a target image like a user's AR glass view. The video-to-image affordance grounding task is challenging due to (1) the need to predict fine-grained affordances, and (2) the limited training data, which inadequately covers video-image discrepancies and negatively impacts grounding. To tackle them, we propose Affordance Transformer (Afformer), which has a fine-grained transformer-based decoder that gradually refines affordance grounding. Moreover, we introduce Mask Affordance Hand (MaskAHand), a self-supervised pre-training technique for synthesizing video-image data and simulating context changes, enhancing…
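The abstract is cut off before it explains how MaskAHand builds its training data, so the following is only a minimal sketch of the general idea it names (synthesizing a video-image pair and simulating a context change), not the authors' implementation. Everything here is an assumption made for illustration: the function name `synthesize_pair`, the hand box given as pixel coordinates, the use of a wrap-around translation as the "context change", and the Gaussian heatmap as the grounding target.

```python
import numpy as np


def synthesize_pair(frame, hand_box, rng, max_shift=8, sigma=4.0):
    """Return (video_frame, target_image, heatmap) from a single frame.

    frame:    (H, W, 3) uint8 image that contains a hand.
    hand_box: (x0, y0, x1, y1) pixel box around the hand (hypothetical input,
              e.g. from an off-the-shelf hand detector).
    """
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = hand_box

    # Assumption: the target image is the same frame with the hand hidden,
    # so the location has to be inferred from the demonstration frame.
    masked = frame.copy()
    masked[y0:y1, x0:x1] = 0

    # Assumption: a random wrap-around translation stands in for the
    # video-to-image context change.
    dx, dy = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    target = np.roll(masked, shift=(dy, dx), axis=(0, 1))

    # Supervision: a normalized Gaussian centered on the shifted hand center.
    cx = ((x0 + x1) // 2 + dx) % w
    cy = ((y0 + y1) // 2 + dy) % h
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap /= heatmap.sum()

    return frame, target, heatmap


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = np.full((32, 32, 3), 255, dtype=np.uint8)  # dummy white frame
    video, target, heat = synthesize_pair(frame, (10, 12, 18, 20), rng)
    print(video.shape, target.shape, heat.shape)
```

In this sketch the unmasked frame plays the role of the demonstration video, and the masked, shifted copy plays the role of the target image. A model pre-trained on such pairs would be asked to predict the heatmap on the target given the demonstration.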
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dropout · Dense Connections
