Affordance Grounding from Demonstration Video to Target Image

Joya Chen; Difei Gao; Kevin Qinghong Lin; Mike Zheng Shou

arXiv:2303.14644 · cs.CV · March 28, 2023 · 1 citation

TL;DR

This paper introduces Afformer, a transformer-based model, together with a self-supervised pre-training technique (MaskAHand). The combination improves the grounding of human hand affordances from demonstration videos to target images, a capability needed for robotic and AR applications.

Contribution

The paper presents a transformer-based model (Afformer) and a self-supervised pre-training method (MaskAHand) that improve affordance grounding from videos to images. They target the two challenges the authors identify: fine-grained affordance prediction and limited training data.

Findings

01 Achieves state-of-the-art results on multiple benchmarks.
02 Improves performance by 37% on the OPRA dataset.
03 Demonstrates effective generalization across video-image discrepancies.

Abstract

Humans excel at learning from expert demonstrations and solving their own problems. To equip intelligent robots and assistants, such as AR glasses, with this ability, it is essential to ground human hand interactions (i.e., affordances) from demonstration videos and apply them to a target image like a user's AR glass view. The video-to-image affordance grounding task is challenging due to (1) the need to predict fine-grained affordances, and (2) the limited training data, which inadequately covers video-image discrepancies and negatively impacts grounding. To tackle them, we propose Affordance Transformer (Afformer), which has a fine-grained transformer-based decoder that gradually refines affordance grounding. Moreover, we introduce Mask Affordance Hand (MaskAHand), a self-supervised pre-training technique for synthesizing video-image data and simulating context changes, enhancing…
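To make the idea of synthesizing video-image training data more concrete, here is a minimal sketch in plain Python. It assumes, as the name MaskAHand suggests, that a hand region in a video frame is masked out to produce a pseudo target image, and that the masked location serves as the pseudo affordance label. The function names (`mask_region`, `gaussian_heatmap`, `synthesize_pair`), the box format, and the Gaussian label are all hypothetical choices for illustration. The sketch does not attempt to simulate context changes, and it is not the authors' implementation.

```python
import math

def mask_region(frame, box, fill=0.0):
    """Return a copy of `frame` (a list of rows) with `box` filled in.

    `box` is (top, left, bottom, right); bottom and right are exclusive.
    This box format is an assumption made for the sketch.
    """
    top, left, bottom, right = box
    masked = [row[:] for row in frame]  # copy so the input frame is untouched
    for y in range(top, bottom):
        for x in range(left, right):
            masked[y][x] = fill
    return masked

def gaussian_heatmap(height, width, center, sigma=1.5):
    """Build a heatmap that peaks at `center` (y, x) and sums to 1."""
    cy, cx = center
    heat = [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in heat)
    return [[value / total for value in row] for row in heat]

def synthesize_pair(frame, hand_box):
    """Turn one frame plus a hand box into (pseudo target, pseudo label)."""
    top, left, bottom, right = hand_box
    height, width = len(frame), len(frame[0])
    target = mask_region(frame, hand_box)
    # Center of the masked box in pixel coordinates.
    center = ((top + bottom - 1) / 2, (left + right - 1) / 2)
    label = gaussian_heatmap(height, width, center)
    return target, label

# Toy 6x8 grayscale "frame" with a hypothetical hand box.
frame = [[1.0] * 8 for _ in range(6)]
target, label = synthesize_pair(frame, hand_box=(2, 3, 4, 6))
```

In this toy setup, a grounding model would receive the original frame sequence as the demonstration and `target` as the image, and would be trained to predict `label`. No manual affordance annotation is needed, which is what makes such a scheme self-supervised.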

Code & Models

Repositories

showlab/afformer (PyTorch, official)

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dropout · Dense Connections