Learning Visual Affordance Grounding from Demonstration Videos
Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao

TL;DR
This paper introduces HAG-Net, a network that uses hand cues from demonstration videos to improve visual affordance grounding, achieving state-of-the-art results in segmenting interaction regions.
Contribution
It proposes a dual-branch network with hand-aided attention and semantic enhancement to better locate interaction regions by leveraging demonstration videos.
Findings
Achieves state-of-the-art results on two datasets.
Effectively leverages hand cues to improve segmentation.
Outperforms existing appearance-based methods.
Abstract
Visual affordance grounding aims to segment all possible interaction regions between people and objects from an image/video, which is beneficial for many applications, such as robot grasping and action recognition. However, existing methods mainly rely on the appearance features of objects to segment each region of the image, an approach that faces two problems: (i) there are multiple possible regions in an object that people interact with; and (ii) there are multiple possible human interactions in the same object region. To address these problems, we propose a Hand-aided Affordance Grounding Network (HAG-Net) that leverages the auxiliary cues provided by the position and action of the hand in demonstration videos to eliminate the multiple possibilities and better locate the interaction regions in the object. Specifically, HAG-Net has a dual-branch structure to process the…
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
