Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions
Tianmin Shu, Xiaofeng Gao, Michael S. Ryoo, Song-Chun Zhu

TL;DR
This paper introduces a framework for learning a social affordance grammar from RGB-D videos of human interactions and transferring it to humanoid robots for real-time motion inference in human-robot interaction. The approach is evaluated in Baxter simulation, a human evaluation, and a real Baxter test.
Contribution
It presents a weakly supervised, Gibbs sampling-based method that learns a hierarchical social affordance grammar, represented as a spatiotemporal AND-OR graph (ST-AOG), from RGB-D videos of human interactions for use in human-robot interaction.
Findings
Generates human-like behaviors in unseen scenarios from limited training data
Outperforms both baselines in the reported experiments
Enables real-time motion inference for humanoid robots
Abstract
In this paper, we present a general framework for learning social affordance grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human interactions, and transfer the grammar to humanoids to enable real-time motion inference for human-robot interaction (HRI). Based on Gibbs sampling, our weakly supervised grammar learning can automatically construct a hierarchical representation of an interaction with long-term joint sub-tasks of both agents and short-term atomic actions of individual agents. Using a new RGB-D video dataset with rich instances of human interactions, our experiments (Baxter simulation, human evaluation, and a real Baxter test) demonstrate that the model learned from limited training data successfully generates human-like behaviors in unseen scenarios and outperforms both baselines.
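To make the hierarchical representation concrete, the following is a minimal sketch of a generic AND-OR graph and how one parse can be sampled from it. It is not the authors' ST-AOG or their Gibbs sampling learner: the `Node` class, the `sample_parse` function, the toy "handshake" grammar, and all branching weights are hypothetical, chosen only to illustrate how AND nodes compose sub-tasks in sequence while OR nodes choose among alternative atomic actions.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a toy AND-OR graph (hypothetical, for illustration only)."""
    name: str
    kind: str  # "and", "or", or "leaf"
    children: List["Node"] = field(default_factory=list)
    weights: Optional[List[float]] = None  # branching weights, used by OR nodes


def sample_parse(node: Node, rng: random.Random) -> List[str]:
    """Sample one parse of the graph and return its leaf names in order."""
    if node.kind == "leaf":
        return [node.name]
    if node.kind == "or":
        # An OR node selects exactly one of its alternatives.
        child = rng.choices(node.children, weights=node.weights, k=1)[0]
        return sample_parse(child, rng)
    # An AND node expands all of its children in sequence.
    leaves: List[str] = []
    for child in node.children:
        leaves.extend(sample_parse(child, rng))
    return leaves


def leaf(name: str) -> Node:
    return Node(name, "leaf")


# Hypothetical toy grammar: an interaction (AND) made of sub-tasks,
# each of which (OR) picks one atomic action. Weights are made up.
approach = Node("approach", "or",
                [leaf("walk_forward"), leaf("stand_still")], [0.7, 0.3])
reach = Node("reach", "or",
             [leaf("raise_right_arm"), leaf("raise_left_arm")], [0.8, 0.2])
handshake = Node("handshake", "and", [approach, reach, leaf("shake_hand")])

parse = sample_parse(handshake, random.Random(0))
print(parse)
```

Each call to `sample_parse` yields one sequence of three atomic actions. In the framework described in the abstract, the structure of such a grammar is learned from video rather than written by hand, and sub-tasks are joint across both agents; this sketch omits both aspects.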
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications
