Controlling the World by Sleight of Hand
Sruthi Sudhakar, Ruoshi Liu, Basile Van Hoorick, Carl Vondrick, Richard Zemel

TL;DR
This paper introduces CosHand, a generative model trained on unlabeled videos that predicts the effects of hand-object interactions, enabling realistic image synthesis of future states conditioned on specific actions.
Contribution
It presents a novel action-conditional generative model that learns from unlabeled videos to predict and synthesize future images after hand-object interactions.
Findings
Strong generalization to unseen objects and environments
Ability to generate multiple possible future outcomes
Effective modeling of interaction uncertainty
Abstract
Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose to learn action-conditional generative models from unlabeled videos of human hands interacting with objects. The vast quantity of such data on the internet allows for efficient scaling, which can enable high-performing action-conditional models. Given an image and the shape/location of a desired hand interaction, CosHand synthesizes an image of a future after the interaction has occurred. Experiments show…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Robot Manipulation and Learning
