Controlling the World by Sleight of Hand

Sruthi Sudhakar, Ruoshi Liu, Basile Van Hoorick, Carl Vondrick, Richard Zemel

arXiv:2408.07147·cs.CV·August 15, 2024

Open Access

TL;DR

This paper introduces CosHand, a generative model trained on unlabeled videos that predicts the effects of hand-object interactions, enabling realistic image synthesis of future states conditioned on specific actions.

Contribution

It presents a novel action-conditional generative model that learns from unlabeled videos to predict and synthesize future images after hand-object interactions.

Findings

01. Strong generalization to unseen objects and environments
02. Ability to generate multiple possible future outcomes
03. Effective modeling of interaction uncertainty

Abstract

Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose to learn action-conditional generative models from unlabeled videos of human hands interacting with objects. The vast quantity of such data on the internet allows for efficient scaling, which can enable high-performing action-conditional models. Given an image and the shape/location of a desired hand interaction, CosHand synthesizes an image of the future after the interaction has occurred. Experiments show…
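
The abstract describes the model's input as an image plus the shape/location of a desired hand interaction. As a rough illustration of what such a conditioning input could look like, the sketch below encodes the hand as a binary mask and appends it to the image as an extra channel. This is an assumption made for illustration only: the listing does not say how CosHand represents the hand, the names `box_mask` and `build_condition` are hypothetical, the rectangular region stands in for a real hand shape, and no generative model is included.

```python
from typing import List, Tuple

Image = List[List[Tuple[float, float, float]]]  # H x W grid of RGB pixels
Mask = List[List[int]]                          # H x W grid of 0/1 values


def box_mask(height: int, width: int, box: Tuple[int, int, int, int]) -> Mask:
    """Binary mask that is 1 inside box = (top, left, bottom, right), ends exclusive.

    Assumption: a rectangle stands in for the hand's shape and location.
    """
    top, left, bottom, right = box
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]


def build_condition(image: Image, hand_mask: Mask) -> List[List[Tuple[float, ...]]]:
    """Append the hand mask to the RGB image as a fourth channel (hypothetical encoding)."""
    same_height = len(image) == len(hand_mask)
    same_width = all(len(a) == len(b) for a, b in zip(image, hand_mask))
    if not (same_height and same_width):
        raise ValueError("image and mask must have the same height and width")
    return [
        [(*pixel, float(m)) for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, hand_mask)
    ]


# Hypothetical 4x4 gray image with the desired hand region in the lower-right 2x2 corner.
image = [[(0.5, 0.5, 0.5)] * 4 for _ in range(4)]
mask = box_mask(4, 4, (2, 2, 4, 4))
condition = build_condition(image, mask)
```

In a setup like this, an action-conditional generator would take `condition` as input and produce the predicted future image; changing only the mask would ask the model about a different interaction with the same scene.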

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Robot Manipulation and Learning