Exploring the GLIDE model for Human Action-effect Prediction
Fangjun Li, David C. Hogg, Anthony G. Cohn

TL;DR
This paper investigates using the GLIDE generative model to predict the effects of actions in images by masking and inpainting regions based on textual action descriptions, demonstrating qualitative success on an egocentric video dataset.
Contribution
It introduces a novel application of GLIDE for action-effect prediction in images, combining masking and inpainting conditioned on action descriptions.
Findings
Qualitative results show effective scene updates after actions.
Demonstrates the potential of generative models for action-effect reasoning.
Uses the EPIC dataset of egocentric video for evaluation.
Abstract
We address the following action-effect prediction task. Given an image depicting an initial state of the world and an action expressed in text, predict an image depicting the state of the world following the action. The prediction should have the same scene context as the input image. We explore the use of the recently proposed GLIDE model for performing this task. GLIDE is a generative neural network that can synthesize (inpaint) masked areas of an image, conditioned on a short piece of text. Our idea is to mask-out a region of the input image where the effect of the action is expected to occur. GLIDE is then used to inpaint the masked region conditioned on the required action. In this way, the resulting image has the same background context as the input image, updated to show the effect of the action. We give qualitative results from experiments using the EPIC dataset of ego-centric…
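The mask-then-inpaint idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `box_mask`, `predict_effect`, and `dummy_inpaint` are hypothetical names, the rectangular mask and the convention that `True` marks the region to regenerate are assumptions, and `inpaint_fn` is a stand-in for a text-conditioned inpainting model such as GLIDE rather than its real API.

```python
import numpy as np


def box_mask(height, width, box):
    """Boolean mask that is True inside box = (top, left, bottom, right).

    Assumption: True marks the region where the action's effect is
    expected, i.e. the pixels to be regenerated.
    """
    top, left, bottom, right = box
    mask = np.zeros((height, width), dtype=bool)
    mask[top:bottom, left:right] = True
    return mask


def predict_effect(image, box, action_text, inpaint_fn):
    """Regenerate only the masked region, conditioned on the action text.

    inpaint_fn is a hypothetical callable taking (image, mask, text) and
    returning a full image of the same shape and dtype.
    """
    mask = box_mask(image.shape[0], image.shape[1], box)
    generated = inpaint_fn(image, mask, action_text)
    # Keep the original pixels outside the mask so the scene context
    # is preserved; take generated pixels only inside the mask.
    return np.where(mask[..., None], generated, image)


def dummy_inpaint(image, mask, text):
    """Placeholder model: returns an all-white image regardless of input."""
    return np.full_like(image, 255)


image = np.zeros((64, 64, 3), dtype=np.uint8)  # black stand-in frame
result = predict_effect(image, (16, 16, 48, 48), "open the drawer", dummy_inpaint)
```

With the placeholder model, only the 32×32 box changes; every pixel outside it is identical to the input, which is the property that keeps the background context fixed.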
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation
Methods: Guided Language to Image Diffusion for Generation and Editing
