Learning the Effects of Physical Actions in a Multi-modal Environment

Gautier Dagan; Frank Keller; Alex Lascarides

arXiv:2301.11845·cs.CL·February 6, 2023

TL;DR

This paper introduces a multi-modal approach combining visual and textual data to improve Large Language Models' ability to predict the outcomes of physical actions in an environment, enhancing their commonsense reasoning.

Contribution

It extends LLMs with visual inputs and object representations to better model physical effects, addressing limitations of disembodied training.

Findings

01. Multi-modal models improve prediction accuracy of action outcomes.

02. Combining images and text enhances generalization to new actions and objects.

03. Models better capture physical commonsense with visual augmentation.

Abstract

Large Language Models (LLMs) handle physical commonsense information inadequately. As a result of being trained in a disembodied setting, LLMs often fail to predict an action's outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). Next, we extend an LLM to model latent representations of objects to better predict action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model's performance on novel actions and objects and find that combining modalities helps models to generalize and learn physical…

Code & Models

Repositories

gautierdag/piglet-vis
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications
