Learning the Effects of Physical Actions in a Multi-modal Environment
Gautier Dagan, Frank Keller, Alex Lascarides

TL;DR
This paper introduces a multi-modal approach combining visual and textual data to improve Large Language Models' ability to predict the outcomes of physical actions in an environment, enhancing their commonsense reasoning.
Contribution
It extends LLMs with visual inputs and object representations to better model physical effects, addressing limitations of disembodied training.
Findings
Multi-modal models improve prediction accuracy of action outcomes.
Combining images and text enhances generalization to new actions and objects.
Models better capture physical commonsense with visual augmentation.
Abstract
Large Language Models (LLMs) handle physical commonsense information inadequately. As a result of being trained in a disembodied setting, LLMs often fail to predict an action's outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). Next, we extend an LLM to model latent representations of objects to better predict action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model's performance on novel actions and objects and find that combining modalities helps models to generalize and learn physical…
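The task described in the abstract (predicting an action's outcome from an image plus text) could be represented as data roughly as follows. This is a minimal sketch, not the authors' dataset format or model: the multiple-choice framing, the `ActionEffectExample` fields, and the `first_candidate_baseline` predictor are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class ActionEffectExample:
    """One hypothetical task instance (assumed layout, not from the paper)."""
    image_path: str                      # placeholder path to the pre-action image
    action_text: str                     # textual description of the action
    candidate_outcomes: Tuple[str, ...]  # textual descriptions of possible outcomes
    gold_index: int                      # index of the outcome that actually occurs


# A predictor maps an example to the index of the outcome it believes occurs.
Predictor = Callable[[ActionEffectExample], int]


def accuracy(predict: Predictor, examples: Sequence[ActionEffectExample]) -> float:
    """Fraction of examples for which the predicted outcome index is correct."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(predict(ex) == ex.gold_index for ex in examples)
    return correct / len(examples)


def first_candidate_baseline(example: ActionEffectExample) -> int:
    """Trivial stand-in predictor: always choose the first candidate outcome."""
    return 0


# Toy data invented purely for illustration.
toy_examples = [
    ActionEffectExample(
        image_path="scene_0.png",
        action_text="open the fridge",
        candidate_outcomes=("the fridge is open", "the fridge is closed"),
        gold_index=0,
    ),
    ActionEffectExample(
        image_path="scene_1.png",
        action_text="slice the apple",
        candidate_outcomes=("the apple is whole", "the apple is sliced"),
        gold_index=1,
    ),
]

baseline_accuracy = accuracy(first_candidate_baseline, toy_examples)
```

A real multi-modal model would replace `first_candidate_baseline` with a predictor that actually reads the image and the action text; the `accuracy` helper stays the same, which makes it easy to compare text-only and image-plus-text predictors on identical examples.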
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications
