LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations
Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein

TL;DR
This paper introduces a comprehensive benchmark to evaluate large multimodal models' ability to learn from long, diverse demonstrations across various decision-making tasks, revealing current limitations and potential directions for improvement.
Contribution
It provides the first benchmark for assessing multimodal models' in-context learning with very long demonstrations across multiple tasks and modalities, and analyzes factors influencing performance.
Findings
Models rarely reach expert performance even with many demonstrations.
Adding more demonstrations often has little effect on performance.
Encoding observations as text or images and chain-of-thought prompting impact results.
Abstract
In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
