LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Anian Ruoss; Fabio Pardo; Harris Chan; Bonnie Li; Volodymyr Mnih; Tim Genewein

arXiv:2412.01441·cs.AI·May 26, 2025

Open Access · 1 Video

TL;DR

This paper introduces a benchmark that tests whether large multimodal models can learn interactive decision-making tasks from long in-context expert demonstrations (up to one million tokens of context). It finds that current frontier models rarely reach expert performance and often gain little from additional demonstrations.

Contribution

The paper provides the first benchmark for assessing multimodal models' in-context learning from very long demonstrations across multiple tasks and modalities, and it analyzes the factors that influence performance.

Findings

01. Models rarely reach expert performance, even with many demonstrations.

02. Adding more demonstrations often has little effect on performance.

03. How observations are encoded (as text or as images) and whether chain-of-thought prompting is used both affect results.

Abstract

In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context, from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect.…
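
To make the setup in the abstract concrete, the following is a minimal sketch of how expert demonstrations might be placed in a model's context before it is asked to act. It is not the authors' code: the `Episode` type, the `build_prompt` and `demo_counts` helpers, the text-only prompt layout, and the powers-of-two schedule are all assumptions made for illustration. The abstract states only that the number of demonstrations ranges from none to 512 full episodes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    """One expert demonstration (hypothetical type): (observation, action) pairs as text."""
    steps: List[Tuple[str, str]]


def build_prompt(demos: List[Episode], current_observation: str) -> str:
    """Concatenate demonstration episodes, then ask for the next action.

    The layout is an assumed, text-only format; the benchmark also covers
    image observations, which this sketch does not model.
    """
    parts: List[str] = []
    for i, episode in enumerate(demos, start=1):
        parts.append(f"Demonstration episode {i}:")
        for observation, action in episode.steps:
            parts.append(f"Observation: {observation}")
            parts.append(f"Action: {action}")
    parts.append("Current episode:")
    parts.append(f"Observation: {current_observation}")
    parts.append("Action:")
    return "\n".join(parts)


def demo_counts(max_episodes: int = 512) -> List[int]:
    """Zero demonstrations plus powers of two up to max_episodes (assumed schedule)."""
    counts = [0]
    n = 1
    while n <= max_episodes:
        counts.append(n)
        n *= 2
    return counts


if __name__ == "__main__":
    expert = Episode(steps=[("board: empty", "play center")])
    for k in demo_counts(4):
        prompt = build_prompt([expert] * k, "board: empty")
        print(k, len(prompt.splitlines()))
```

With zero demonstrations the prompt contains only the current observation, which corresponds to the no-demonstration end of the range. Each added episode grows the context linearly, which is why many full episodes can push the prompt toward the long-context regime the abstract describes.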

Videos

LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning