Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games
Nicholas R. Waytowich, Devin White, MD Sunbeam, Vinicius G. Goecks

TL;DR
This paper introduces a benchmark to evaluate multimodal large language models as low-level controllers in Atari games, highlighting their current limitations in visual and spatial reasoning compared to traditional methods.
Contribution
It presents a novel benchmark for testing multimodal LLMs as low-level policies in Atari, comparing their performance to RL agents and humans, and analyzing their reasoning capabilities.
Findings
Multimodal LLMs are not yet effective zero-shot low-level policies.
Visual and spatial reasoning limitations hinder LLM performance.
Traditional RL agents outperform multimodal LLMs in Atari tasks.
Abstract
Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. In this paper, we introduce a novel benchmark aimed at testing the emergent capabilities of multimodal LLMs as low-level policies in Atari games. Unlike traditional reinforcement learning (RL) methods that require training for each new environment and reward function specification, these LLMs utilize pre-existing multimodal knowledge to directly engage with game environments. Our study assesses the performances of multiple multimodal LLMs against traditional RL agents, human players, and random agents, focusing on their…
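As a rough illustration of what "low-level policy" means here, the sketch below treats a policy as a function from a single frame to a discrete action index and runs it in an episode loop. Everything in it is an assumption made for illustration and is not taken from the paper: `ToyEnv` is a hypothetical stand-in for an Atari game, `oracle_policy` marks the place where a multimodal LLM call would go, and `normalized_score` follows the common Atari convention of rescaling a score between a random baseline and a reference player, which may differ from the paper's own metric.

```python
import random
from typing import Callable, List, Tuple

Frame = List[int]                 # stand-in for an image; a real frame is a pixel array
Policy = Callable[[Frame], int]   # low-level policy: one frame in, one action index out


class ToyEnv:
    """Hypothetical toy game: reward 1 when the action matches the target shown in the frame."""

    def __init__(self, n_actions: int = 4, horizon: int = 20, seed: int = 0):
        self.n_actions = n_actions
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.target = 0

    def _frame(self) -> Frame:
        # One-hot "image" marking which action currently scores.
        return [1 if i == self.target else 0 for i in range(self.n_actions)]

    def reset(self) -> Frame:
        self.t = 0
        self.target = self.rng.randrange(self.n_actions)
        return self._frame()

    def step(self, action: int) -> Tuple[Frame, float, bool]:
        reward = 1.0 if action == self.target else 0.0
        self.t += 1
        self.target = self.rng.randrange(self.n_actions)
        return self._frame(), reward, self.t >= self.horizon


def run_episode(env: ToyEnv, policy: Policy) -> float:
    """Query the policy once per frame and sum the rewards until the episode ends."""
    frame, total, done = env.reset(), 0.0, False
    while not done:
        frame, reward, done = env.step(policy(frame))
        total += reward
    return total


def make_random_policy(n_actions: int, seed: int = 0) -> Policy:
    """Baseline that ignores the frame and picks a uniformly random action."""
    rng = random.Random(seed)
    return lambda frame: rng.randrange(n_actions)


def oracle_policy(frame: Frame) -> int:
    # Placeholder: a real benchmark would send the frame to a multimodal
    # model here and parse its reply into an action index.
    return frame.index(1)


def normalized_score(agent: float, random_score: float, reference: float) -> float:
    """Rescale so the random baseline maps to 0 and the reference player to 1."""
    return (agent - random_score) / (reference - random_score)
```

Swapping `oracle_policy` for a function that calls a model leaves the loop unchanged, which is the sense in which the model acts as a drop-in controller without per-game training.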
Taxonomy
Topics: Topic Modeling
