Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games

Nicholas R. Waytowich; Devin White; MD Sunbeam; Vinicius G. Goecks

arXiv:2408.15950 · cs.AI · December 3, 2024

Open Access

TL;DR

This paper introduces a benchmark to evaluate multimodal large language models as low-level controllers in Atari games, highlighting their current limitations in visual and spatial reasoning compared to traditional methods.

Contribution

It presents a novel benchmark for testing multimodal LLMs as low-level policies in Atari, comparing their performance to RL agents and humans, and analyzing their reasoning capabilities.

Findings

01. Multimodal LLMs are not yet effective zero-shot low-level policies.

02. Visual and spatial reasoning limitations hinder LLM performance.

03. Traditional RL agents outperform multimodal LLMs in Atari tasks.

Abstract

Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. In this paper, we introduce a novel benchmark aimed at testing the emergent capabilities of multimodal LLMs as low-level policies in Atari games. Unlike traditional reinforcement learning (RL) methods that require training for each new environment and reward function specification, these LLMs utilize pre-existing multimodal knowledge to directly engage with game environments. Our study assesses the performances of multiple multimodal LLMs against traditional RL agents, human players, and random agents, focusing on their…
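To make the idea of an LLM acting as a low-level policy concrete, the following is a minimal sketch of the kind of frame-in, action-out evaluation loop the abstract describes, with a random agent as a baseline. It is not the authors' benchmark code. Everything named here (`ToyEnv`, `ACTIONS`, `run_episode`, `stub_llm_policy`) is hypothetical, the environment is a one-row toy game rather than an Atari emulator, and the "LLM" is a hand-written stub standing in for a real multimodal model call.

```python
import random
from typing import Callable, List, Tuple

Frame = List[List[int]]          # toy stand-in for an Atari screen image
Policy = Callable[[Frame], int]  # maps one frame to a discrete action index

ACTIONS = ["NOOP", "LEFT", "RIGHT"]  # hypothetical action set, not a real Atari one


class ToyEnv:
    """Hypothetical one-row 'catch' game standing in for an Atari emulator."""

    def __init__(self, width: int = 5, steps: int = 20, seed: int = 0):
        self.width, self.steps, self.rng = width, steps, random.Random(seed)

    def _frame(self) -> Frame:
        row = [0] * self.width
        row[self.target] = 1  # target pixel
        row[self.player] = 2  # player pixel (hides the target when aligned)
        return [row]

    def reset(self) -> Frame:
        self.t = 0
        self.player = self.width // 2
        self.target = self.rng.randrange(self.width)
        return self._frame()

    def step(self, action: int) -> Tuple[Frame, float, bool]:
        if ACTIONS[action] == "LEFT":
            self.player = max(0, self.player - 1)
        elif ACTIONS[action] == "RIGHT":
            self.player = min(self.width - 1, self.player + 1)
        reward = 1.0 if self.player == self.target else 0.0
        if reward:
            # Respawn the target after a catch.
            self.target = self.rng.randrange(self.width)
        self.t += 1
        return self._frame(), reward, self.t >= self.steps


def run_episode(env: ToyEnv, policy: Policy) -> float:
    """Query the policy once per frame and return the total episode reward."""
    frame, total, done = env.reset(), 0.0, False
    while not done:
        frame, reward, done = env.step(policy(frame))
        total += reward
    return total


def random_policy(frame: Frame) -> int:
    """Random-agent baseline: ignores the frame entirely."""
    return random.randrange(len(ACTIONS))


def stub_llm_policy(frame: Frame) -> int:
    """Placeholder for a multimodal LLM call (assumption): a real policy would
    send the frame and a prompt to a model and parse an action from its reply."""
    row = frame[0]
    if 1 not in row:  # target hidden under the player, so stay put
        return ACTIONS.index("NOOP")
    player, target = row.index(2), row.index(1)
    return ACTIONS.index("LEFT" if target < player else "RIGHT")


if __name__ == "__main__":
    print("random agent:", run_episode(ToyEnv(), random_policy))
    print("stub policy: ", run_episode(ToyEnv(), stub_llm_policy))
```

The point of the sketch is the interface: the policy receives only the current frame and must return a low-level action at every step, with no environment-specific training. Swapping `stub_llm_policy` for a function that queries an actual model would keep the loop unchanged.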

Taxonomy

Topics: Topic Modeling