TextAtari: 100K Frames Game Playing with Language Agents

Wenhao Li; Wenwu Li; Chuyun Shen; Junjie Sheng; Zixiao Huang; Di Wu; Yun Hua; Wei Yin; Xiangfeng Wang; Hongyuan Zha; Bo Jin

arXiv:2506.04098 · cs.CL · June 11, 2025

TL;DR

TextAtari introduces a challenging benchmark translating Atari game states into text to evaluate language agents on long-horizon decision tasks, revealing significant gaps compared to human performance.

Contribution

The paper presents TextAtari, a text-based benchmark that renders Atari game states as natural-language descriptions in order to evaluate language models on extended decision-making tasks.

Findings

01. Language agents underperform humans in long-term planning tasks.

02. Different agent frameworks and scenarios significantly affect performance.

03. Current models face challenges in sequential reasoning and strategic planning.

Abstract

We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios: Basic, Obscured, Manual…
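To make the state-to-text idea in the abstract concrete, here is a minimal sketch of how labeled state variables might be rendered as a sentence and wrapped in a zero-shot prompt. Everything in it is an assumption for illustration: the function names, the label names (`ball_x`, `player_y`), the prompt wording, and the action list are hypothetical and are not taken from the paper or its repository.

```python
import re
from typing import Dict, List


def describe_state(labels: Dict[str, int]) -> str:
    """Render labeled state variables as one clause per variable.

    The label names are hypothetical placeholders, not the benchmark's schema.
    """
    parts = [
        f"{name.replace('_', ' ')} is {value}"
        for name, value in sorted(labels.items())
    ]
    return "; ".join(parts) + "."


def build_zero_shot_prompt(game: str, state_text: str, actions: List[str]) -> str:
    """Assemble a plain zero-shot prompt (wording is an assumption)."""
    return "\n".join(
        [
            f"You are playing {game}.",
            f"Current state: {state_text}",
            "Available actions: " + ", ".join(actions),
            "Reply with exactly one action name.",
        ]
    )


def parse_action(reply: str, actions: List[str], default: str = "NOOP") -> str:
    """Return the first whole-word legal action in the reply, else the default."""
    legal = set(actions)
    for token in re.findall(r"[A-Z]+", reply.upper()):
        if token in legal:
            return token
    return default


if __name__ == "__main__":
    actions = ["NOOP", "UP", "DOWN"]
    state = describe_state({"ball_x": 120, "player_y": 88})
    print(build_zero_shot_prompt("Pong", state, actions))
    print(parse_action("I will move up", actions))
```

Matching on whole tokens rather than substrings avoids confusing overlapping action names, and falling back to a default action keeps a long episode running when the model's reply cannot be parsed.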

Code & Models

Repositories

Lww007/Text-Atari-Agents (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Games · AI-based Problem Solving and Planning