Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Chengshuai Shi; Wenzhe Li; Xinran Liang; Yizhou Lu; Wenjia Yang; Ruirong Feng; Seth Karten; Ziran Yang; Zihan Ding; Gabriel Sarch; Danqi Chen; Karthik Narasimhan; Chi Jin

arXiv:2605.00347·cs.LG·May 4, 2026

TL;DR

This paper introduces Odysseus, a framework that trains vision-language models (VLMs) with reinforcement learning for long-horizon decision-making (100+ turns) in Super Mario Land, reporting at least 3 times more game progress than previous models along with more stable training.

Contribution

The paper proposes an adapted variant of PPO with a lightweight turn-level critic, which improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++, and it shows that pretrained VLMs are effective in long-horizon game environments.

Findings

1. Odysseus achieves at least 3 times more game progress than previous models.

2. Pretrained VLMs significantly improve sample efficiency and reduce manual action engineering.

3. The framework generalizes well across different game levels and settings.

Abstract

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20-30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We…
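The abstract names a turn-level critic but this excerpt does not say how its value estimates are turned into advantages. As a rough illustration only, the sketch below assumes standard generalized advantage estimation (GAE) applied once per turn instead of once per token. The function name `turn_level_gae`, the default `gamma` and `lam` values, and the assumption that the episode ends after the last turn are all hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def turn_level_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,  # assumed discount factor, not from the paper
    lam: float = 0.95,    # assumed GAE smoothing factor, not from the paper
) -> Tuple[List[float], List[float]]:
    """Compute one advantage and one return per turn with standard GAE.

    rewards[t] is the reward collected during turn t, and values[t] is the
    critic's estimate at the start of turn t. The episode is assumed to
    terminate after the final turn, so the bootstrap value is 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the final turn (terminal state)
    running = 0.0     # discounted sum of TD errors accumulated so far

    # Walk backwards over turns so each advantage reuses the later ones.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]

    # Regression targets for the critic: advantage plus baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = turn_level_gae([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0)
    print(adv, ret)
```

In a setup like this, every token generated within a turn would typically share that turn's advantage, which keeps the critic small relative to a token-level value head. Whether Odysseus does exactly this is not stated in the excerpt.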
