StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

Weihao Tan; Changjiu Jiang; Yu Duan; Mingcong Lei; Jiageng Li; Yitian Hong; Xinrun Wang; Bo An

arXiv:2507.07445 · cs.AI · July 14, 2025

TL;DR

StarDojo is a benchmark built on Stardew Valley that evaluates multimodal large language models on open-ended production and social-interaction tasks, and it exposes the current limitations of state-of-the-art models.

Contribution

Introduces StarDojo, a new benchmark with 1,000 tasks for assessing multimodal agents in complex, open-ended environments combining production and social skills.

Findings

01 GPT-4.1 achieves only a 12.7% success rate

02 Current models struggle with visual understanding and multimodal reasoning

03 The benchmark facilitates research on robust, open-ended agents

Abstract

Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked to perform essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse…
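
A headline number such as a success rate over tasks spread across five domains can be illustrated with a short sketch. The code below assumes that success rate means the fraction of attempted tasks an agent completes; the `(domain, succeeded)` record format, the `success_rates` function, and the demo data are hypothetical and are not taken from the StarDojo codebase.

```python
from collections import defaultdict

# The five task domains named in the abstract.
DOMAINS = ("farming", "crafting", "exploration", "combat", "social")


def success_rates(results):
    """Return (overall_rate, per_domain_rates).

    `results` is an iterable of (domain, succeeded) pairs. This record
    format is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for domain, succeeded in results:
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
        totals[domain] += 1
        wins[domain] += int(succeeded)

    # Rates are computed only for domains that have at least one attempt.
    per_domain = {d: wins[d] / totals[d] for d in totals}
    attempted = sum(totals.values())
    overall = sum(wins.values()) / attempted if attempted else 0.0
    return overall, per_domain


# Made-up outcomes, not results from the paper.
demo = [
    ("farming", True),
    ("farming", False),
    ("combat", False),
    ("social", True),
]
overall, per_domain = success_rates(demo)
```

Reporting per-domain rates next to the overall rate shows whether a low aggregate score comes from one weak domain or from uniformly poor performance.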
