StarBench: A Turn-Based RPG Benchmark for Agentic Multimodal Decision-Making and Information Seeking
Haoran Zhang, Chenhao Zhu, Sicong Guo, Hanzhe Guo, Haiming Li, Donglin Yu

TL;DR
StarBench is a benchmark that evaluates whether vision-language models can play a turn-based RPG as humans do, both in multimodal decision-making and in information seeking. It highlights gaps in current models and their potential for agentic behavior.
Contribution
Introduces StarBench, a comprehensive RPG benchmark for assessing multimodal decision-making and agentic information seeking in realistic game scenarios.
Findings
Current VLMs show significant perception-to-control gaps.
Information seeking improves agent success rates.
Benchmark provides standardized evaluation metrics.
Abstract
Human players do more than press buttons: they ground what they see on screen into precise keyboard-mouse actions and, when stuck, they seek information before trying again. We ask whether current vision-language models (VLMs) can do the same. Despite encouraging results under simplified control or tool scaffolds, human-like play in a real client (mapping raw screenshots to temporally coherent low-level actions while deciding when to ask for guidance) remains an open challenge. We introduce StarBench, a turn-based RPG benchmark derived from Honkai: Star Rail that targets these two human-like competencies: multimodal decision-making from pixels to actions and agentic information seeking. StarBench standardizes evaluation across eight combat tasks and two regimes with shared tasks and metrics: (i) direct control, where agents receive only screenshots and must emit low-level primitives…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization
