OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing

Fuqing Bie; Shiyu Huang; Xijia Tao; Zhiqin Fang; Leyi Pan; Junzhe Chen; Min Ren; Liuyu Xiang; Zhaofeng He

arXiv:2508.04361·cs.AI·September 30, 2025

3 Reviews

TL;DR

OmniPlay is a comprehensive benchmark designed to evaluate and analyze the multi-modal reasoning and fusion capabilities of agentic models across diverse interactive game environments, revealing strengths and weaknesses in current models.

Contribution

Introduces OmniPlay, a novel diagnostic benchmark with interactive game environments to test omni-modal models' reasoning and fusion capabilities across sensory modalities.

Findings

01

Models excel in memory tasks but struggle with reasoning and planning.

02

Fusion mechanisms are brittle and cause performance failures.

03

Removing sensory input can sometimes improve model performance.

Abstract

While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper presents a novel benchmark that probes whether omni-modal models make effective use of different modalities. The use of targeted games is also key: as the authors mention, it allows for better experiment design, whereas traditional benchmarks fail to test model capabilities in both dynamic and interactive worlds. The stated goal to "explicitly address the foundational challenges of synergistic fusion, conflict arbitration, and resilient reasoning" is an important impactful targe

Weaknesses

While the paper tackles significant and timely issues and proposes an original benchmark with potential, the current manuscript suffers from inadequate presentation and from soundness issues in some of its conclusions. First, the presentation seems backwards: the paper starts discussing specific experiment details without having described the games or the core benchmark design (beyond some principles), and without showing overall results. This makes the paper quite hard to int

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The motivation of the paper is clear: there is a lack of benchmarks that test agency in a rich, multi-sensory environment.
- There is a solid open-source contribution for both the environments and the evaluation protocols, and the appendix provides extensive details.
- The paper is well-written.

Weaknesses

- Some tasks, while creative, appear to test very specific, narrow forms of reasoning. The Alchemist's Melody (rule discovery) can be reduced to a trial-and-error association problem where no reasoning is really involved. The Myriad Echoes and the Whispered Pathfinding are essentially complex perception-and-grounding tasks. They are not really testing "strategic planning" and "robust reasoning".
- The paper heavily contrasts "brittle reasoning" with "superhuman memory". This "dichotomy" is not r

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* This paper tackles the inadequacy of current benchmarks, which are either static (lacking agency) or interactive but modally limited (ignoring audio, etc.), by introducing an interactive benchmark designed for *omni-modal* agents using image, video, audio, and text.
* It introduces five distinct, newly developed game environments, each crafted to test different capabilities (e.g., navigation, sequence replication, abstract reasoning, strategy) under varying modality combinations and complexit

Weaknesses

- The games and scenarios involving modality complementarity and conflict are custom-designed based on the authors' principles. The "naturalness" or representativeness of these specific interaction patterns for general real-world tasks could be debated.
- The paper highlights "superhuman memory" in the Myriad Echoes task. However, the qualitative analysis suggests this is largely due to the AI's perfect recall compared to human cognitive limits on working memory for long, arbitrary sequences, r
