TL;DR
OmniPlay is a comprehensive benchmark designed to evaluate and analyze the multi-modal reasoning and fusion capabilities of agentic models across diverse interactive game environments, revealing strengths and weaknesses in current models.
Contribution
Introduces OmniPlay, a novel diagnostic benchmark with interactive game environments to test omni-modal models' reasoning and fusion capabilities across sensory modalities.
Findings
Models excel in memory tasks but struggle with reasoning and planning.
Fusion mechanisms are brittle and cause performance failures.
Removing sensory input can sometimes improve model performance.
Abstract
While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
The paper presents a novel benchmark to probe how effectively omni-modal models actually use different modalities. The use of targeted games is also key: as the authors note, it allows for better experiment design, whereas traditional benchmarks fail to test model capabilities in both dynamic and interactive worlds. The stated goal to "explicitly address the foundational challenges of synergistic fusion, conflict arbitration, and resilient reasoning" is an important impactful targe
While the paper tackles significant and timely issues and proposes an original benchmark with potential, the current manuscript suffers from presentation that is not fully adequate and from soundness issues in some of its conclusions. First, the paper's presentation seems a bit backwards. It starts discussing specific experiment details without having really described the games or the core benchmark design (beyond just some principles), and without showing overall results. This makes the paper quite hard to int
- The motivation of the paper is clear: there is a lack of benchmarks that test agency in a rich, multi-sensory environment.
- There is a solid open-source contribution for both the environments and evaluation protocols, and the appendix shows extensive details.
- The paper is well-written.
- Some tasks, while creative, appear to test very specific, narrow forms of reasoning. The Alchemist's Melody (rule discovery) can be reduced to a trial-and-error association problem where no reasoning is really involved. The Myriad Echoes and the Whispered Pathfinding are essentially complex perception-and-grounding tasks. They are not really testing "strategic planning" and "robust reasoning".
- The paper heavily contrasts "brittle reasoning" with "superhuman memory". This "dichotomy" is not r
* This paper tackles the inadequacy of current benchmarks, which are either static (lacking agency) or interactive but modally limited (ignoring audio, etc.), by introducing an interactive benchmark designed for *omni-modal* agents using image, video, audio, and text.
* It introduces five distinct, newly developed game environments, each crafted to test different capabilities (e.g., navigation, sequence replication, abstract reasoning, strategy) under varying modality combinations and complexit
- The games and scenarios involving modality complementarity and conflict are custom-designed based on the authors' principles. The "naturalness" or representativeness of these specific interaction patterns for general real-world tasks could be debated.
- The paper highlights "superhuman memory" in the Myriad Echoes task. However, the qualitative analysis suggests this is largely due to the AI's perfect recall compared to human cognitive limits on working memory for long, arbitrary sequences, r
