Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings
Alexia Jolicoeur-Martineau

TL;DR
This paper introduces AVR-Eval, a new audio-visual recording-based metric, and AVR-Agent, a multi-agent system for generating interactive multimedia content, demonstrating improved content quality evaluation and generation in video games.
Contribution
The paper presents a novel multi-modal evaluation metric and a multi-agent content generation system that leverages audio-visual recordings for improved multimedia content creation.
Findings
AVR-Eval accurately identifies content quality differences.
AVR-Agent generates higher-quality game content than one-shot methods.
Current models struggle to effectively utilize custom assets and AVR feedback.
Abstract
While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months (multi-shot, multi-agents) using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality using Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two contents, with a text model reviewing evaluations to determine superiority. We show that AVR-Eval properly identifies good from broken or mismatched content. We built AVR-Agent, a multi-agent system generating JavaScript code from a bank…
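The two-stage comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `AVR` record, the `avr_eval` function, and both toy model stubs are hypothetical names introduced here. A real system would call an omni-modal model on actual recordings and a text model on its written evaluation. The sketch assumes only what the abstract states: one model compares the two recordings, and a second model reviews that evaluation to pick the better content.

```python
from dataclasses import dataclass
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]


@dataclass(frozen=True)
class AVR:
    """Hypothetical stand-in for one audio-visual recording."""
    path: str
    has_audio: bool
    renders: bool


def avr_eval(
    task: str,
    a: AVR,
    b: AVR,
    omni_compare: Callable[[str, AVR, AVR], str],
    text_review: Callable[[str], Verdict],
) -> Verdict:
    """Relative comparison of two contents (assumed two-stage structure)."""
    # Stage 1: the omni-modal model inspects both recordings and writes an evaluation.
    evaluation = omni_compare(task, a, b)
    # Stage 2: the text model reads that evaluation and decides which content is better.
    return text_review(evaluation)


def toy_omni_compare(task: str, a: AVR, b: AVR) -> str:
    """Toy stub: scores a recording by whether it renders and has audio."""
    def score(r: AVR) -> int:
        return int(r.renders) + int(r.has_audio)
    return f"task={task}; score_A={score(a)}; score_B={score(b)}"


def toy_text_review(evaluation: str) -> Verdict:
    """Toy stub: parses the scores out of the written evaluation."""
    fields = dict(part.split("=", 1) for part in evaluation.split("; "))
    score_a, score_b = int(fields["score_A"]), int(fields["score_B"])
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


good = AVR("good.mp4", has_audio=True, renders=True)
broken = AVR("broken.mp4", has_audio=False, renders=False)
print(avr_eval("a bouncing-ball animation", good, broken,
               toy_omni_compare, toy_text_review))  # prints "A"
```

Because the metric is relative, the result is a preference between two contents rather than an absolute score. Swapping the argument order in the toy example flips the verdict to "B".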
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper first proposes a new evaluation metric AVR-Eval that uses audio-visual recordings and omni-modal reasoning models to assess multimedia content.
- AVR-Agent introduces a well-designed multi-agent framework for JavaScript-based multimedia content generation by leveraging a bank of multimedia assets.
- The combination of AVR-Eval and AVR-Agent provides an effective framework for establishing a closed loop of generation, evaluation, and refinement.
- The benchmark contains only 10 simple tasks (5 animations, 5 games), which is insufficient to support broad claims about model performance or generalization. The proposed method could be further evaluated on complex 3D games and long-term interactive tasks to verify its effectiveness.
- The authors only discuss why FVD is unsuitable in the introduction, without providing any comparative experiments. In addition, the paper does not compare AVR-Eval with other common evaluation met
1. Addresses a Critical Bottleneck: The paper tackles a core challenge in generative AI for interactive content: the lack of scalable, automated evaluation. Human-in-the-loop evaluation (like WebDev Arena) is a major bottleneck, and the idea of using an omni-modal model to "watch and listen" to content is a novel and important research direction.
2. Novelty of the Metric Concept: The AVR-Eval metric moves beyond static code analysis or simple screenshot evaluation. By incorporating audio and vi
1. Experimental Circularity: The entire experimental setup is critically circular. The AVR-Agent uses an omni-modal model (Qwen2.5-Omni-7B) to provide feedback for improvement. The AVR-Eval metric then uses the exact same model (Qwen2.5-Omni-7B) to judge the final quality. The agent is, therefore, being optimized to satisfy the biases of its own evaluator. The paper does not demonstrate that the agent is producing objectively better games; it only demonstrates that it is getting better at pleasi
1. Targets an interesting goal: automated game design.
2. Each component of AVR-Eval and AVR-Agent is evaluated carefully to demonstrate its effectiveness.
1. Overall, the paper is largely an engineering effort in designing pipelines and prompts for AVR-Eval and AVR-Agent, and lacks a main technical contribution.
2. Little related work is discussed, so it is hard to place the paper in the existing literature.
3. While AVR-Eval is intuitive and the ablation is convincing, there is no study of alignment with human raters.
4. The benchmark uses five games and five animations. Many results may not carry over to richer game loops, content pipelines, or larger engine u
Taxonomy
Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
