TL;DR
PlayCoder is a multi-agent framework that iteratively repairs LLM-generated GUI application code, targeting the logic errors that keep current LLMs from producing logically consistent GUI programs.
Contribution
The paper presents PlayCoder, a framework that improves LLM-generated GUI code through iterative repair and evaluation, together with a new benchmark (PlayEval) and a new evaluation metric (Play@k).
Findings
LLMs achieve near-zero Play@3 on GUI tasks without repair.
PlayCoder improves correctness, reaching up to 38.1% Exec@3.
Traditional metrics miss silent logic bugs that PlayCoder can detect and fix.
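To make the "silent logic bug" finding concrete, here is a minimal Python sketch of the difference between an execution-style check and an interaction-flow check. Everything in it is hypothetical and for illustration only: `CounterGame`, `runs_without_error`, and `flow_is_correct` are invented names, and the code does not reproduce PlayEval, PlayCoder, or the paper's definition of Exec@k or Play@k.

```python
from dataclasses import dataclass


@dataclass
class CounterGame:
    """Hypothetical toy stand-in for a GUI app: a score counter with a reset button."""
    score: int = 0

    def handle(self, event: str) -> None:
        if event == "click_plus":
            self.score += 1
        elif event == "click_reset":
            # Silent logic bug (deliberate): reset should set the score to 0,
            # but this branch leaves it unchanged and raises no error.
            self.score = self.score


def runs_without_error(app_factory, events) -> bool:
    """Execution-style check: passes as long as no exception is raised."""
    app = app_factory()
    try:
        for event in events:
            app.handle(event)
    except Exception:
        return False
    return True


def flow_is_correct(app_factory, steps) -> bool:
    """Interaction-flow check: verifies the state after every user action."""
    app = app_factory()
    for event, expected_score in steps:
        app.handle(event)
        if app.score != expected_score:
            return False
    return True


# A short interaction flow: (user action, expected score after the action).
flow = [("click_plus", 1), ("click_plus", 2), ("click_reset", 0), ("click_plus", 1)]

executes = runs_without_error(CounterGame, [event for event, _ in flow])
plays = flow_is_correct(CounterGame, flow)
print(executes, plays)  # True False
```

The program runs to completion, so a crash-only check reports success, while stepping through the flow and checking the state after each action exposes the broken reset.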
Abstract
Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric…
