You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation
Yutong Bian, Xianhao Lin, Yupeng Xie, Tianyang Liu, Mingchen Zhuge, Siyuan Lu, Haoming Tang, Jinlin Wang, Jiayi Zhang, Jiaqi Chen, Xiangru Tang, Yongxin Ni, Sirui Hong, Chenglin Wu

TL;DR
This paper introduces RealDevWorld, an evaluation framework that uses GUI-based interactions and diverse tasks to automatically assess the quality of software generated by LLMs, capturing real-world usability.
Contribution
It presents a novel, automated, end-to-end evaluation system combining diverse tasks and GUI interaction simulation to assess production-ready software from LLMs.
Findings
Achieves 0.92 accuracy in automatic evaluation
Correlates 0.85 with human assessments
Reduces manual review in software evaluation
Abstract
Large Language Models (LLMs) and code agents in software development are rapidly evolving from generating isolated code snippets to producing full-fledged software applications with graphical interfaces, interactive logic, and dynamic behaviors. However, current benchmarks fall short in evaluating such production-ready software, as they often rely on static checks or binary pass/fail scripts, failing to capture the interactive behaviors and runtime dynamics that define real-world usability, qualities that only emerge when an application is actively used. This is the blind spot of current evaluation: you don't know if an app works until you click through it, interact with it, and observe how it responds. To bridge this gap, we introduce RealDevWorld, a novel evaluation framework for automated end-to-end assessment of LLMs' ability to generate production-ready repositories from scratch.…
Taxonomy
Topics: Software Testing and Debugging Techniques · Real-time simulation and control systems · Software System Performance and Reliability
