Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging
Zhilin Liu, Ye Huang, Ting Xie, Ruizhi Zhang, Wen Li, Lixin Duan

TL;DR
This paper introduces a new benchmark and a vision-feedback system for improving GUI code generation and debugging using visual information, addressing limitations of text-based methods.
Contribution
The paper presents InteractGUI Bench, a benchmark for evaluating GUI tasks, and VF-Coder, a visual feedback system that improves debugging by perceiving visual cues in the rendered interface.
Findings
VF-Coder improves success rate from 21.68% to 28.29%.
VF-Coder raises visual score from 0.4284 to 0.5584.
Benchmark enables fine-grained evaluation of GUI interaction and visual structure.
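To put the two reported improvements on a common footing, the short sketch below computes the absolute and relative gains from the figures listed above. The helper name `gains` is hypothetical and introduced only for illustration. This summary does not say which baseline the starting numbers belong to, so the code assumes nothing about it.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain) between two reported values."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Figures taken directly from the Findings list above.
sr_abs, sr_rel = gains(21.68, 28.29)    # success rate, in percent
vs_abs, vs_rel = gains(0.4284, 0.5584)  # visual score

print(f"Success rate: +{sr_abs:.2f} points ({sr_rel:.1%} relative)")
# prints: Success rate: +6.61 points (30.5% relative)
print(f"Visual score: +{vs_abs:.4f} ({vs_rel:.1%} relative)")
# prints: Visual score: +0.1300 (30.3% relative)
```

Both metrics improve by roughly 30% in relative terms, even though the absolute success rate remains below 30%.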
Abstract
Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g., command-line outputs) for multi-round debugging and struggle with graphical user interfaces (GUIs) that involve visual information. This is mainly due to two limitations: 1) GUI programs are event-driven, yet existing methods cannot simulate user interactions to trigger GUI element logic; 2) GUI programs possess visual attributes, making it difficult for text-based approaches to assess whether the rendered interface meets user needs. To systematically address these challenges, we first introduce InteractGUI Bench, a novel benchmark comprising 984 commonly used real-world desktop GUI application tasks designed for fine-grained evaluation of both interaction logic and visual structure. Furthermore, we…
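The first limitation named in the abstract, that event-driven logic stays dormant until a user interaction fires it, can be illustrated with a toy example. This is a minimal sketch and not the paper's method: `ToyButton` and `build_counter_app` are hypothetical stand-ins for a real GUI toolkit, chosen so the example runs without a display.

```python
from typing import Callable, Dict, List, Tuple


class ToyButton:
    """Hypothetical stand-in for a GUI widget; not a real toolkit class."""

    def __init__(self, label: str) -> None:
        self.label = label
        self._handlers: List[Callable[[], None]] = []

    def on_click(self, handler: Callable[[], None]) -> None:
        # Register a callback; nothing runs at registration time.
        self._handlers.append(handler)

    def click(self) -> None:
        # Simulate the user interaction that dispatches the event.
        for handler in self._handlers:
            handler()


def build_counter_app() -> Tuple[Dict[str, int], ToyButton]:
    """Build a one-button app whose logic lives entirely in a handler."""
    state = {"count": 0}
    button = ToyButton("Increment")

    def increment() -> None:
        state["count"] += 1

    button.on_click(increment)
    return state, button


state, button = build_counter_app()

# Constructing the app produces no text output and runs no handler code,
# so command-line feedback alone says nothing about the click logic.
count_before = state["count"]

# Only a simulated interaction exercises the handler.
button.click()
count_after = state["count"]
```

Here `count_before` is 0 and `count_after` is 1: the handler's behavior becomes observable only after the click is dispatched, which is why a debugging loop limited to start-up output cannot check it.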
