See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, Xiangfeng Wang

TL;DR
This paper introduces ScratchWorld, a comprehensive benchmark for evaluating multimodal GUI agents in Scratch, highlighting the challenges in visuomotor control and reasoning in low-code educational environments.
Contribution
The paper presents ScratchWorld, a benchmark of 83 curated tasks across four categories (Create, Debug, Extend, and Compute) with two interaction modes, primitive and composite, for assessing multimodal GUI agents in the Scratch programming environment.
Findings
Current models show a significant gap between reasoning about a program and acting on it through the GUI.
Fine-grained visuomotor control remains a major challenge.
The benchmark's two interaction modes enable detailed diagnosis of agent failures.
Abstract
Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an…
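The difference between the two interaction modes can be illustrated with a minimal sketch. This is not ScratchWorld's actual API: the class names, the `lower_to_primitive` helper, and all coordinates are hypothetical, chosen only to show how a semantic action hides the pixel-level decision that a primitive action exposes.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class DragAndDrop:
    """Hypothetical primitive action: the agent picks pixel coordinates itself."""
    start: Point
    end: Point


@dataclass(frozen=True)
class AddBlock:
    """Hypothetical composite action: the agent names a block and its parent."""
    opcode: str
    parent_id: str


def lower_to_primitive(
    action: AddBlock,
    palette: Dict[str, Point],
    snap_points: Dict[str, Point],
) -> DragAndDrop:
    """Resolve a semantic action into a drag using known screen positions.

    In composite mode this lookup would be done by the environment; in
    primitive mode the agent has to infer both points from the screenshot.
    """
    return DragAndDrop(
        start=palette[action.opcode],
        end=snap_points[action.parent_id],
    )


# Illustrative screen layout (made-up coordinates).
palette = {"motion_movesteps": (40, 120)}
snap_points = {"hat1": (400, 260)}

drag = lower_to_primitive(AddBlock("motion_movesteps", "hat1"), palette, snap_points)
print(drag)
```

Under these assumptions, a failure in composite mode points to program reasoning (the wrong block or parent was chosen), while a failure that appears only in primitive mode points to visuomotor control (the right block was chosen but the drag missed its target).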
Taxonomy
Topics: Teaching and Learning Programming · Artificial Intelligence in Games · Intelligent Tutoring Systems and Adaptive Learning
