See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch

Xingyi Zhang; Yulei Ye; Kaifeng Huang; Wenhao Li; Xiangfeng Wang

arXiv:2602.10814·cs.AI·February 12, 2026

Open Access

TL;DR

This paper introduces ScratchWorld, a comprehensive benchmark for evaluating multimodal GUI agents in Scratch, highlighting the challenges in visuomotor control and reasoning in low-code educational environments.

Contribution

The paper presents ScratchWorld, a novel benchmark with diverse tasks and evaluation modes to assess multimodal GUI agents in Scratch programming environments.

Findings

01. Significant reasoning-acting gap identified in current models.

02. Fine-grained visuomotor control remains a major challenge.

03. Benchmark enables detailed diagnosis of agent failures.

Abstract

Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an…
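
The split between the two interaction modes can be pictured with a small sketch. This is only an illustration under stated assumptions, not ScratchWorld's actual interface: the class names `DragAndDrop` and `ConnectBlocks`, their fields, and the helper `needs_visuomotor_control` are all hypothetical. Only the four category names and the primitive/composite distinction come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Category(Enum):
    # The four problem categories named in the abstract.
    CREATE = "Create"
    DEBUG = "Debug"
    EXTEND = "Extend"
    COMPUTE = "Compute"


@dataclass(frozen=True)
class DragAndDrop:
    """Hypothetical primitive-mode action: raw screen coordinates."""
    start_xy: tuple[int, int]
    end_xy: tuple[int, int]


@dataclass(frozen=True)
class ConnectBlocks:
    """Hypothetical composite-mode action: semantic block references."""
    block: str
    parent: str


Action = Union[DragAndDrop, ConnectBlocks]


def needs_visuomotor_control(action: Action) -> bool:
    # A primitive action succeeds only if the pointer lands on the right pixels;
    # a composite action names the blocks and leaves GUI execution to the harness.
    return isinstance(action, DragAndDrop)


# Illustrative values only; coordinates and block labels are made up.
primitive = DragAndDrop(start_xy=(120, 340), end_xy=(480, 260))
composite = ConnectBlocks(block="move 10 steps", parent="when flag clicked")
print(needs_visuomotor_control(primitive), needs_visuomotor_control(composite))
```

Under this framing, an agent that solves a task with composite actions but fails it with primitive ones is likely limited by GUI execution rather than by program reasoning, which is the kind of diagnosis the abstract says the two modes are meant to support.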

Taxonomy

Topics: Teaching and Learning Programming · Artificial Intelligence in Games · Intelligent Tutoring Systems and Adaptive Learning