ScratchEval: A Multimodal Evaluation Framework for LLMs in Block-Based Programming
Yuan Si, Simeng Han, Daming Li, Hanyuan Shi, Jialu Zhang

TL;DR
ScratchEval is a comprehensive benchmark designed to evaluate the ability of large language models to understand, debug, and repair Scratch programs, addressing the unique challenges of block-based, event-driven programming.
Contribution
This work introduces the first executable benchmark for LLMs on Scratch, including a detailed evaluation protocol and curated dataset focusing on complex, real-world projects.
Findings
LLMs show limited performance on Scratch-specific tasks
Fine-tuning improves repair accuracy, but challenges remain
Benchmark enables systematic assessment of model understanding and repair quality
Abstract
LLMs have achieved strong performance on text-based programming tasks, yet they remain unreliable for block-based languages such as Scratch. Scratch programs exhibit deeply nested, non-linear structures, event-driven concurrency across multiple sprites, and tight coupling between code and multimedia assets, properties that differ fundamentally from textual code. As a result, LLMs often misinterpret Scratch semantics and generate large, invasive edits that are syntactically valid but semantically incorrect when repairing buggy programs. We introduce ScratchEval, the first executable benchmark designed to evaluate LLM-based repair for Scratch programs, covering program understanding, debugging, analysis, and repair. The benchmark contains 100 curated Scratch projects from the public repository, selected for structural and semantic complexity. Each project is paired with executable test…
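To make the "nested, non-linear" point concrete, here is a minimal Python sketch of how a Scratch 3 project stores code. It is an illustration only and not part of ScratchEval. It assumes the usual layout of a sprite's "blocks" map in project.json, where each block is keyed by an id and linked through "next", "parent", and "inputs". The block ids ("hat", "loop", "move") and the helper names walk and scripts are hypothetical.

```python
# Toy fragment shaped like the "blocks" map of one sprite in a Scratch 3
# project.json. The block ids are made up for illustration.
blocks = {
    "hat": {"opcode": "event_whenflagclicked", "next": "loop", "parent": None,
            "inputs": {}, "topLevel": True},
    "loop": {"opcode": "control_repeat", "next": None, "parent": "hat",
             "inputs": {"TIMES": [1, [6, "10"]], "SUBSTACK": [2, "move"]},
             "topLevel": False},
    "move": {"opcode": "motion_movesteps", "next": None, "parent": "loop",
             "inputs": {"STEPS": [1, [4, "10"]]}, "topLevel": False},
}


def walk(blocks, block_id, depth=0):
    """Yield (depth, opcode) for one stack, descending into nested bodies."""
    while block_id is not None:
        block = blocks[block_id]
        yield depth, block["opcode"]
        for name, value in block["inputs"].items():
            # Only follow C-shaped block bodies (SUBSTACK, SUBSTACK2) here.
            # A nested block is referenced by its id, a string in slot 1.
            if name.startswith("SUBSTACK") and isinstance(value[1], str):
                yield from walk(blocks, value[1], depth + 1)
        block_id = block["next"]


def scripts(blocks):
    """Each top-level block starts its own independent script."""
    return [list(walk(blocks, bid)) for bid, b in blocks.items() if b["topLevel"]]


for script in scripts(blocks):
    for depth, opcode in script:
        print("  " * depth + opcode)
```

Even in this three-block example the program is a graph of id references rather than a linear text file. A repair tool has to keep "next", "parent", and the input slots consistent with one another, which is one plausible reason that large edits can stay syntactically valid while changing behavior.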
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Logic, programming, and type systems
