Raven: Rethinking Automated Assessment for Scratch Programs via Video-Grounded Evaluation
Donglin Li, Daming Li, Hanyuan Shi, Jialu Zhang

TL;DR
Raven is an automated assessment framework for Scratch programs that combines video analysis with large language models to grade student work on task-level behaviors, with the aim of improving scalability and accuracy.
Contribution
The paper presents Raven, a new assessment system that replaces brittle test cases with video-grounded evaluation rules, enabling consistent grading across diverse Scratch implementations.
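To make the idea of rules shared across submissions concrete, here is a minimal sketch. It is not Raven's implementation: the `Rule` and `grade` names, the rubric entries, and the use of plain strings as "observations" are all assumptions made for illustration. In particular, the string matching stands in for the video analysis and language-model judgment that the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: an "observation" is a short text label for something visible
# in a recording of the running program. Plain strings stand in for the
# video-analysis and LLM step, which is not modeled here.
Observation = str


@dataclass(frozen=True)
class Rule:
    """A task-level rule (hypothetical shape), written once per assignment."""
    name: str
    check: Callable[[List[Observation]], bool]


def grade(rules: List[Rule], observed: List[Observation]) -> Dict[str, bool]:
    """Apply the same task-level rules to one submission's observations."""
    return {rule.name: rule.check(observed) for rule in rules}


# One rubric reused for every submission (illustrative rule names).
RUBRIC = [
    Rule("cat moves right on key press", lambda obs: "cat moves right" in obs),
    Rule("score increases after catch", lambda obs: "score increases" in obs),
]

# Submissions a and b could be built from different blocks internally,
# but they show the same required behaviors on screen.
submission_a = ["cat moves right", "score increases"]
submission_b = ["score increases", "cat moves right", "sound plays"]
submission_c = ["cat moves right"]  # never shows the score changing

results = {
    "a": grade(RUBRIC, submission_a),
    "b": grade(RUBRIC, submission_b),
    "c": grade(RUBRIC, submission_c),
}
```

Because the rules refer only to observed behavior, submissions a and b receive identical results despite differing in order and extra effects, while submission c fails the second rule. No rule mentions a variable name or sprite internals, which is the contrast with program-specific state assertions.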
Findings
Raven outperforms prior tools in grading accuracy and robustness.
The system is effective across diverse programming styles and interaction sequences.
A classroom study shows high user acceptance and practical utility.
Abstract
Block-based programming environments such as Scratch are widely used in introductory computing education, yet scalable and reliable automated assessment remains elusive. Scratch programs are highly heterogeneous, event-driven, and visually grounded, which makes traditional assertion-based or test-based grading brittle and difficult to scale. As a result, assessment in real Scratch classrooms still relies heavily on manual inspection and delayed feedback, introducing inconsistency across instructors and limiting scalability. We present Raven, an automated assessment framework for Scratch that replaces program-specific state assertions with instructor-specified, task-level video generation rules shared across all student submissions. Raven integrates large language models with video analysis to evaluate whether a program's observed visual and interactive behaviors satisfy grading…
