ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges
Rao Fu, Ziyang Luo, Hongzhan Lin, Zhen Ye, Jing Ma

TL;DR
ScratchEval is a new benchmark that assesses how well large multimodal models (LMMs) understand and reason about visual programming tasks in Scratch, addressing limitations of previous image-to-code evaluations.
Contribution
The paper introduces ScratchEval, a comprehensive benchmark for evaluating LMMs' visual programming reasoning using Scratch, combining visual understanding with code logic.
Findings
LMMs struggle with integrated visual and logical reasoning tasks.
ScratchEval reveals gaps in current multimodal models' programming understanding.
The benchmark encourages the development of models that better integrate logical and visual reasoning.
Abstract
Recent advancements in large multimodal models (LMMs) have showcased impressive code generation capabilities, primarily evaluated through image-to-code benchmarks. However, these benchmarks are limited to specific visual programming scenarios where the logic reasoning and the multimodal understanding capacities are split apart. To fill this gap, we propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education. By integrating visual elements and embedded programming logic, ScratchEval requires the model to process both visual information and code structure, thereby comprehensively evaluating its programming intent understanding ability. Our evaluation approach goes beyond the traditional image-to-code mapping and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning · Machine Learning and Data Classification
