ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges

Rao Fu; Ziyang Luo; Hongzhan Lin; Zhen Ye; Jing Ma

arXiv:2411.18932 · cs.CL · December 2, 2024

TL;DR

ScratchEval is a new benchmark that assesses large multimodal models' ability to understand and reason about visual programming tasks using Scratch, addressing limitations of previous image-to-code evaluations.

Contribution

The paper introduces ScratchEval, a comprehensive benchmark for evaluating LMMs' visual programming reasoning using Scratch, combining visual understanding with code logic.

Findings

01. LMMs struggle with tasks that require integrated visual and logical reasoning.

02. ScratchEval reveals gaps in current multimodal models' programming understanding.

03. The benchmark encourages the development of models that better integrate logical and visual reasoning.

Abstract

Recent advancements in large multimodal models (LMMs) have showcased impressive code generation capabilities, primarily evaluated through image-to-code benchmarks. However, these benchmarks are limited to specific visual programming scenarios where the logic reasoning and the multimodal understanding capacities are split apart. To fill this gap, we propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education. By integrating visual elements and embedded programming logic, ScratchEval requires the model to process both visual information and code structure, thereby comprehensively evaluating its programming intent understanding ability. Our evaluation approach goes beyond the traditional image-to-code mapping and…

Code & Models

Repositories

hkbunlp/scratcheval (official)

Videos

ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges

Taxonomy

Topics: Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning · Machine Learning and Data Classification