Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming

Victor-Alexandru Pădurean; Adish Singla

arXiv:2406.09891 · cs.AI · March 19, 2025

TL;DR

This paper introduces a new benchmark for evaluating generative models on elementary-level visual programming tests, revealing current models' limitations and proposing a synthetic data fine-tuning approach to improve their problem-solving skills.

Contribution

The paper presents a novel benchmark for computational thinking in elementary visual programming and a synthetic data generation method to enhance model performance.

Findings

01. State-of-the-art models such as GPT-4o and Llama3 barely match the performance of an average school student.

02. Fine-tuning on synthetic data improves the models' problem-solving skills.

03. The benchmark and datasets will be publicly released for further research.

Abstract

Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge. Despite these promising results on competitive benchmarks, they still struggle with seemingly simple problem-solving tasks typically carried out by elementary-level students. How do state-of-the-art models perform on standardized programming-related tests designed to assess computational thinking and problem-solving skills at schools? In this paper, we curate a novel benchmark involving computational thinking tests grounded in elementary visual programming domains. Our initial results show that state-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student. To further boost the performance of these models, we fine-tune them using a novel synthetic data generation methodology. The key idea is…

Taxonomy

Topics: Teaching and Learning Programming