Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming
Victor-Alexandru Pădurean, Adish Singla

TL;DR
This paper introduces a new benchmark for evaluating generative models on elementary-level visual programming tests, revealing current models' limitations and proposing a synthetic data fine-tuning approach to improve their problem-solving skills.
Contribution
The paper presents a novel benchmark for computational thinking in elementary visual programming and a synthetic data generation method to enhance model performance.
Findings
State-of-the-art models such as GPT-4o and Llama3 barely match the performance of an average school student.
Synthetic data fine-tuning improves model problem-solving skills.
Benchmark and datasets will be publicly released for further research.
Abstract
Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge. Despite these promising results on competitive benchmarks, they still struggle with seemingly simple problem-solving tasks typically carried out by elementary-level students. How do state-of-the-art models perform on standardized programming-related tests designed to assess computational thinking and problem-solving skills at schools? In this paper, we curate a novel benchmark involving computational thinking tests grounded in elementary visual programming domains. Our initial results show that state-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student. To further boost the performance of these models, we fine-tune them using a novel synthetic data generation methodology. The key idea is…
Taxonomy
Topics: Teaching and Learning Programming
