Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment

Chao Wen; Jacqueline Staub; Adish Singla

arXiv:2406.11334·cs.AI·October 7, 2025

TL;DR

This paper introduces a new program synthesis benchmark in the XLogoOnline environment, evaluates current models' performance, and proposes fine-tuning methods to significantly improve results on complex visual programming tasks.

Contribution

It presents a novel benchmark combining multiple skills, evaluates state-of-the-art models, and develops a fine-tuning pipeline with curriculum learning to enhance model performance.

Findings

1. GPT-4V achieves a 20% success rate.
2. Llama3-70B achieves a 2.35% success rate.
3. Fine-tuning with synthetic data and a curriculum improves performance.

Abstract

Large language and multimodal models have shown remarkable success on various benchmarks focused on specific skills such as general-purpose programming, math word problem-solving, and visual question answering. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on the real-world tasks in the XLogoOnline visual programming environment. Each task requires a combination of different skills such as spatial planning, basic programming, and logical reasoning. Our evaluation shows that current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving only 20% and 2.35% success rates, respectively. Next, we develop a fine-tuning pipeline to boost the performance of models by leveraging a large-scale synthetic training dataset with over…
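A success rate of this kind can be read as the fraction of tasks whose predicted program passes a binary check in an emulator. The following is a minimal sketch under stated assumptions, not the benchmark's actual evaluator: the grid world, the `fd`/`lt`/`rt` command set, and the names `Task`, `solves`, and `success_rate` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed headings in clockwise order: north, east, south, west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


@dataclass(frozen=True)
class Task:
    """Hypothetical task: reach `goal` from `start` on a size x size grid."""
    start: tuple[int, int]
    goal: tuple[int, int]
    size: int = 3  # assumed default grid size


def solves(task: Task, program: list[str]) -> bool:
    """Return True if the program ends on the goal cell without leaving the grid."""
    (x, y), heading = task.start, 0  # the turtle is assumed to start facing north
    for cmd in program:
        if cmd == "fd":  # move one cell forward
            dx, dy = HEADINGS[heading]
            x, y = x + dx, y + dy
            if not (0 <= x < task.size and 0 <= y < task.size):
                return False
        elif cmd == "rt":  # turn 90 degrees right
            heading = (heading + 1) % 4
        elif cmd == "lt":  # turn 90 degrees left
            heading = (heading - 1) % 4
        else:
            return False  # an unknown command counts as a failure
    return (x, y) == task.goal


def success_rate(tasks: list[Task], programs: list[list[str]]) -> float:
    """Percentage of tasks whose predicted program passes the binary check."""
    passed = sum(solves(t, p) for t, p in zip(tasks, programs))
    return 100.0 * passed / len(tasks)
```

For example, with two tasks of which only the first is solved, `success_rate` returns `50.0`. The real tasks involve richer goals and constraints than reaching a single cell, so this only illustrates the shape of the metric.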

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. This paper identifies a new task that current SotA models fail on.

Weaknesses

1. This paper provides few new insights about identifying models' limitations: it shows that existing models fail at visual programming in XLogoOnline, but this task is similar to tasks in works like MMMU, which also combine skills such as vision and reasoning and have already shown that current models struggle with visual reasoning tasks.
2. This paper also lacks novelty in improving models' performance:
   1. The data synthesis method is based on sampling and filtration,

Reviewer 02 · Rating 5 · Confidence 3

Strengths

It introduces XLOGOMINIPROG, a new benchmark that tests multiskill tasks in visual programming, an area where current models perform poorly. It demonstrates the effectiveness of emulator-driven feedback for designing a dynamic training curriculum, further boosting model performance. Detailed experiments were carried out

Weaknesses

The emulator-driven fine-tuning provides only binary correctness feedback on the predicted code, which might not be sufficient for identifying and correcting specific errors in the generated code. Could the feedback be given in natural language instead? There is also a risk that models could overfit to the synthetic data used in training, which might not perfectly mimic the complexity and variability of real-world programming challenges. I think this is the most serious disadvantage.
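To make the binary-feedback point concrete, here is a minimal sketch of how pass/fail signals alone could steer a training curriculum. It is an illustration under assumptions and not the paper's procedure: the functions `update_weights` and `sample_batch`, and the `boost`, `decay`, and `floor` parameters, are hypothetical.

```python
import random


def update_weights(weights, passed, boost=2.0, decay=0.5, floor=0.1):
    """Raise the sampling weight of failed tasks and lower it for solved ones.

    `weights` maps task id -> current sampling weight.
    `passed` maps task id -> bool, the emulator's binary verdict.
    """
    new_weights = {}
    for task_id, weight in weights.items():
        factor = decay if passed[task_id] else boost
        # The floor keeps solved tasks from vanishing from the curriculum.
        new_weights[task_id] = max(floor, weight * factor)
    return new_weights


def sample_batch(weights, k, rng):
    """Draw k task ids with probability proportional to their weights."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=k)


# Usage: after one evaluation round, task "b" failed and is sampled more often.
weights = update_weights({"a": 1.0, "b": 1.0}, {"a": True, "b": False})
batch = sample_batch(weights, k=4, rng=random.Random(0))
```

The sketch also shows the limitation raised here: a single bit per task says which tasks to revisit, but nothing about which part of a generated program was wrong.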

Reviewer 03 · Rating 3 · Confidence 5

Strengths

1. Novel Programming Generation Evaluation: The evaluation of programming generation based on the XLogoOnline visual programming environment is somewhat novel, as it integrates vision, language, and coding skills to assess LLMs.
2. Well-Structured Paper Organization: The paper is well-organized, starting with an introduction to the benchmark, followed by a detailed presentation of the training data, and concluding with an evaluation of different performance metrics and analyses.

Weaknesses

1. Lack of Task Challenge: The task presented in this paper appears to lack sufficient challenge, as evidenced by Table 6. Although the zero-shot performance of both closed and open-source models is low, simply synthesizing a large amount of training data can significantly boost performance. This suggests that the task is not particularly challenging. The poor performance of current models can be attributed to their lack of training with this specific data. A more detailed analysis of task diffi

Code & Models

Datasets

machine-teaching-group/XLogoMiniProg
dataset· 49 dl

Taxonomy

Topics: Embedded Systems Design Techniques · Teaching and Learning Programming