Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs

Ananya Singha; Harshita Sahijwani; Walt Williams; Emmanuel Aboah Boateng; Nick Hausman; Miguel Di Luca; Keegan Choudhury; Chaya Binet; Vu Le; Tianwei Chen; Oryan Rokeah Chen; Sulaiman Vesal; Sadid Hasan

arXiv:2508.11715·cs.SE·August 19, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a new benchmark dataset of 618 high-quality Excel formula repair samples created through a novel data generation pipeline using LLMs, enabling better evaluation and development of formula correction models.

Contribution

The paper presents a scalable data generation pipeline leveraging LLMs and validation frameworks to create a high-quality benchmark dataset for Excel formula repair, addressing a key resource gap.

Findings

01. The dataset covers common runtime errors in Excel formulas.

02. Evaluation of various LLMs shows GPT-4 variants perform best on the benchmark.

03. The data generation approach is adaptable to other low-resource code repair tasks.

Abstract

Excel is a pervasive yet often complex tool, particularly for novice users, where runtime errors arising from logical mistakes or misinterpretations of functions pose a significant challenge. While large language models (LLMs) offer promising assistance by explaining formula errors, the automated correction of these semantic runtime errors remains an open problem. A primary challenge to advancing models for such scenarios is the severe lack of high-quality, comprehensive datasets for training and rigorous evaluation. This paper addresses this gap by introducing a novel approach for constructing a benchmark dataset specifically designed for Excel formula repair. We propose a data generation pipeline, which leverages a small set of curated seed samples from online forums to synthetically expand the dataset. Our pipeline integrates few-shot prompting with LLMs and employs a robust…
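The seed-then-expand loop outlined in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `RepairSample`, `build_few_shot_prompt`, `expand_dataset`, the two validators, and `stub_generator` are all hypothetical names. The generator is a hard-coded stand-in for an LLM call, and the validators are simple placeholder checks rather than the validation the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class RepairSample:
    """Hypothetical record: user intent plus a faulty/fixed formula pair."""
    utterance: str
    faulty_formula: str
    fixed_formula: str


Generator = Callable[[str], Iterable[RepairSample]]
Validator = Callable[[RepairSample], bool]


def build_few_shot_prompt(seeds: List[RepairSample]) -> str:
    """Render the curated seeds as demonstrations for a generator model."""
    shots = [
        f"Intent: {s.utterance}\nFaulty: {s.faulty_formula}\nFixed: {s.fixed_formula}"
        for s in seeds
    ]
    return "\n\n".join(shots) + "\n\nIntent:"


def expand_dataset(
    seeds: List[RepairSample],
    generate: Generator,
    validators: List[Validator],
    rounds: int = 1,
) -> List[RepairSample]:
    """Keep generated candidates that are new and pass every validator."""
    prompt = build_few_shot_prompt(seeds)
    seen = set(seeds)  # frozen dataclasses are hashable
    accepted: List[RepairSample] = []
    for _ in range(rounds):
        for candidate in generate(prompt):
            if candidate in seen:
                continue  # drop duplicates of seeds or earlier candidates
            seen.add(candidate)
            if all(check(candidate) for check in validators):
                accepted.append(candidate)
    return accepted


# Placeholder validators (assumptions, not the paper's checks).
def is_formula_pair(sample: RepairSample) -> bool:
    return sample.faulty_formula.startswith("=") and sample.fixed_formula.startswith("=")


def repair_changes_formula(sample: RepairSample) -> bool:
    return sample.faulty_formula != sample.fixed_formula


def stub_generator(prompt: str) -> List[RepairSample]:
    """Stand-in for an LLM call; returns fixed candidates for illustration."""
    good = RepairSample("Average the sales column", "=SUM(B2:B10)/0", "=AVERAGE(B2:B10)")
    unchanged = RepairSample("Total units", "=SUM(C2:C10)", "=SUM(C2:C10)")
    return [good, unchanged, good]


seeds = [RepairSample("Look up the price", "=VLOOKUP(A2,D:E,3,FALSE)", "=VLOOKUP(A2,D:E,2,FALSE)")]
samples = expand_dataset(seeds, stub_generator, [is_formula_pair, repair_changes_formula], rounds=2)
print(len(samples), "sample(s) accepted")
```

In a real setting the stub would be replaced by a model call, and the validators by checks that actually execute the formulas; keeping both behind plain callables makes that swap straightforward.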

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- The paper identifies a genuine gap in semantic formula repair for spreadsheets, which differs significantly from syntax-only program repair tasks in structure and context dependency. By focusing on runtime errors and spreadsheet semantics, it opens a practically relevant and novel research direction.
- The combination of manual seed verification, execution validation, and LLM-judge filtering ensures the generated dataset’s correctness and consistency. The detailed curation and verification pi

Weaknesses

- While the dataset fills an important gap, its final size (618 samples) remains small relative to the diversity of real-world Excel usage. Moreover, the samples tend to be simpler than genuine user-generated errors, limiting the dataset’s stress-testing potential.
- Overreliance on GPT-based validation may bias results. Because both dataset generation and evaluation rely on GPT-4 variants (e.g., GPT-4o as generator and GPT-4/4.1 as baselines), the benchmark may be inadvertently tuned to GPT’s

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper addresses a significant gap in the literature. While LLMs have shown promise in code generation and repair for general-purpose languages, their application to spreadsheet formula repair has been underexplored.
- The inclusion of context (table data, headers) and user intent (natural language utterance) is crucial for modeling realistic repair scenarios, moving beyond purely syntactic fixes.

Weaknesses

- **Baseline Method Simplicity:** The proposed baseline repair technique, while context-aware, is essentially a single-prompt engineering approach. It doesn't introduce a novel algorithmic or architectural contribution for repair.
- **Scalability Claim:** The paper claims the methodology is "highly scalable." However, the process relies heavily on a manually curated seed set (59 samples after rigorous manual filtering of forum posts). The scalability of the entire pipeline is therefore continge

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* The proposed benchmark considers semantic correctness, which is an important aspect but overlooked by previous works in this domain.
* The curation of the seed dataset is sound and rigorous.
* Experiments and analysis for the benchmarking dataset are comprehensive and inspiring.

Weaknesses

* The seed dataset is a more reliable evaluation set, albeit having a small size. In contrast, the quality of the bootstrapped dataset is concerning. To explain, there is an LLM *examiner* who generates the problem along with a reference answer, and an LLM *examinee* who attempts to solve it. If we only think about generating the answer part, with the same underlying model, the *examiner* has no advantage over the *examinee*, except for the 1-shot demonstration (which is ideally not useful for s

Reviewer 04 · Rating 2 · Confidence 3

Strengths

- The topic is interesting, and the work promises to contribute to this area.
- Synthetic datasets are an important research direction.
- The method is straightforward to follow.

Weaknesses

- While the proposed synthetic data generation pipeline is creative, it raises concerns about potential bias and limited generalization. Starting from only 50 manually curated seed samples—sourced exclusively from a single forum (MrExcel), which is both small and unrepresentative—and expanding them via one-shot prompting (GPT-4o, temperature 0.64) risks severe mode collapse and hallucination. In such settings, the LLM is likely to replicate seed patterns mechanically rather than capture the true

Taxonomy

Topics: Mathematics, Computing, and Information Processing