Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs
Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boateng, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan

TL;DR
This paper introduces a new benchmark dataset of 618 high-quality Excel formula repair samples created through a novel data generation pipeline using LLMs, enabling better evaluation and development of formula correction models.
Contribution
The paper presents a scalable data generation pipeline leveraging LLMs and validation frameworks to create a high-quality benchmark dataset for Excel formula repair, addressing a key resource gap.
Findings
The dataset covers common runtime errors in Excel formulas.
Evaluation of various LLMs shows GPT-4 variants perform best on the benchmark.
The data generation approach is adaptable to other low-resource code repair tasks.
Abstract
Excel is a pervasive yet often complex tool, particularly for novice users, where runtime errors arising from logical mistakes or misinterpretations of functions pose a significant challenge. While large language models (LLMs) offer promising assistance by explaining formula errors, the automated correction of these semantic runtime errors remains an open problem. A primary challenge to advancing models for such scenarios is the severe lack of high-quality, comprehensive datasets for training and rigorous evaluation. This paper addresses this gap by introducing a novel approach for constructing a benchmark dataset specifically designed for Excel formula repair. We propose a data generation pipeline, which leverages a small set of curated seed samples from online forums to synthetically expand the dataset. Our pipeline integrates few-shot prompting with LLMs and employs a robust…
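The pipeline described in the abstract (seed samples, LLM expansion, then validation) can be sketched as a generate-then-filter loop. This is a minimal illustration, not the authors' implementation: `RepairSample`, `expand_seeds`, and the `toy_*` functions are hypothetical names, the generator is a deterministic stand-in for a few-shot LLM call, and the two checks stand in for whatever execution and judging steps the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class RepairSample:
    utterance: str        # user's natural-language intent
    broken_formula: str   # formula that raises a runtime error
    fixed_formula: str    # reference repair


def expand_seeds(
    seeds: Iterable[RepairSample],
    generate: Callable[[RepairSample, int], List[RepairSample]],
    executes_ok: Callable[[RepairSample], bool],
    judge_ok: Callable[[RepairSample], bool],
    per_seed: int = 3,
) -> List[RepairSample]:
    """Expand seeds with a generator, keeping only unique candidates that pass both checks."""
    accepted: List[RepairSample] = []
    seen = set()
    for seed in seeds:
        for cand in generate(seed, per_seed):
            if cand in seen:          # drop exact duplicates
                continue
            seen.add(cand)
            if executes_ok(cand) and judge_ok(cand):
                accepted.append(cand)
    return accepted


def toy_generate(seed: RepairSample, n: int) -> List[RepairSample]:
    """Hypothetical stand-in for an LLM call: shifts column A to another column."""
    out = []
    for i in range(n):
        col = "BCDEFGH"[i % 7]
        broken = seed.broken_formula.replace("A", col)
        # Deliberately emit one non-repair (fixed == broken) so the filter has work to do.
        fixed = broken if i == 0 else seed.fixed_formula.replace("A", col)
        out.append(RepairSample(seed.utterance.replace("A", col), broken, fixed))
    return out


def toy_executes_ok(s: RepairSample) -> bool:
    """Stand-in for execution validation: the repair must be a formula and differ from the input."""
    return s.fixed_formula.startswith("=") and s.fixed_formula != s.broken_formula


def toy_judge_ok(s: RepairSample) -> bool:
    """Stand-in for an LLM judge: accept only repairs that guard the division."""
    return "IFERROR" in s.fixed_formula


seed = RepairSample(
    "Divide A1 by A2, show 0 if A2 is empty", "=A1/A2", "=IFERROR(A1/A2, 0)"
)
benchmark = expand_seeds([seed], toy_generate, toy_executes_ok, toy_judge_ok)
```

In a real setting the two predicates would be replaced by actual formula execution and a judging model. Keeping them as injected callables makes it easy to swap either one independently.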
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The paper identifies a genuine gap in semantic formula repair for spreadsheets, which differs significantly from syntax-only program repair tasks in structure and context dependency. By focusing on runtime errors and spreadsheet semantics, it opens a practically relevant and novel research direction.
- The combination of manual seed verification, execution validation, and LLM-judge filtering ensures the generated dataset’s correctness and consistency. The detailed curation and verification pi
- While the dataset fills an important gap, its final size (618 samples) remains small relative to the diversity of real-world Excel usage. Moreover, the samples tend to be simpler than genuine user-generated errors, limiting the dataset’s stress-testing potential.
- Overreliance on GPT-based validation may bias results. Because both dataset generation and evaluation rely on GPT-4 variants (e.g., GPT-4o as generator and GPT-4/4.1 as baselines), the benchmark may be inadvertently tuned to GPT’s
- The paper addresses a significant gap in the literature. While LLMs have shown promise in code generation and repair for general-purpose languages, their application to spreadsheet formula repair has been underexplored.
- The inclusion of context (table data, headers) and user intent (natural language utterance) is crucial for modeling realistic repair scenarios, moving beyond purely syntactic fixes.
- **Baseline Method Simplicity:** The proposed baseline repair technique, while context-aware, is essentially a single-prompt engineering approach. It doesn't introduce a novel algorithmic or architectural contribution for repair.
- **Scalability Claim:** The paper claims the methodology is "highly scalable." However, the process relies heavily on a manually curated seed set (59 samples after rigorous manual filtering of forum posts). The scalability of the entire pipeline is therefore continge
* The proposed benchmark considers semantic correctness, an important aspect that previous work in this domain has overlooked.
* The curation of the seed dataset is sound and rigorous.
* Experiments and analysis for the benchmarking dataset are comprehensive and inspiring.
* The seed dataset is a more reliable evaluation set, albeit having a small size. In contrast, the quality of the bootstrapped dataset is concerning. To explain, there is an LLM *examiner* who generates the problem along with a reference answer, and an LLM *examinee* who attempts to solve it. If we only think about generating the answer part, with the same underlying model, the *examiner* has no advantage over the *examinee*, except for the 1-shot demonstration (which is ideally not useful for s
- The topic is interesting, and the work promises to contribute to this area.
- Synthetic dataset generation is an important research direction.
- The method is straightforward to follow.
- While the proposed synthetic data generation pipeline is creative, it raises concerns about potential bias and limited generalization. Starting from only 50 manually curated seed samples—sourced exclusively from a single forum (MrExcel), which is both small and unrepresentative—and expanding them via one-shot prompting (GPT-4o, temperature 0.64) risks severe mode collapse and hallucination. In such settings, the LLM is likely to replicate seed patterns mechanically rather than capture the true
Taxonomy
Topics: Mathematics, Computing, and Information Processing
