PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics
Atharva Naik, Prakam, Yash Mathur, Darsh Agrawal, Manav Kapadnis, Yuwei An, Clayton Marr, Carolyn Rose, David Mortensen

TL;DR
This paper introduces PBEBench, a novel benchmark inspired by historical linguistics that evaluates the inductive reasoning capabilities of large language models through a multi-step programming by examples task, highlighting current models' limitations.
Contribution
It presents a scalable, automated pipeline for generating reasoning problems and two benchmark datasets, advancing the evaluation of LLMs' reasoning skills beyond domain-specific tasks.
Findings
Models with test-time compute or LCoT reasoning outperform others.
Recent models solve less than 5% of hard instances.
Scaling strategies and hyperparameters significantly affect difficulty.
Abstract
Although many benchmarks evaluate the reasoning abilities of Large Language Models (LLMs) within domains such as mathematics, coding, or data wrangling, few abstract away from domain specifics to examine reasoning as a capability in and of itself. We contribute a novel type of benchmark evaluating the inductive reasoning capabilities of LLMs that is inspired by the forward reconstruction task from historical linguistics but is formulated in an extremely simple, general way (in the form of Programming by Examples). The task involves generating a cascade of simple string rewrite programs to transform a given list of input strings into a list of desired output strings. We present a fully automated pipeline that programmatically generates problems of this type with controllable difficulty, enabling scalable evaluation of reasoning models while avoiding contamination. Using this approach, we…
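To make the task in the abstract concrete, here is a minimal sketch of applying an ordered cascade of string rewrite programs to a list of inputs and checking whether a candidate cascade reproduces the desired outputs. The representation is an assumption made for illustration: each program is modeled as a hypothetical `(old, new)` pair applied with Python's `str.replace`, and the names `apply_cascade` and `solves` as well as the example strings are invented here, not taken from the paper.

```python
from typing import List, Tuple

# Assumption: one rewrite program = (substring to find, replacement string).
Rule = Tuple[str, str]


def apply_cascade(word: str, cascade: List[Rule]) -> str:
    """Apply each rewrite program in order; later rules see earlier results."""
    for old, new in cascade:
        word = word.replace(old, new)
    return word


def solves(cascade: List[Rule], inputs: List[str], outputs: List[str]) -> bool:
    """True if the cascade maps every input string to its desired output."""
    return all(apply_cascade(i, cascade) == o for i, o in zip(inputs, outputs))


# Illustrative problem instance (made-up data).
inputs = ["abc", "bcd", "cab"]
cascade = [("b", "d"), ("dc", "x")]
outputs = [apply_cascade(w, cascade) for w in inputs]

print(outputs)                                          # ['ax', 'xd', 'cad']
print(solves(cascade, inputs, outputs))                 # True
print(solves(list(reversed(cascade)), inputs, outputs)) # False
```

The reversed cascade fails because the first rule creates the `dc` substring that the second rule rewrites, so the order of programs matters. A solver therefore has to infer both the rules and their ordering from the input-output pairs alone.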
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Generally I find the paper well written, with good structure, clear flow, and all steps are given in sufficient detail that I would be able to reproduce their results.
- I find the formulation of the benchmark novel - many datasets present reasoning tasks that require multiple stages, but this offers a completely synthetic task grounded in historic linguistics. This is aesthetically pleasing, and offers a conceptually different problem to automated maths / logic benchmarks
- The eval

- It is unclear how the dataset guarantees that each of the problems are genuinely solvable
- It is unclear if the models are getting the right answer by following the correct set of reasoning steps - it could be partially using correct substitutions and partly de-noising - an important and powerful feature, but not necessarily the property the authors are trying to evaluate here. I would like to see some measure of how well the task path is followed?
- There don’t seem to be baselines

1. The formulation of the proposed benchmark and the corresponding data generation process is structured and well-grounded in formal definitions.
2. An alternative complexity measure based on the relation types between rule pairs presents an interesting approach, as it captures categorically different forms of compositionality. This diverges from the conventional practice of using composition length (e.g., cascade length) as the sole indicator of complexity.
3. The experiments cover a broad rang

1. I find the benchmark’s novelty as an evaluation measure for LLM reasoning somewhat unclear. The five distinctions listed in Section 1 appear conceptually overlapping with existing work or not fully substantiated in their current form:
   a. **Domain-agnostic design** It is described as domain-agnostic and independent of non-trivial domain knowledge. However, similar synthetic, domain-neutral reasoning benchmarks have been explored in prior work (e.g., [1,2,3,4]), with [2] als

- The dataset is timely and well-motivated, addressing the need for evaluating and improving inductive reasoning in LLMs.
- The proposed benchmark effectively enables scalable control over reasoning task complexity.
- The writing is clear, and the paper is well structured overall.

- The overall task is rather narrow in scope, raising questions about its usefulness for downstream applications.
- The lack of experiments showing whether training on PBEBench improves inductive reasoning capabilities within their dataset and downstream strongly limits the significance of the contribution.
- While the paper is well organized and generally clear, Section 3 is very dense and hard to follow. More illustrative examples would help to guide the reader through your paper.

* Many reasoning benchmarks emphasize deduction; PBEBench explicitly targets inductive program induction, filling a notable gap.
* The problem setup is unique. Bridging forward reconstruction and PBE with a minimal rewrite DSL is conceptually fresh and intellectually appealing.
* Comprehensive evaluation covers diverse LLMs (reasoning vs. non‑reasoning; open vs. closed), a wide range of difficulty levels PBEBench‑Lite through PBEBench with cascades up to 30, and multiple metrics (Pass@1, Edit_S

* The overall setup is not immediately understandable to newcomers. In Figure 1, it would help to state explicitly what the problem proposer takes as input/outputs and what the LLM solver receives/returns.
Taxonomy
Topics: Natural Language Processing Techniques · AI-based Problem Solving and Planning · Speech and dialogue systems
Methods: Focus · Sparse Evolutionary Training
