A Theoretical Analysis of Test-Driven Code Generation
Nicolas Menet, Michael Hersche, Andreas Krause, Abbas Rahimi

TL;DR
This paper develops a probabilistic framework for test-driven code generation, analyzing environment-interaction strategies and providing theoretical insights into their effectiveness and limitations.
Contribution
It formalizes selection heuristics and backprompting, deriving bounds and biases, and validates findings with experiments on state-of-the-art models and benchmarks.
Findings
Estimators based on fuzzy similarity outperform those based on functional equivalence.
Backprompting is an in-context approximation of Thompson sampling with limited effectiveness.
A new benchmark, QiskitHumanEvalSimX, is proposed to improve task descriptions.
Abstract
Code assistants are increasingly utilized in test-driven software development, yet the theoretical mechanisms behind their environment-interaction strategies remain underexplored. We provide a probabilistic framework for two dominant paradigms: code selection after generation using the execution environment, and code generation conditioned on environment feedback. First, we formalize several well-established selection heuristics as environment-aware estimators of code correctness. We theoretically prove that estimators based on fuzzy functional similarity add an inductive bias and strictly dominate estimators based on functional equivalence in terms of signal-to-noise ratio. Second, we frame backprompting as an in-context approximation of Thompson sampling. We derive a novel regret bound for reward functions with unobservable components, theoretically explaining why the effectiveness of…
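The contrast between functional equivalence and fuzzy functional similarity can be illustrated with a small sketch. It is not the paper's estimators: the function names (`run_all`, `equivalence_scores`, `fuzzy_scores`), the toy candidates, and the scoring rules are assumptions chosen for illustration. Equivalence scoring counts only candidates whose outputs match on every test input, while fuzzy scoring gives partial credit for agreement on some inputs.

```python
from collections import Counter

def run_all(candidates, inputs):
    """Run every candidate on every input; an exception becomes a sentinel output."""
    signatures = []
    for f in candidates:
        outs = []
        for x in inputs:
            try:
                outs.append(repr(f(x)))
            except Exception:
                outs.append("<error>")
        signatures.append(tuple(outs))
    return signatures

def equivalence_scores(signatures):
    """Hypothetical equivalence-based score: fraction of candidates
    whose output signature is identical on all inputs."""
    counts = Counter(signatures)
    n = len(signatures)
    return [counts[s] / n for s in signatures]

def fuzzy_scores(signatures):
    """Hypothetical fuzzy score: mean per-input agreement rate
    between a candidate and every candidate (including itself)."""
    n = len(signatures)
    scores = []
    for s in signatures:
        total = 0.0
        for t in signatures:
            total += sum(a == b for a, b in zip(s, t)) / len(s)
        scores.append(total / n)
    return scores

# Toy candidates for "absolute value" (illustrative only).
candidates = [
    lambda x: abs(x),                    # correct
    lambda x: x if x >= 0 else -x,       # correct, different implementation
    lambda x: x,                         # wrong on all negative inputs
    lambda x: abs(x) if x != -3 else 0,  # wrong on a single input
]
inputs = [-3, -1, 0, 2]

sigs = run_all(candidates, inputs)
eq = equivalence_scores(sigs)
fz = fuzzy_scores(sigs)
print(eq)
print(fz)
```

In this toy setting, equivalence scoring gives the two incorrect candidates the same score, whereas fuzzy scoring ranks the nearly correct one higher. This shows only that the fuzzy score carries a finer-grained signal here. It does not reproduce the paper's signal-to-noise analysis.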
