Beyond the Training Distribution: Mapping Generalization Boundaries in Neural Program Synthesis
Henrik Voigt, Michael Habeck, Joachim Giesen

TL;DR
This paper introduces a controlled environment for evaluating neural program synthesis models, revealing their limitations in out-of-distribution generalization and emphasizing the importance of training diversity.
Contribution
It presents a novel, interpretable framework for assessing generalization boundaries in neural program synthesis using a controlled grammar-based environment.
Findings
Transformers struggle with syntactic extrapolation, showing a performance drop of over 30%.
Diverse sampling over semantic and syntactic spaces improves out-of-distribution generalization.
Scaling compute yields log-linear improvements, but these gains are not enough for robust generalization.
Abstract
Large-scale transformers achieve impressive results on program synthesis benchmarks, yet their true generalization capabilities remain obscured by data contamination and opaque training corpora. To rigorously assess whether models are truly generalizing or merely retrieving memorized templates, we introduce a strictly controlled program synthesis environment based on a domain-specific arithmetic grammar. By systematically enumerating and evaluating millions of unique programs, we construct interpretable syntactic and semantic metric spaces. This allows us to precisely map data distributions and sample train and test splits that isolate specific distributional shifts. Our experiments demonstrate that optimizing density generalization -- through diverse sampling over both semantic and syntactic spaces -- induces robust out-of-distribution generalization. Conversely, evaluating support…
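To make the idea of a controlled, enumerable environment concrete, here is a minimal Python sketch of the general approach the abstract describes. It is not the authors' grammar or code: the three-leaf, three-operator grammar, the probe inputs, and the names `programs_up_to`, `signature`, and `depth_split` are all assumptions made for illustration. It enumerates every program up to a depth bound, uses tree depth as a stand-in for a syntactic coordinate, uses outputs on fixed inputs as a stand-in for a semantic coordinate, and then holds out deeper programs to isolate a syntactic shift.

```python
# Toy arithmetic grammar (assumed for illustration, not the paper's grammar):
#   expr := "x" | "1" | "2" | (op, expr, expr)   with op in {+, -, *}
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
LEAVES = ("x", "1", "2")
PROBE_INPUTS = (-2, -1, 0, 1, 2)  # hypothetical inputs used to compare behaviour


def programs_up_to(max_depth):
    """Enumerate every unique program whose tree depth is <= max_depth."""
    if max_depth == 0:
        return list(LEAVES)
    smaller = programs_up_to(max_depth - 1)
    combined = [(op, left, right) for op in OPS for left in smaller for right in smaller]
    # dict.fromkeys removes duplicates while keeping enumeration order.
    return list(dict.fromkeys(smaller + combined))


def depth(program):
    """Syntactic coordinate: depth of the expression tree (leaves have depth 0)."""
    if isinstance(program, str):
        return 0
    _, left, right = program
    return 1 + max(depth(left), depth(right))


def evaluate(program, x):
    """Run a program on a single integer input x."""
    if program == "x":
        return x
    if isinstance(program, str):
        return int(program)
    op, left, right = program
    return OPS[op](evaluate(left, x), evaluate(right, x))


def signature(program):
    """Semantic coordinate: outputs on the probe inputs.

    Two programs with the same signature behave identically on these inputs.
    """
    return tuple(evaluate(program, x) for x in PROBE_INPUTS)


def depth_split(programs, max_train_depth):
    """Train on shallow programs, test on strictly deeper ones."""
    train = [p for p in programs if depth(p) <= max_train_depth]
    test = [p for p in programs if depth(p) > max_train_depth]
    return train, test


if __name__ == "__main__":
    all_programs = programs_up_to(2)
    train, test = depth_split(all_programs, max_train_depth=1)

    # Optionally tighten the test set to behaviours never seen in training,
    # which adds a semantic shift on top of the syntactic one.
    seen = {signature(p) for p in train}
    novel_test = [p for p in test if signature(p) not in seen]

    print(len(train), len(test), len(novel_test))
```

Because the whole program space is enumerated, every train and test example has known coordinates, so a split can be defined by a rule rather than by chance. A real setup would need a richer grammar and a proper distance on both spaces; the depth threshold and the exact-match signature here are the simplest possible placeholders.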
