Out-of-Distribution Generalization in the ARC-AGI Domain: Comparing Execution-Guided Neural Program Synthesis and Test-Time Fine-Tuning

Simon Ouellette

arXiv:2507.15877·cs.AI·September 23, 2025

Open Access 3 Reviews

TL;DR

This paper compares execution-guided neural program synthesis and test-time fine-tuning in the ARC-AGI domain, demonstrating that program synthesis excels at out-of-distribution generalization, with fine-tuning mainly leveraging in-distribution knowledge.

Contribution

It provides a controlled experiment showing the superior out-of-distribution generalization of execution-guided neural program synthesis over test-time fine-tuning in the ARC-AGI domain.

Findings

01

Execution-guided neural program synthesis outperforms reference algorithms.

02

Test-time fine-tuning mainly relies on in-distribution knowledge.

03

Neural program synthesis better composes novel solutions.

Abstract

We run a controlled compositional generalization experiment in the ARC-AGI domain: an open-world problem domain in which the ability to generalize out-of-distribution is, by design, an essential characteristic for success. We compare neural program synthesis and test-time fine-tuning (TTFT) approaches on this experiment. We find that execution-guided neural program synthesis outperforms all reference algorithms in its ability to compose novel solutions. Our empirical findings also suggest that the success of TTFT on ARC-AGI lies mainly in eliciting in-distribution knowledge that the LLM otherwise fails to rely on directly.
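To make the term "execution-guided" concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: the three grid primitives, the `mismatch` heuristic, and the `synthesize` and `run` helpers are all hypothetical names introduced for illustration, and a hand-written cell-mismatch score stands in for the neural model that would normally rank candidate steps. The only idea it aims to show is that each partial program is executed on the example inputs and the resulting intermediate grids decide which branch to expand next.

```python
import heapq
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Grid = Tuple[Tuple[int, ...], ...]

# Hypothetical toy DSL: three grid primitives (an assumption, not the paper's DSL).
def flip_h(g: Grid) -> Grid:
    return tuple(tuple(reversed(row)) for row in g)

def flip_v(g: Grid) -> Grid:
    return tuple(reversed(g))

def transpose(g: Grid) -> Grid:
    return tuple(zip(*g))

PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}

def mismatch(state: Grid, target: Grid) -> int:
    """Number of differing cells; a shape mismatch gets the maximal penalty."""
    if len(state) != len(target) or len(state[0]) != len(target[0]):
        return len(target) * len(target[0])
    return sum(a != b for rs, rt in zip(state, target) for a, b in zip(rs, rt))

def run(program: Sequence[str], grid: Grid) -> Grid:
    """Apply the named primitives to the grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def synthesize(examples: Sequence[Tuple[Grid, Grid]],
               max_depth: int = 4) -> Optional[List[str]]:
    """Best-first search guided by the executed state of each partial program."""
    inputs = tuple(x for x, _ in examples)
    targets = tuple(y for _, y in examples)

    def score(states: Tuple[Grid, ...]) -> int:
        return sum(mismatch(s, t) for s, t in zip(states, targets))

    counter = 0  # unique tie-breaker so the heap never compares programs
    frontier = [(score(inputs), counter, [], inputs)]
    seen = {inputs}  # intermediate execution states already reached
    while frontier:
        cost, _, program, states = heapq.heappop(frontier)
        if cost == 0:
            return program
        if len(program) >= max_depth:
            continue
        for name, fn in PRIMITIVES.items():
            new_states = tuple(fn(s) for s in states)  # execute one more step
            if new_states in seen:
                continue  # prune programs that reach a known state
            seen.add(new_states)
            counter += 1
            heapq.heappush(
                frontier, (score(new_states), counter, program + [name], new_states)
            )
    return None

# Usage: both examples are 90-degree clockwise rotations.
examples = [
    (((1, 2), (3, 4)), ((3, 1), (4, 2))),
    (((5, 6, 7), (8, 9, 0)), ((8, 5), (9, 6), (0, 7))),
]
program = synthesize(examples)
```

The search composes a rotation out of primitives that do not include one, which is a very small-scale analogue of composing a novel solution from known parts. More than one program can produce the same rotation, so the sketch makes no claim about which one is returned.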

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Compositional generalization is an important yet underexplored problem. The setup the authors considered in ARC-AGI will be generally useful for evaluating the compositional generality of future approaches.
2. The authors demonstrate that fine-tuning is primarily useful for unlocking latent in-distribution knowledge, not for OOD generalization. This is an interesting finding, as such fine-tuning approaches are very popular in practice.
3. The authors develop (to my knowledge) the first instance

Weaknesses

1. While execution-guided program synthesis is shown to be effective at compositional generalization, its applicability seems limited by the requirement of designing a DSL that covers all theoretically solvable tasks. In particular, the DSL the authors designed does not cover the full scope of ARC-AGI tasks, so it is unclear how useful such approaches will be in general.
2. The entropy term added to search did not seem particularly effective. It is unclear if the search algorithm im

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* The paper is well written and gives enough detail to understand how it differs from prior approaches.
* The ablation study is focused and explains where EG-NPS improves over the prior works evaluated.
* The gains in success rate on the programs evaluated are significant. The approach solves about 83% of the small fraction (1.5%) of ARC-AGI-2 tasks that fall within the scope of the DSL considered in the work. The paper's proposed method combines a number of pre-existing ideas, but in a systematic way. The an

Weaknesses

* The paper is relatively weak on explaining how the expressiveness of the DSL is a stumbling block to a broader evaluation. The paper briefly mentions in Section 6.4 that the DSL had to be expanded to deal with just 1.5% of ARC-AGI-2 benchmarks. It is known that the search space explodes, and there is work characterizing the input-output samples needed therein; see the work by Wang et al. [a]. It would be useful if the paper talked about the issue of DSL design, state explosion, and expected s

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper is overall well-written:
- Focus on compositional and OOD generalization.
- Novel tree search algorithm for execution-guided neural program synthesis.
- Diverse empirical setups: include baselines and ablations.

Weaknesses

The paper’s main weaknesses include scale, scope, and generalizability of the experiments:
- Requires ground-truth programs as training data.
- Requires a DSL, and the implemented one is very limiting.
- Limited scale and empirical scope: for instance, experiments on ARC-AGI include only 18 tasks out of the 1000+ available.
- Other fully neural baselines could have been tested against: e.g., [Ouellette, 2024] or [Bonnet & Macfarlane, 2024].

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning