Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents

Zergham Ahmed; Kazuki Irie; Joshua B. Tenenbaum; Christopher J. Bates; Samuel J. Gershman

arXiv:2602.00929·cs.AI·February 3, 2026

TL;DR

TheoryCoder-2 advances hierarchical planning in program-synthesis agents by enabling LLMs to learn abstractions from experience, significantly improving sample efficiency and problem-solving capabilities across diverse environments.

Contribution

The paper introduces a new Theory-Based RL (TBRL) agent that actively learns abstractions using LLMs, reducing reliance on human-provided abstractions and improving generalization in hierarchical planning.

Findings

01. TheoryCoder-2 outperforms baseline agents in sample efficiency.

02. It solves complex tasks that prior methods cannot.

03. It requires minimal human prompts for abstraction learning.

Abstract

Humans learn abstractions and use them to plan efficiently to quickly generalize across tasks -- an ability that remains challenging for state-of-the-art large language model (LLM) agents and deep reinforcement learning (RL) systems. Inspired by the cognitive science of how people form abstractions and intuitive theories of their world knowledge, Theory-Based RL (TBRL) systems, such as TheoryCoder, exhibit strong generalization through effective use of abstractions. However, they heavily rely on human-provided abstractions and sidestep the abstraction-learning problem. We introduce TheoryCoder-2, a new TBRL agent that leverages LLMs' in-context learning ability to actively learn reusable abstractions rather than relying on hand-specified ones, by synthesizing abstractions from experience and integrating them into a hierarchical planning process. We conduct experiments on diverse…
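
To make the idea of planning with reusable abstractions concrete, here is a minimal sketch of two-level planning in a toy key-and-door grid. It is not the TheoryCoder-2 implementation: the `Operator` class, the hand-written `LIBRARY` (standing in for abstractions an LLM would synthesize from experience), the `TARGETS` grounding table, and the grid itself are all hypothetical names chosen for illustration. A high-level search picks a sequence of PDDL-style operators, and a low-level search turns each operator into primitive moves.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """PDDL-style abstract action: applicable when `pre` holds, then adds/deletes facts."""
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset = frozenset()


def high_level_plan(init, goal, library):
    """Breadth-first search over symbolic states using operators from the library."""
    start = frozenset(init)
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:
            return plan
        for op in library:
            if op.pre <= state:
                nxt = (state - op.delete) | op.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [op]))
    return None  # no operator sequence reaches the goal


def low_level_path(grid, start, target):
    """Breadth-first search over grid cells ('#' is a wall); returns primitive moves."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        (r, c), path = frontier.popleft()
        if (r, c) == target:
            return path
        for name, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), path + [name]))
    return None  # target unreachable


def solve(grid, start, goal, library, targets):
    """Plan abstractly first, then ground each abstract step with low-level search."""
    plan = high_level_plan(frozenset(), goal, library)
    if plan is None:
        return None
    pos, actions = start, []
    for op in plan:
        path = low_level_path(grid, pos, targets[op.name])
        if path is None:
            return None
        actions += path
        pos = targets[op.name]
    return [op.name for op in plan], actions


# Hypothetical toy domain: agent at (0, 0), key at (1, 2), door at (2, 0).
GRID = ["A.#",
        "..K",
        "D.."]
LIBRARY = [
    Operator("get_key", frozenset(), frozenset({"has_key"})),
    Operator("open_door", frozenset({"has_key"}), frozenset({"door_open"})),
]
TARGETS = {"get_key": (1, 2), "open_door": (2, 0)}

abstract_plan, primitive_actions = solve(
    GRID, (0, 0), frozenset({"door_open"}), LIBRARY, TARGETS
)
print(abstract_plan, primitive_actions)
```

Because the operators are stored in a library separate from any one grid, the same `LIBRARY` could be reused on a different layout by changing only `GRID` and `TARGETS`; that reuse across tasks is the property the abstract attributes to learned abstractions.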

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper focuses on an interesting and timely direction: automatically learning reusable abstractions with LLMs.
- The presentation is clear and the paper is generally easy to follow.

Weaknesses

- The method resembles prior work on learning and reusing skills through curriculum learning and code-based representations. The main difference here is expressing these skills in the PDDL format, but similar high-level ideas (e.g., learning, storing, and recalling code snippets or functions for reuse) have been explored in many recent works such as [1]. However, these related baselines are not discussed in sufficient depth, which makes it hard to assess what is genuinely new beyond the specifi

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper is well-written and very easy to follow. It introduces the domain, the task, and the method in a way that is friendly to the broader RL/ML community.
- The improvement from TheoryCoder to TheoryCoder-2 is significant. It removes the need for domain-specific human annotation, which shows potential to scale to more complex, unseen environments.
- Using programs as a representation of complex environments and their transitions is interesting and points to a promising direction for world models.

Weaknesses

- Though modeling through programs works well on these environments, I am unsure how the TBRL paradigm and TheoryCoder-2 will perform on more complex, real-world tasks. Recent work on agent tasks has proposed more complex benchmarks, e.g., ALFWorld, WebShop, and WebArena. I doubt that this paradigm and method can still discover and model the transition function there, because modeling a very complex environment purely with code is difficult (if not impossible). The authors may want to

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper addresses a limitation in TBRL and many program-synthesis agents (dependency on human input). The proposed approach is an interesting solution to this issue.
- To the best of my knowledge, the combination of (i) PDDL abstraction synthesis via in-context LLM prompting, (ii) creation of a curriculum-based abstraction collection, and (iii) grounding through low-level Python models is novel.
- The idea fits ICLR.

Weaknesses

To me, the main weakness of the paper is the scope of the experimental analysis. The experiments are not sufficient to support a strong, general conclusion that TheoryCoder-2 achieves scalable, human-like abstraction learning. Only one curriculum with eight problems is evaluated in the experimental results, so I am not confident drawing conclusions from it. Usually, I think that good ideas can overcome bad experimental results, but in the case of this paper, the main motivation is to show that t

Taxonomy

Topics: AI-based Problem Solving and Planning · Artificial Intelligence in Games · Reinforcement Learning in Robotics