InnateCoder: Learning Programmatic Options with Foundation Models

Rubens O. Moraes; Quazi Asif Sadmine; Hendrik Baier; Levi H. S. Lelis

arXiv:2505.12508 · cs.LG · May 20, 2025

TL;DR

InnateCoder uses foundation models to encode human knowledge as programmatic options in a zero-shot setting, then searches for programmatic policies by combining those options, which improves sample efficiency in reinforcement learning tasks.

Contribution

The paper introduces a method that learns options from foundation models without any environment interaction, making the subsequent search for a policy more sample-efficient.

Findings

01

InnateCoder outperforms baseline methods in sample efficiency.

02

Options learned from foundation models improve learning speed.

03

The approach is validated empirically in two domains: MicroRTS and Karel the Robot.

Abstract

Outside of transfer learning settings, reinforcement learning agents start their learning process from a clean slate. As a result, such agents have to go through a slow process to learn even the most obvious skills required to solve a problem. In this paper, we present InnateCoder, a system that leverages human knowledge encoded in foundation models to provide programmatic policies that encode "innate skills" in the form of temporally extended actions, or options. In contrast to existing approaches to learning options, InnateCoder learns them from the general human knowledge encoded in foundation models in a zero-shot setting, and not from the knowledge the agent gains by interacting with the environment. Then, InnateCoder searches for a programmatic policy by combining the programs encoding these options into larger and more complex programs. We hypothesized that InnateCoder's way of…
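To make the idea in the abstract concrete, here is a minimal sketch of searching over programs with and without a library of options. It is not the paper's algorithm or DSL: the 1-D world, the `PRIMITIVES` and `OPTIONS` lists, and the `hill_climb` function are all hypothetical, and the options are hard-coded stand-ins for programs that a foundation model might propose zero-shot.

```python
# Toy illustration only: a 1-D world where a program is a sequence of moves.
# All names below are assumptions made for this sketch, not taken from the paper.

# Primitive fragments: move right or left by one cell.
PRIMITIVES = [("R",), ("L",)]

# Hypothetical "innate" options: short programs a foundation model might
# propose without environment interaction. Hard-coded here; no model is queried.
OPTIONS = [("R", "R", "R", "R"), ("L", "L", "L", "L")]


def flatten(program):
    """Turn a list of fragments into one flat tuple of actions."""
    return tuple(action for fragment in program for action in fragment)


def run(actions, start=0):
    """Execute a flat action sequence and return the final position."""
    pos = start
    for action in actions:
        pos += 1 if action == "R" else -1
    return pos


def score(actions, target):
    """Higher is better: negative distance to target, minus a small length penalty."""
    return -abs(run(actions) - target) - 0.01 * len(actions)


def neighbors(program, library):
    """Programs reachable by inserting, deleting, or replacing one fragment."""
    out = []
    for i in range(len(program) + 1):
        for fragment in library:
            out.append(program[:i] + [fragment] + program[i:])
    for i in range(len(program)):
        out.append(program[:i] + program[i + 1:])
        for fragment in library:
            out.append(program[:i] + [fragment] + program[i + 1:])
    return out


def hill_climb(library, target, max_steps=100):
    """Best-improvement local search; returns (flat program, evaluations used)."""
    program, evaluations = [], 0
    best = score(flatten(program), target)
    for _ in range(max_steps):
        scored = [(score(flatten(p), target), p) for p in neighbors(program, library)]
        evaluations += len(scored)
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # local optimum: no neighbor improves the score
        best, program = top_score, top
    return flatten(program), evaluations


plain, plain_evals = hill_climb(PRIMITIVES, target=12)
seeded, seeded_evals = hill_climb(PRIMITIVES + OPTIONS, target=12)
print("primitives only:", run(plain), "evaluations:", plain_evals)
print("with options:   ", run(seeded), "evaluations:", seeded_evals)
```

In this toy setting both searches reach the target, but the search that can splice in whole options needs fewer program evaluations, which is the kind of sample-efficiency effect the abstract describes. The actual system searches a much richer program space and obtains its options from a foundation model.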

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- Writing overall is fine, although I have some small complaints which I'll mention in the weakness section.
- I believe this method is novel to my knowledge.
- InnateCoder was evaluated in two domains used in previous methods, showing nontrivial improvement over baselines in a number of tasks while matching performances in the rest.
- I appreciate the authors' effort to eliminate data leakage as a factor during evaluation.

Weaknesses

- I find the presentation of this paper to be a little confusing for someone not familiar with the prior work. For example, in section 2, the authors start to talk about the pros and cons of DSL before giving an introduction or even a definition of DSL. I also think using the actual language (or a simplified version) in one of the tested domains, instead of the generic "if b then c" or "c1" "c2" as the example could be more helpful.
- Could the authors specify what are the exact differences betw

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Authors do a nice job of explaining the algorithm.
- Authors also did a nice job of providing the DSL for each environment and the exact prompts used.

Weaknesses

The main weakness here is that the contribution seems to be small. It seems likely that the better use of an LLM is to learn a policy that outputs short programs conditioned on the history of the game that has occurred. This way the foundation model can adjust its output as it gets more information on how the game works. The reasoning provided for not testing this is that this would be expensive. But if you were to apply this algorithm from scratch to a new environment, it would already be co

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This paper works on a very interesting direction of using pretrained LLMs to synthesize agent policies in RL tasks. This is important, as we still lack any meaningful foundation models for agents (here understood as action-taking systems), and this seems to be one good way of using the "general knowledge machines" for some agentic behavior. The results obtained with this approach seem very good, and claim SOTA - I'm not sufficiently familiar with this specific domain to evaluate this claim, bu

Weaknesses

My only main concern is about the fairness of comparison in the MicroRTS evaluation, possibly due to not understanding this part of the paper. In Figure 3, you plot the winrate. The winrate is defined on line 341 (somewhat confusingly under "Other specifications") as "The winning rate of a policy is computed for a set of opponent policies [...]" - what policies? Is it equally sampled between COAC, Mayari and RAISocketAI? More importantly, over the course of the training in Figure 3 (the one tha

Code & Models

Repositories

rubensolv/InnateCoder
PyTorch · Official

Taxonomy

Topics: Reinforcement Learning in Robotics · Artificial Intelligence in Games · Domain Adaptation and Few-Shot Learning