Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs

Jonathan Cook; Silvia Sapora; Arash Ahmadian; Akbir Khan; Tim Rocktaschel; Jakob Foerster; Laura Ruis

arXiv:2506.18777·cs.AI·February 25, 2026

3 Reviews

TL;DR

The paper introduces Programming by Backprop (PBB), a training method that enables large language models to learn procedural behaviors from declarative instructions, significantly reducing the number of examples needed for effective learning.

Contribution

PBB is a novel training regime that separates instruction-to-behavior mapping from instruction internalization, improving sample efficiency in LLM training.

Findings

01. PBB achieves high sample efficiency, with one instruction replacing up to 100 examples.

02. Controlled experiments show PBB outperforms homogeneous data training.

03. Procedural knowledge can be noisily embedded into LLMs through PBB.

Abstract

Large language models (LLMs) are typically trained to acquire behaviours from demonstrations or experience, yet much of their training data is declarative: instructions, rules, and descriptions that specify behaviours without showing how to execute them. We introduce Programming by Backprop (PBB): a training regime that enables LLMs to acquire procedural knowledge (i.e., reusable behaviours) from declarative instructions encountered during training. With PBB, instructions in training data provide an opportunity to 'program' specific behaviours into model weights. The core principle underpinning PBB is the separation of learning how instructions map to behaviour from internalising new instructions. We devise two distinct PBB curricula that leverage this principle. Through controlled experiments across two domains (algorithmic execution from Python source code and text generation from…
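To make the data setup concrete, here is a minimal sketch of how a fine-tuning set for the Python-source-code domain might be assembled. It assumes that some functions appear with both their source and input/output demonstrations, while others appear as source only and are then queried by name at test time. The function names (`f_paired`, `f_unpaired`), the record fields, and the prompt format are all hypothetical, and the sketch builds a single mixed set; it does not reproduce the paper's two curricula.

```python
import random

# Hypothetical toy "instructions": Python source keyed by function name.
SOURCES = {
    "f_paired": "def f_paired(x):\n    return 2 * x + 3\n",
    "f_unpaired": "def f_unpaired(x):\n    return x * x - 1\n",
}


def load(name):
    """Execute the stored source and return the resulting callable."""
    namespace = {}
    exec(SOURCES[name], namespace)
    return namespace[name]


def declarative_doc(name):
    """A training record that only states the instruction (the source code)."""
    return {"kind": "instruction", "text": SOURCES[name]}


def demonstration(name, x):
    """A training record that shows the function applied to one input."""
    return {"kind": "demonstration", "text": f"{name}({x}) = {load(name)(x)}"}


def build_training_set(paired, unpaired, n_demos, seed=0):
    """Every function contributes its source; only paired ones get demos."""
    rng = random.Random(seed)
    data = [declarative_doc(name) for name in paired + unpaired]
    for name in paired:
        for _ in range(n_demos):
            data.append(demonstration(name, rng.randint(-10, 10)))
    rng.shuffle(data)
    return data


def build_eval_set(unpaired, inputs):
    """Test prompts name the function but never include its source."""
    return [
        {"prompt": f"{name}({x}) = ", "target": str(load(name)(x))}
        for name in unpaired
        for x in inputs
    ]


train = build_training_set(["f_paired"], ["f_unpaired"], n_demos=5)
evals = build_eval_set(["f_unpaired"], [2, 3])
```

In this sketch the model would never see an executed example of `f_unpaired` during training, so a correct answer on `evals` would have to come from the source-only record rather than from the prompt.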

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Timely problem & good problem framing. The paper is very in tune with where LLM serving is going: disaggregation (Splitwise, DistServe, Mooncake), MoE, and separate RL rollout fleets all want *non-collective*, *asymmetric*, *sometimes sparse* transfers. The authors clearly articulate where collectives fall short (fixed membership, ordering, shape uniformity), and that framing is convincing.
2. Portability story is concrete, not hand-wavy. Most recent high-perf MoE / KV-store work quietly as

Weaknesses

1. Related-work positioning is a bit soft. The authors mention NVSHMEM, NIXL, Mooncake’s RDMA transfer engine, etc., but they stop short of a head-to-head systems comparison on all of them in the *same* setting. In particular, NIXL has begun adding EFA; NVSHMEM has both GPU-initiated and host-proxy modes; Mooncake targets KV-centric serving. A tighter comparison table (“who supports EFA,” “who assumes ordering,” “who can do MoE dispatch”) would make the novelty sharper. Right now the contributio

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Honestly, I was a bit surprised this approach worked at all when first reading the paper. I had assumed that test-time access to the functions that had not been "executed" with input/output pairs at train/finetune time would be required, and that this wouldn't work well outside of chain-of-thought, step-by-step walking through the functions line by line to compute the results at test time. That functions presented at finetune time are memorized in an executable way is not what I woul

Weaknesses

It feels almost accidental that the LLM's encoding of the functions it has seen at finetune time without input/output pairs is able to "lean on" / "borrow" from the input/output pair experience of the functions that did have input/output examples. It feels somewhat hacky: the empirical results do show this transfer works, but it doesn't feel reliable to me. I was unclear what the RL approach was from the paper; for the SFT approach, I think it is more obvious what you would do at finetune

Reviewer 03 · Rating 6 · Confidence 4

Strengths

S1. The paper is well-motivated and clearly written. It will be of interest to both computational linguists who study emergent behaviors and, potentially, NLP practitioners who want to finetune models for algorithmic reasoning.
S2. The paper explores a diverse array of data, from simple synthetic Python programs to LeetCode to ciphers and grammars.
S3. The ablations on data and stages answer a number of the questions that I had on first pass, indicating a substantial amoun

Weaknesses

W1. The hypotheses on line 215 about the effect of the acquisition phase of Proactive and on line 229 about the exposure phase of Retroactive could use evidence across datasets and models. Are they always doing something, or does the baseline that never sees the unpaired code until test time do just as well? Figure 4 (left) and Figure 6 show results for two task/model class combinations (Llama + Leetcode and Llama + Grammar), giving partial evidence supporting the acquisition phase, but unless I am
