AutoHarness: improving LLM agents by automatically synthesizing a code harness

Xinghua Lou; Miguel Lázaro-Gredilla; Antoine Dedieu; Carter Wendelken; Wolfgang Lehrach; Kevin P. Murphy

arXiv:2603.03329·cs.CL·March 5, 2026

TL;DR

AutoHarness automatically synthesizes a code harness around an LLM agent. The harness prevents illegal actions, which lets a smaller model outperform larger ones, and synthesizing the policy itself in code improves reward cost-effectively.

Contribution

The paper introduces a method for automatically synthesizing code harnesses for LLM agents, replacing the manually written harnesses otherwise used to block prohibited actions and improving both safety and performance.

Findings

01 Prevents all illegal moves in 145 TextArena games

02 Smaller Gemini-2.5-Flash outperforms larger models such as Gemini-2.5-Pro

03 Synthesizing policies in code improves reward scores

Abstract

Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we…
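To make the idea of a harness concrete, here is a minimal sketch of one way a code harness could block illegal moves: a legality check wraps the agent's proposals and rejects anything the environment would prohibit. This is an illustration under stated assumptions, not the paper's implementation. The tic-tac-toe board, the names `is_legal_action`, `harnessed_policy`, and `noisy_proposer`, and the retry-then-fallback strategy are all hypothetical, and a random proposer stands in for the LLM. In the paper the harness code is synthesized by the model itself; here the check is written by hand.

```python
import random
from typing import Callable, List

Board = List[str]  # 9 cells, each "X", "O", or " " (empty); toy game for illustration


def is_legal_action(board: Board, action: int) -> bool:
    """Hypothetical harness check: a move is legal only if it targets an empty cell."""
    return isinstance(action, int) and 0 <= action < len(board) and board[action] == " "


def harnessed_policy(
    board: Board,
    propose: Callable[[Board], int],
    max_tries: int = 10,
) -> int:
    """Ask the agent for a move, rejecting illegal proposals.

    `propose` stands in for an LLM call (an assumption of this sketch).
    After `max_tries` rejected proposals, fall back to the first legal move.
    """
    for _ in range(max_tries):
        action = propose(board)
        if is_legal_action(board, action):
            return action
    legal = [i for i in range(len(board)) if is_legal_action(board, i)]
    if not legal:
        raise ValueError("no legal moves available")
    return legal[0]


def noisy_proposer(rng: random.Random) -> Callable[[Board], int]:
    """Stand-in for an unreliable agent: often proposes occupied or out-of-range cells."""
    def propose(board: Board) -> int:
        return rng.randrange(-2, 12)
    return propose


if __name__ == "__main__":
    board = ["X", "O", " ", "X", " ", "O", " ", " ", "X"]
    move = harnessed_policy(board, noisy_proposer(random.Random(0)))
    print("chosen move:", move, "legal:", is_legal_action(board, move))
```

Whatever the proposer returns, the wrapper only ever emits a move that passes the check, so illegal moves never reach the environment. The fallback to the first legal move is a design choice of this sketch; a real system might instead re-prompt the model with the rejection reason.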

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Games · Software Engineering Research