How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent
Sungwoo Jung, Seonil Son

TL;DR
This paper quantifies the residual role of large language models in planning agents by externalizing and measuring different harness layers, revealing that much of the agent's competence can be achieved without frequent LLM calls.
Contribution
It introduces a methodology to measure the LLM's residual contribution in a layered planning agent, separating it from other harness components.
Findings
Declarative planning significantly improves win rate without LLM calls.
Symbolic reflection effects are calibration-sensitive and cancel out overall.
LLM-backed revision activates on only 4.3% of turns, with limited impact.
Abstract
Agent harnesses -- the stateful programs that wrap a language model and decide what it sees at each step -- are now known to change end-to-end performance on a fixed model by as much as six times. That raises a question asked less often than it should be: how much of an agent's competence does the harness itself already carry, and how much genuinely still needs the LLM? We externalize a planning harness for noisy Collaborative Battleship into four progressively richer layers -- posterior belief tracking, declarative planning, symbolic reflection, and an LLM-backed revision gate -- under a common runtime, taking win rate as the primary metric and F1 as secondary, and pre-specifying heavy lifting as the single largest positive marginal to the primary metric. Across 54 games, declarative planning carries the heavy lifting (pp win rate over a belief-only…
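The pre-specified criterion in the abstract (heavy lifting = the single largest positive marginal to the primary metric) can be illustrated with a minimal sketch. The layer names below are shorthand chosen for this example, the function names are hypothetical, and the win rates are made-up placeholders in percentage points, not the paper's results; the sketch also assumes each layer's marginal is measured against the configuration immediately below it.

```python
from typing import Dict, List, Optional, Tuple

# Layer order follows the abstract; the identifiers are shorthand for this sketch.
LAYERS = ["belief", "planning", "reflection", "llm_revision"]


def marginals(win_rate_pp: Dict[str, int]) -> List[Tuple[str, int]]:
    """Win-rate change (percentage points) from adding each layer to the previous one."""
    return [
        (cur, win_rate_pp[cur] - win_rate_pp[prev])
        for prev, cur in zip(LAYERS, LAYERS[1:])
    ]


def heavy_lifter(win_rate_pp: Dict[str, int]) -> Optional[str]:
    """Return the layer with the single largest positive marginal, or None if no marginal is positive."""
    positive = [(layer, delta) for layer, delta in marginals(win_rate_pp) if delta > 0]
    if not positive:
        return None
    return max(positive, key=lambda item: item[1])[0]


# Placeholder win rates in percentage points (NOT taken from the paper).
placeholder = {"belief": 50, "planning": 75, "reflection": 75, "llm_revision": 80}

if __name__ == "__main__":
    for layer, delta in marginals(placeholder):
        print(f"{layer}: {delta:+d} pp")
    print("heavy lifting:", heavy_lifter(placeholder))
```

Fixing the criterion before running the games, as the abstract describes, means the "heavy lifting" label is decided by this comparison alone rather than chosen after seeing which layer looks best.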
