TL;DR
This paper introduces a theoretical framework and practical methods to prevent output stalling in LLM agents during large document synthesis. Deferred rendering reduces token usage by 48-72%, and the proposed methods eliminate stalling.
Contribution
The authors develop the Output Generation Capacity measure, prove a Format-Cost Separation Theorem, and formalize an adaptive strategy for optimal output generation, validated across multiple models and document types.
Findings
Deferred rendering reduces token usage by 48-72%.
Output stalling is eliminated with the proposed methods.
The framework is implemented as an open-source tool, GEN-PILOT.
Abstract
LLM-powered coding agents suffer from a poorly understood failure mode we term output stalling: the agent silently produces empty responses when attempting to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state, distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with an overhead multiplier, and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an…
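The idea of mapping the ratio of estimated output cost to available OGC onto a generation strategy can be illustrated with a minimal Python sketch. This is not the paper's algorithm or the GEN-PILOT API: the OGC proxy (remaining window scaled by a safety margin), the cost model (content tokens times a format overhead multiplier), the threshold values, and the "chunked" fallback are all assumptions made here for illustration. Only the "direct" and "deferred" strategy names echo the abstract.

```python
from dataclasses import dataclass


@dataclass
class ContextState:
    window_tokens: int    # raw context window size
    used_tokens: int      # tokens already consumed by the session
    safety_margin: float  # hypothetical fraction held back (0 <= margin < 1)


def estimate_ogc(state: ContextState) -> int:
    """Illustrative OGC proxy: remaining window, scaled down by a margin.

    The paper defines OGC formally; this stand-in only captures the stated
    property that OGC is smaller than the raw context window.
    """
    remaining = max(state.window_tokens - state.used_tokens, 0)
    return int(remaining * (1.0 - state.safety_margin))


def estimate_output_cost(content_tokens: int, overhead_multiplier: float) -> int:
    """Assumed cost model: format overhead inflates the content token count."""
    return int(content_tokens * overhead_multiplier)


def select_strategy(cost: int, ogc: int,
                    direct_max: float = 0.5,
                    deferred_max: float = 1.0) -> str:
    """Map cost/OGC to a strategy name. Thresholds are hypothetical."""
    if ogc <= 0:
        return "chunked"          # no capacity left: split the work
    ratio = cost / ogc
    if ratio <= direct_max:
        return "direct"           # output fits comfortably: generate as-is
    if ratio <= deferred_max:
        return "deferred"         # emit content now, render the template later
    return "chunked"              # hypothetical fallback for oversized outputs


state = ContextState(window_tokens=200_000, used_tokens=150_000, safety_margin=0.5)
ogc = estimate_ogc(state)
for content_tokens in (4_000, 10_000, 20_000):
    cost = estimate_output_cost(content_tokens, overhead_multiplier=2.0)
    print(content_tokens, cost, select_strategy(cost, ogc))
```

In this sketch the decision depends only on the ratio, so the same thresholds apply regardless of model or window size; whether the paper's framework shares that property cannot be determined from the truncated abstract.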
