Executing as You Generate: Hiding Execution Latency in LLM Code Generation
Zhensu Sun, Zhihao Lin, Zhi Chen, Chengran Yang, Mingyi Zhou, Li Li, David Lo

TL;DR
This paper introduces Eager, a method that executes code while it is still being generated, reducing latency in LLM-based coding agents by pipelining the generation, detection, and execution stages.
Contribution
It formalizes a parallel execution paradigm for LLM code generation, derives latency bounds, and presents Eager, an implementation that achieves substantial latency reductions.
Findings
Eager reduces non-overlapped execution latency by up to 99.9%.
Eager cuts end-to-end latency by up to 55%.
Eager is evaluated across four benchmarks and seven LLMs.
Abstract
Current LLM-based coding agents follow a serial execution paradigm: the model first generates the complete code, then invokes an interpreter to execute it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. We observe that, unlike human developers, LLMs produce code tokens sequentially without revision, making it possible to execute code as it is being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present Eager, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate Eager across four benchmarks, seven LLMs,…
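To make the idea of executing code as it is generated concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name `StreamingExecutor` and its methods are hypothetical, and the chunking rule (a top-level statement counts as complete once `ast.parse` sees a following statement begin) is a simplifying assumption standing in for Eager's AST-based chunking. The sketch runs chunks synchronously in the same process, so it shows only the detection and early-interruption logic, not real overlap between generation and execution, and it omits dynamic batching and gated execution entirely.

```python
import ast


class StreamingExecutor:
    """Hypothetical sketch: run top-level statements once the stream shows they are closed."""

    def __init__(self):
        self.namespace = {}   # globals shared by all executed chunks
        self.pending = ""     # complete lines that have not been executed yet
        self.partial = ""     # trailing text that has no newline yet
        self.executed = []    # chunks run so far, kept for inspection
        self.error = None     # first exception raised by a chunk, if any

    def feed(self, text):
        """Accept newly generated text; return False once a chunk has failed."""
        if self.error is not None:
            return False
        self.partial += text
        while "\n" in self.partial and self.error is None:
            line, self.partial = self.partial.split("\n", 1)
            self.pending += line + "\n"
            self._flush_closed_statements()
        return self.error is None

    def finish(self):
        """Generation ended: execute whatever is left."""
        if self.error is None:
            chunk = self.pending + self.partial
            self.pending = self.partial = ""
            if chunk.strip():
                self._run(chunk)
        return self.error is None

    def _flush_closed_statements(self):
        try:
            tree = ast.parse(self.pending)
        except SyntaxError:
            return  # incomplete so far (or invalid); wait for more text
        if len(tree.body) < 2:
            return  # a lone statement may still grow (e.g. an `else:` may follow)
        # Everything up to the end of the second-to-last statement is closed.
        # Assumes "\n" line endings so list indices match AST line numbers.
        cut = tree.body[-2].end_lineno
        lines = self.pending.split("\n")
        ready = "\n".join(lines[:cut]) + "\n"
        self.pending = "\n".join(lines[cut:])
        self._run(ready)

    def _run(self, chunk):
        try:
            exec(compile(chunk, "<stream>", "exec"), self.namespace)
            self.executed.append(chunk)
        except Exception as exc:
            self.error = exc  # early error interruption: stop accepting text


# Usage: token boundaries need not align with statement boundaries.
executor = StreamingExecutor()
for piece in ["total = 0\nfor i in ra", "nge(4):\n    total += i\n", "print(total)\n"]:
    if not executor.feed(piece):
        break  # a caller could abort generation here
executor.finish()
```

Because `feed` returns `False` as soon as a chunk raises, a caller can stop requesting tokens at that point instead of waiting for the full program. A real system would presumably hand chunks to a separate worker so that execution overlaps with generation.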
