Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
Ziyang Liu

TL;DR
This paper introduces Copy-as-Decode, a grammar-constrained decoding method for efficient and valid text and code editing with large language models, eliminating the need for full autoregressive regeneration.
Contribution
It presents a decoding-layer mechanism that copies unchanged input spans through parallel prefill and generates only new content autoregressively, speeding up editing while guaranteeing syntactically valid edit programs and keeping most gold tokens reachable.
Findings
Parallel prefill speeds up copying by up to 303x relative to autoregressive decoding on Qwen2.5-1.5B and 7B.
74-98% of gold tokens are reachable with line-level primitives.
Oracle programs round-trip successfully, localizing failures to span selection.
Abstract
LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying tokens via parallel prefill is -- faster…
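To make the two-primitive grammar concrete, the following is a minimal sketch of an interpreter that expands an edit program back into output lines. It is not the paper's implementation. The function name `apply_edit_program`, the regex-based parsing, the 1-indexed inclusive reading of `lines="i-j"`, and the choice to split `<gen>` content on newlines are all assumptions made for illustration. The sketch covers only the program semantics, not the token-level FSM or the KV-cache update.

```python
import re
from typing import List

# Hypothetical parser for the two primitives quoted in the abstract.
_PRIMITIVE = re.compile(r'<copy lines="(\d+)-(\d+)"/>|<gen>(.*?)</gen>', re.DOTALL)


def apply_edit_program(program: str, source_lines: List[str]) -> List[str]:
    """Expand an edit program into output lines.

    Assumption: line ranges are 1-indexed and inclusive.
    """
    output: List[str] = []
    pos = 0
    for match in _PRIMITIVE.finditer(program):
        # Only whitespace may appear between primitives.
        if program[pos:match.start()].strip():
            raise ValueError("unexpected text outside <copy/> or <gen>")
        pos = match.end()
        if match.group(1) is not None:
            start, end = int(match.group(1)), int(match.group(2))
            if not (1 <= start <= end <= len(source_lines)):
                raise ValueError(f"copy range {start}-{end} is out of bounds")
            # Copy span: taken verbatim from the input, nothing is generated.
            output.extend(source_lines[start - 1:end])
        else:
            # Gen span: new content, split into lines.
            output.extend(match.group(3).split("\n"))
    if program[pos:].strip():
        raise ValueError("unexpected trailing text in program")
    return output


source = ["def add(a, b):", "    return a - b", "", "print(add(1, 2))"]
program = '<copy lines="1-1"/><gen>    return a + b</gen><copy lines="3-4"/>'
edited = apply_edit_program(program, source)
print("\n".join(edited))
```

In this example only the one corrected line is new content; the other three lines are referenced by range. Under the mechanism the abstract describes, those copied spans are the ones that would be handled by a single parallel-prefill forward pass instead of token-by-token decoding.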
