WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding
Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun

TL;DR
WGRAMMAR is a decoding approach that uses prior knowledge about output structure to accelerate structured generation in large language models, achieving up to 250x speedup over existing systems.
Contribution
The paper presents WGRAMMAR, a lightweight decoding engine that decomposes constraints into static and dynamic parts, enabling faster structured decoding without relying on pushdown automata.
Findings
Achieves up to 250x speedup over existing systems.
Effectively models regular formats using compositional operators.
Reduces decoding latency by precompiling static structures.
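To make "compositional operators" for regular formats concrete, here is a minimal sketch, not the paper's implementation. The operator names (`Lit`, `Seq`, `Alt`, `Star`) and the helpers are hypothetical, and the matcher uses Brzozowski derivatives as a stand-in technique for checking whether a partial output can still be completed.

```python
from dataclasses import dataclass
from typing import Tuple


class Node:
    """Base class for the hypothetical format operators."""


@dataclass(frozen=True)
class Lit(Node):
    text: str                  # a fixed string


@dataclass(frozen=True)
class Seq(Node):
    parts: Tuple[Node, ...]    # parts in order


@dataclass(frozen=True)
class Alt(Node):
    options: Tuple[Node, ...]  # any one option


@dataclass(frozen=True)
class Star(Node):
    inner: Node                # zero or more repetitions


EMPTY = Alt(())  # matches nothing


def nullable(n: Node) -> bool:
    """True if n matches the empty string."""
    if isinstance(n, Lit):
        return n.text == ""
    if isinstance(n, Seq):
        return all(nullable(p) for p in n.parts)
    if isinstance(n, Alt):
        return any(nullable(o) for o in n.options)
    return True  # Star


def dead(n: Node) -> bool:
    """True if n matches no string at all."""
    if isinstance(n, Alt):
        return all(dead(o) for o in n.options)
    if isinstance(n, Seq):
        return any(dead(p) for p in n.parts)
    return False  # Lit and Star always match something


def step(n: Node, ch: str) -> Node:
    """Derivative of n: what is left to match after consuming ch."""
    if isinstance(n, Lit):
        return Lit(n.text[1:]) if n.text[:1] == ch else EMPTY
    if isinstance(n, Alt):
        return Alt(tuple(step(o, ch) for o in n.options))
    if isinstance(n, Star):
        return Seq((step(n.inner, ch), n))
    if not n.parts:  # empty Seq matches only ""
        return EMPTY
    head, rest = n.parts[0], Seq(n.parts[1:])
    first = Seq((step(head, ch),) + n.parts[1:])
    return Alt((first, step(rest, ch))) if nullable(head) else first


def accepts_prefix(n: Node, s: str) -> bool:
    """True if s can still be extended to a string in the format."""
    for ch in s:
        n = step(n, ch)
        if dead(n):
            return False
    return True


def accepts(n: Node, s: str) -> bool:
    """True if s is a complete string in the format."""
    for ch in s:
        n = step(n, ch)
    return nullable(n)


# Illustrative format: an HTML list whose items are "yes" or "no".
ITEM = Seq((Lit("<li>"), Alt((Lit("yes"), Lit("no"))), Lit("</li>")))
FORMAT = Seq((Lit("<ul>"), Star(ITEM), Lit("</ul>")))
```

With these definitions, `accepts_prefix(FORMAT, "<ul><li>y")` holds while `accepts_prefix(FORMAT, "<ul><p>")` does not. No stack is needed because the format is regular.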
Abstract
Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components -- precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing…
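The static/dynamic split and mask caching described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the token vocabulary, the `{item}` slot marker, and the function names are hypothetical and are not WGRAMMAR's API. The sketch also works on whole tokens, one per position, which is much simpler than a real decoder.

```python
from functools import lru_cache

# Toy token vocabulary (assumed for illustration).
VOCAB = ("<ul>", "</ul>", "<li>", "</li>", "yes", "no", "maybe")

# Static part: known ahead of time, so it can be prepared offline.
# "{item}" is a hypothetical marker for the dynamic argument.
STATIC_TEMPLATE = ("<ul>", "<li>", "{item}", "</li>", "</ul>")


def instantiate(template, item_choices):
    """Runtime step: fill the dynamic slot with request-specific choices.

    Returns one frozenset of allowed tokens per output position.
    """
    return tuple(
        frozenset(item_choices) if part == "{item}" else frozenset([part])
        for part in template
    )


@lru_cache(maxsize=None)
def token_mask(allowed):
    """Boolean mask over VOCAB for one state.

    Cached, so a state seen before costs a lookup instead of a vocabulary scan.
    """
    return tuple(tok in allowed for tok in VOCAB)


def masks_for(template, item_choices):
    """Per-position masks for one request."""
    return [token_mask(state) for state in instantiate(template, item_choices)]


# Two requests share the static template; only the slot contents differ.
first = masks_for(STATIC_TEMPLATE, ["yes", "no"])
second = masks_for(STATIC_TEMPLATE, ["maybe"])
```

In this sketch the four static positions hit the cache on the second request, and only the mask for the dynamic slot is computed again.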
Taxonomy
Topics: Algorithms and Data Compression
