WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding

Ran Wang; Xiaoxuan Liu; Hao Ren; Gang Chen; Fanchao Qi; Maosong Sun

arXiv:2507.16768 · cs.AI · July 23, 2025

Open Access

TL;DR

WGRAMMAR is a structured decoding engine that exploits prior knowledge about the output format to accelerate structured generation in large language models, achieving up to 250x speedup over existing systems.

Contribution

The paper presents WGRAMMAR, a lightweight decoding engine that decomposes constraints into static and dynamic parts, enabling faster structured decoding without relying on pushdown automata.

Findings

01  Achieves up to 250x speedup over existing systems.

02  Effectively models regular formats using compositional operators.

03  Reduces decoding latency by precompiling static structures.
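
To make the second finding concrete, here is a minimal sketch of how a regular format can be modeled with a small set of composable operators and no pushdown automaton. It is not the paper's implementation: the operator names (`Chars`, `Seq`, `Alt`, `Star`), the helpers `lit`, `deriv`, and `allowed_next`, and the character-level granularity are all assumptions made for illustration, and the state transition uses textbook Brzozowski derivatives rather than anything taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical operator set for regular formats (illustrative names only).
@dataclass(frozen=True)
class Empty: pass            # matches nothing

@dataclass(frozen=True)
class Eps: pass              # matches only the empty string

@dataclass(frozen=True)
class Chars:                 # matches one character from a non-empty set
    chars: frozenset

@dataclass(frozen=True)
class Seq:                   # a followed by b
    a: object
    b: object

@dataclass(frozen=True)
class Alt:                   # a or b
    a: object
    b: object

@dataclass(frozen=True)
class Star:                  # zero or more repetitions of a
    a: object

def seq(a, b):
    """Smart constructor: simplify so dead branches collapse to Empty."""
    if isinstance(a, Empty) or isinstance(b, Empty):
        return Empty()
    if isinstance(a, Eps):
        return b
    if isinstance(b, Eps):
        return a
    return Seq(a, b)

def alt(a, b):
    if isinstance(a, Empty):
        return b
    if isinstance(b, Empty):
        return a
    return a if a == b else Alt(a, b)

def lit(s):
    """A fixed string, built as a sequence of single-character sets."""
    r = Eps()
    for ch in reversed(s):
        r = seq(Chars(frozenset(ch)), r)
    return r

def nullable(r):
    """True if r accepts the empty string (the output may stop here)."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Seq):
        return nullable(r.a) and nullable(r.b)
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)
    return False  # Empty, Chars

def deriv(r, c):
    """State transition: what remains of r after consuming character c."""
    if isinstance(r, Chars):
        return Eps() if c in r.chars else Empty()
    if isinstance(r, Seq):
        left = seq(deriv(r.a, c), r.b)
        return alt(left, deriv(r.b, c)) if nullable(r.a) else left
    if isinstance(r, Alt):
        return alt(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, Star):
        return seq(deriv(r.a, c), r)
    return Empty()  # Empty, Eps

def allowed_next(r):
    """Characters that may legally come next; this is what a mask is built from."""
    if isinstance(r, Chars):
        return set(r.chars)
    if isinstance(r, Seq):
        out = allowed_next(r.a)
        return out | allowed_next(r.b) if nullable(r.a) else out
    if isinstance(r, Alt):
        return allowed_next(r.a) | allowed_next(r.b)
    if isinstance(r, Star):
        return allowed_next(r.a)
    return set()

# Usage: a bold tag wrapping text over a toy alphabet {a, b, c}.
fmt = seq(lit("<b>"), seq(Star(Chars(frozenset("abc"))), lit("</b>")))
state = fmt
for ch in "<b>ab":
    state = deriv(state, ch)
print(sorted(allowed_next(state)))  # ['<', 'a', 'b', 'c']
```

Because every state is itself an expression over the same operators, tracking the current state needs no stack. A real engine would work over tokenizer vocabulary entries rather than single characters, which this sketch does not attempt.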

Abstract

Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components -- precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing…
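
The static/dynamic decomposition and mask caching described in the abstract can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the wgrammar engine: the vocabulary, the `HEADER` and `PARAGRAPH` snippets, and the functions `instantiate`, `mask_for`, `constrained_decode`, and `scripted` are all hypothetical names, and a scripted function stands in for the language model. Static snippets are defined once ahead of time, a runtime argument (the paragraph count) instantiates them into a plan, and token masks are cached per step so repeated structure reuses the same mask.

```python
from functools import lru_cache

# Toy vocabulary standing in for a tokenizer (assumption for illustration).
VOCAB = ("<h1>", "</h1>", "<p>", "</p>", "a", "b", " ", "<eos>")
TOK = {t: i for i, t in enumerate(VOCAB)}
TEXT = frozenset(TOK[t] for t in ("a", "b", " "))

# Static part, prepared offline. Each step is (kind, token):
#   ("lit", t)  forces exactly token t;
#   ("text", t) allows free text tokens until the closing token t appears.
HEADER = (("lit", "<h1>"), ("text", "</h1>"))
PARAGRAPH = (("lit", "<p>"), ("text", "</p>"))

def instantiate(n_paragraphs):
    """Dynamic part: fill in the runtime argument to get a concrete plan."""
    return HEADER + PARAGRAPH * n_paragraphs + (("lit", "<eos>"),)

@lru_cache(maxsize=None)
def mask_for(step):
    """Token mask for one step, cached so repeated steps cost a lookup."""
    kind, tok = step
    allowed = {TOK[tok]} if kind == "lit" else TEXT | {TOK[tok]}
    return tuple(i in allowed for i in range(len(VOCAB)))

def constrained_decode(plan, propose):
    """Walk the plan, letting `propose` choose among the masked tokens."""
    out, pos = [], 0
    while pos < len(plan):
        kind, tok = plan[pos]
        mask = mask_for(plan[pos])
        choice = propose(mask)
        assert mask[choice], "proposal must respect the mask"
        out.append(VOCAB[choice])
        # A literal always advances; a text hole advances on its closing token.
        if kind == "lit" or choice == TOK[tok]:
            pos += 1
    return "".join(out)

def scripted(tokens):
    """Stand-in for an LLM: wants the given tokens, in order."""
    it = iter(tokens)
    def propose(mask):
        want = TOK[next(it)]
        # If the wanted token is illegal, fall back to the first legal one.
        return want if mask[want] else mask.index(True)
    return propose

script = ["<h1>", "a", "</h1>", "<p>", "b", "</p>",
          "<p>", "a", " ", "b", "</p>", "<eos>"]
print(constrained_decode(instantiate(2), scripted(script)))
# <h1>a</h1><p>b</p><p>a b</p><eos>
```

In this toy setup the second paragraph reuses the masks computed for the first, which is the effect mask caching is meant to achieve; how the actual system keys and stores its masks is not specified in the text above.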

Taxonomy

Topics: Algorithms and Data Compression