TL;DR
Pre$^3$ converts LR(1) transition graphs into deterministic pushdown automata (DPDA) to speed up structured LLM generation, reducing time per output token by up to 40% and increasing throughput by up to 36%.
Contribution
The paper presents an approach that transforms LR(1) transition graphs into DPDAs, enabling faster structured output generation in LLMs.
Findings
Reduced time per output token by up to 40%.
Increased throughput by up to 36%.
Seamless integration into standard inference frameworks.
Abstract
Extensive LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), leading to runtime execution overhead for context-dependent token processing, which is especially inefficient under large inference batches. To address these issues, we propose Pre$^3$, which exploits deterministic pushdown automata (DPDA) to optimize constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge…
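To make the "no runtime path exploration" idea concrete, here is a minimal sketch of a generic deterministic pushdown automaton in Python. It is not the paper's construction: the toy bracket language, the single state `q`, the `BOTTOM` marker, and the helper names `step`, `accepts`, and `allowed_symbols` are all assumptions chosen for illustration. It shows only the textbook property that a DPDA has at most one transition per (state, input symbol, stack top), so each step is a single table lookup against a table built ahead of time.

```python
from typing import Dict, List, Optional, Tuple

# Key: (state, input symbol, stack top).
# Value: (next state, symbols that replace the popped stack top).
Transition = Dict[Tuple[str, str, str], Tuple[str, Tuple[str, ...]]]

BOTTOM = "$"  # hypothetical stack-bottom marker

# Toy language (an assumption, not from the paper): properly nested [] and {}.
# The table is built once, ahead of time, before any input is processed.
TABLE: Transition = {}
for top in (BOTTOM, "[", "{"):
    TABLE[("q", "[", top)] = ("q", (top, "["))  # keep the top, push "["
    TABLE[("q", "{", top)] = ("q", (top, "{"))  # keep the top, push "{"
TABLE[("q", "]", "[")] = ("q", ())              # pop the matching "["
TABLE[("q", "}", "{")] = ("q", ())              # pop the matching "{"


def step(state: str, stack: List[str], symbol: str) -> Optional[Tuple[str, List[str]]]:
    """Take one deterministic step; return None if no transition exists."""
    key = (state, symbol, stack[-1])
    if key not in TABLE:
        return None
    next_state, replacement = TABLE[key]
    # Pop the old top and push its replacement symbols.
    return next_state, stack[:-1] + list(replacement)


def accepts(text: str) -> bool:
    """Run the automaton over text; accept if only the bottom marker remains."""
    state, stack = "q", [BOTTOM]
    for ch in text:
        result = step(state, stack, ch)
        if result is None:
            return False
        state, stack = result
    return stack == [BOTTOM]


def allowed_symbols(state: str, stack: List[str], alphabet: str = "[]{}") -> List[str]:
    """Symbols with a defined transition from the current configuration."""
    return [s for s in alphabet if (state, s, stack[-1]) in TABLE]
```

In a constrained-decoding setting, something like `allowed_symbols` would play the role of the mask over next tokens: because the table is fixed in advance, the mask for a given configuration can be computed by lookups alone, with no backtracking over alternative parses.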
