Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs
Chenxi Sun, Hongzhi Zhang, Zijia Lin, Jingyuan Zhang, Fuzheng Zhang, Zhongyuan Wang, Bin Chen, Chengru Song, Di Zhang, Kun Gai, Deyi Xiong

TL;DR
This paper presents Lexical Unit Decoding (LUD), a novel parallel decoding method for large language models that significantly speeds up generation without sacrificing quality, enabling more real-time applications.
Contribution
LUD introduces a data-driven, architecture-agnostic approach to parallel decoding by predicting lexical units, improving speed while maintaining output quality, and can be combined with other methods.
Findings
33% speed-up in natural language generation with no quality loss
30% speed-up in code generation with negligible quality loss
No auxiliary models or architecture changes needed
Abstract
Large language models have demonstrated exceptional capability in natural language understanding and generation. However, their generation speed is limited by the inherently sequential nature of their decoding process, posing challenges for real-time applications. This paper introduces Lexical Unit Decoding (LUD), a novel decoding methodology implemented in a data-driven manner, accelerating the decoding process without sacrificing output quality. The core of our approach is the observation that a pre-trained language model can confidently predict multiple contiguous tokens, forming the basis for a lexical unit, in which these contiguous tokens could be decoded in parallel. Extensive experiments validate that our method substantially reduces decoding time while maintaining generation quality, i.e., 33% speed up on natural language generation with no quality loss, and 30%…
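The idea in the abstract, accepting several contiguous tokens in one step when the model is confident about all of them, can be illustrated with a small sketch. This is not the authors' implementation: the function names (accept_prefix, decode, toy_propose), the 0.9 confidence threshold, the three-token proposal window, and the toy probabilities are all assumptions made for illustration. A real system would obtain the candidate tokens and their probabilities from a single forward pass of the model.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (token, model confidence); hypothetical format


def accept_prefix(candidates: List[Candidate], threshold: float) -> List[str]:
    """Keep the first token, then following tokens while confidence >= threshold."""
    if not candidates:
        return []
    accepted = [candidates[0][0]]  # the first token is always emitted
    for token, prob in candidates[1:]:
        if prob < threshold:
            break  # the unit ends at the first low-confidence token
        accepted.append(token)
    return accepted


def decode(
    propose: Callable[[List[str]], List[Candidate]],
    prompt: List[str],
    max_new_tokens: int,
    threshold: float = 0.9,  # assumed value, not taken from the paper
    eos: str = "<eos>",
) -> Tuple[List[str], int]:
    """Generate tokens unit by unit; return (sequence, number of decoding steps)."""
    output = list(prompt)
    steps = 0
    generated = 0
    while generated < max_new_tokens:
        candidates = propose(output)  # stands in for one forward pass
        if not candidates:
            break
        unit = accept_prefix(candidates, threshold)[: max_new_tokens - generated]
        steps += 1
        for token in unit:
            output.append(token)
            generated += 1
            if token == eos:
                return output, steps
    return output, steps


# Toy stand-in for a model: a fixed continuation with made-up confidences.
TARGET: List[Candidate] = [
    ("New", 0.60), ("York", 0.95), ("City", 0.92),
    ("is", 0.50), ("big", 0.70), ("<eos>", 0.99),
]


def toy_propose(context: List[str]) -> List[Candidate]:
    start = len(context) - 1  # assumes a one-token prompt
    return TARGET[start:start + 3]  # propose up to three tokens per step


tokens, steps = decode(toy_propose, ["<bos>"], max_new_tokens=10)
print(tokens, steps)
```

In this toy run, six tokens are produced in three decoding steps instead of six, because "York City" and "<eos>" are confident enough to be emitted together with the token before them. The actual speed-up depends on how often the model is confident about multi-token spans.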
Taxonomy
Topics: Natural Language Processing Techniques · Library Science and Information Systems · Mathematics, Computing, and Information Processing
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
