Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs

Chenxi Sun; Hongzhi Zhang; Zijia Lin; Jingyuan Zhang; Fuzheng Zhang; Zhongyuan Wang; Bin Chen; Chengru Song; Di Zhang; Kun Gai; Deyi Xiong

arXiv:2405.15208·cs.CL·May 27, 2024

Open Access

TL;DR

This paper presents Lexical Unit Decoding (LUD), a parallel decoding method for large language models. It speeds up generation by 33% on natural language with no quality loss and by 30% on code with negligible quality loss, which makes real-time applications more practical.

Contribution

LUD is a data-driven, architecture-agnostic approach to parallel decoding: the model predicts lexical units, spans of contiguous tokens that are decoded together, which improves speed while maintaining output quality. It can be combined with other methods.

Findings

01. 33% speed-up in natural language generation with no quality loss

02. 30% speed-up in code generation with negligible quality loss

03. No auxiliary models or architecture changes needed

Abstract

Large language models have demonstrated exceptional capability in natural language understanding and generation. However, their generation speed is limited by the inherently sequential nature of their decoding process, posing challenges for real-time applications. This paper introduces Lexical Unit Decoding (LUD), a novel decoding methodology implemented in a data-driven manner, accelerating the decoding process without sacrificing output quality. The core of our approach is the observation that a pre-trained language model can confidently predict multiple contiguous tokens, forming the basis for a "lexical unit", in which these contiguous tokens could be decoded in parallel. Extensive experiments validate that our method substantially reduces decoding time while maintaining generation quality, i.e., 33% speed up on natural language generation with no quality loss, and 30%…
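
The idea of emitting several confidently predicted contiguous tokens in one step can be illustrated with a small sketch. The abstract does not give the actual acceptance rule, so everything here is an assumption made for illustration: the confidence threshold, the names `accept_lexical_unit`, `decode`, and `toy_propose`, and the toy "model" that returns fixed confidences. It shows only the general shape of the technique, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

# A candidate is (token, confidence). Hypothetical representation, not from the paper.
Candidate = Tuple[str, float]


def accept_lexical_unit(candidates: Sequence[Candidate],
                        threshold: float = 0.9) -> List[str]:
    """Keep the first token unconditionally, then extend the unit while
    each following token's confidence stays at or above the threshold."""
    if not candidates:
        return []
    unit = [candidates[0][0]]  # first token behaves like ordinary decoding
    for token, confidence in candidates[1:]:
        if confidence < threshold:
            break  # the unit ends at the first low-confidence token
        unit.append(token)
    return unit


def decode(propose: Callable[[List[str]], Sequence[Candidate]],
           prompt: List[str],
           max_tokens: int,
           threshold: float = 0.9) -> Tuple[List[str], int]:
    """Generate up to max_tokens tokens, counting one step per proposal call."""
    output: List[str] = []
    steps = 0
    while len(output) < max_tokens:
        candidates = propose(prompt + output)
        if not candidates:
            break
        unit = accept_lexical_unit(candidates, threshold)
        output.extend(unit[: max_tokens - len(output)])
        steps += 1
    return output, steps


# Toy stand-in for a model: a fixed sentence with made-up confidences.
TARGET: List[Candidate] = [
    ("New", 0.60), ("York", 0.97), ("City", 0.95), ("is", 0.50), ("big", 0.40),
]


def toy_propose(context: List[str], k: int = 3) -> Sequence[Candidate]:
    """Return the next k tokens of TARGET (assumes an empty prompt)."""
    start = len(context)
    return TARGET[start:start + k]


if __name__ == "__main__":
    tokens, steps = decode(toy_propose, [], max_tokens=5)
    print(tokens, steps)
```

In this toy run, "New York City" is emitted as a single unit, so five tokens take three steps instead of five. The real speed-up depends on how often the model is confident about several tokens at once, which this sketch does not attempt to model.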

Taxonomy

Topics: Natural Language Processing Techniques · Library Science and Information Systems · Mathematics, Computing, and Information Processing