Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree
Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, Feng Ji

TL;DR
Falcon is a semi-autoregressive speculative decoding framework that accelerates large language model inference by improving both the drafter's parallelism and its speculation accuracy, using the Coupled Sequential Glancing Distillation technique and a Custom-Designed Decoding Tree.
Contribution
Introduces Falcon, a semi-autoregressive speculative decoding method that pairs a new distillation technique (Coupled Sequential Glancing Distillation) with a Custom-Designed Decoding Tree, achieving faster inference with only a small increase in model complexity.
Findings
Achieves 2.91x to 3.51x speedup on benchmark datasets.
Outperforms existing speculative decoding methods in speed and accuracy.
Maintains high output quality with a lightweight drafter architecture.
Abstract
Striking an optimal balance between minimal drafting latency and high speculation accuracy to enhance the inference speed of Large Language Models remains a significant challenge in speculative decoding. In this paper, we introduce Falcon, an innovative semi-autoregressive speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy. We offer a comprehensive theoretical analysis to illuminate the underlying mechanisms. Additionally, we introduce a Custom-Designed Decoding Tree, which permits the drafter to generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, thereby boosting the number of drafted tokens and…
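The draft-then-verify loop that the abstract describes can be illustrated with a minimal sketch. This is not Falcon's implementation: `target_next`, `draft_block`, and `verify` are hypothetical toy functions, the "models" are simple arithmetic rules, and the decoding tree and distillation step are omitted entirely. The sketch only shows how a drafter that emits a block of `k` tokens per call, combined with greedy verification by the target model, reproduces the target's own output while calling the target fewer times than the number of generated tokens.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the target LLM: greedy next token given the prefix."""
    return (3 * prefix[-1] + len(prefix)) % VOCAB


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Toy semi-autoregressive drafter: returns k tokens from a single call.

    It imitates the target but is deliberately wrong whenever the context
    length is a multiple of 4, to simulate imperfect speculation.
    """
    block, ctx = [], list(prefix)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % VOCAB  # injected drafting error
        block.append(tok)
        ctx.append(tok)
    return block


def verify(prefix: List[int], block: List[int]) -> List[int]:
    """Accept the longest draft prefix that matches the target's greedy choice.

    On the first mismatch, the target's own token is emitted instead; if the
    whole block is accepted, one extra target token is appended. A real system
    would score all positions in one parallel target pass; the loop here only
    simulates that.
    """
    accepted, ctx = [], list(prefix)
    for tok in block:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))
    return accepted


def speculative_generate(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return the number of verification passes."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify(out, draft_block(out, k)))
        passes += 1
    return out[: len(prompt) + n_new], passes


def greedy_generate(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out
```

Calling `speculative_generate([1, 2], 20, 3)` returns the same sequence as `greedy_generate([1, 2], 20)`, together with a pass count lower than 20. The larger the share of drafted tokens the target accepts, the fewer target passes are needed, which is why the abstract stresses both drafting latency and speculation accuracy.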
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Dropout · Lookahead · Dense Connections · Byte Pair Encoding · Multi-Head Attention · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Adam · Layer Normalization
