Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree

Xiangxiang Gao; Weisheng Xie; Yiwei Xiang; Feng Ji

arXiv:2412.12639·cs.CL·April 23, 2025

TL;DR

Falcon is a semi-autoregressive speculative decoding framework that accelerates large language model inference by improving both the drafter's parallelism and its speculation accuracy, using a dedicated distillation technique and a custom-designed decoding tree.

Contribution

Introduces Falcon, a semi-autoregressive decoding method with a new distillation technique and decoding tree, achieving faster inference with minimal model complexity increase.

Findings

1. Achieves 2.91x to 3.51x speedup on benchmark datasets.

2. Outperforms existing speculative decoding methods in speed and accuracy.

3. Maintains high output quality with a lightweight drafter architecture.

Abstract

Striking an optimal balance between minimal drafting latency and high speculation accuracy to enhance the inference speed of Large Language Models remains a significant challenge in speculative decoding. In this paper, we introduce Falcon, an innovative semi-autoregressive speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy. We offer a comprehensive theoretical analysis to illuminate the underlying mechanisms. Additionally, we introduce a Custom-Designed Decoding Tree, which permits the drafter to generate multiple tokens in a single forward pass and accommodates multiple forward passes as needed, thereby boosting the number of drafted tokens and…
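The draft-then-verify loop that the abstract builds on can be illustrated with a small sketch. This is not Falcon's implementation: it omits Coupled Sequential Glancing Distillation and the decoding tree, and it verifies a single linear block rather than a tree of candidates. The names `target_next`, `draft_block`, `verify`, and `speculative_decode` are hypothetical, and both "models" are toy integer functions chosen only so the example runs. What it does show is the general idea that a semi-autoregressive drafter proposes `k` tokens per forward pass, and the target model accepts the longest correct prefix and supplies one token of its own.

```python
from typing import List, Tuple

Token = int


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the target LLM's greedy next token (hypothetical)."""
    return (context[-1] * 3 + 1) % 7


def draft_block(context: List[Token], k: int) -> List[Token]:
    """Toy semi-autoregressive drafter: proposes k tokens per 'forward pass'.

    It mimics the target but is deliberately wrong whenever the correct
    token would be 0, so that the rejection path is exercised.
    """
    block: List[Token] = []
    last = context[-1]
    for _ in range(k):
        guess = (last * 3 + 1) % 7
        if guess == 0:
            guess = 5  # injected drafter error (assumption for illustration)
        block.append(guess)
        last = guess
    return block


def verify(context: List[Token], block: List[Token]) -> List[Token]:
    """Accept the longest prefix of `block` that matches the target's greedy
    output, then append one token from the target itself.

    A real system scores all drafted positions in one parallel target pass;
    the serial loop here only reproduces the acceptance rule.
    """
    accepted: List[Token] = []
    for tok in block:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token
    return accepted


def speculative_decode(prompt: List[Token], num_tokens: int, k: int) -> Tuple[List[Token], int]:
    """Generate num_tokens tokens; return the sequence and the number of
    draft/verify rounds (a proxy for target forward passes)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(verify(out, draft_block(out, k)))
        rounds += 1
    return out[: len(prompt) + num_tokens], rounds


def greedy_decode(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    seq, rounds = speculative_decode([1], num_tokens=12, k=3)
    print(seq == greedy_decode([1], 12), rounds)
```

Under greedy verification the output is identical to plain autoregressive decoding with the target model, and every round yields at least one token, so the number of rounds never exceeds the number of generated tokens. Any speedup comes from how many drafted tokens are accepted per round, which is why the abstract stresses both drafting latency and speculation accuracy.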

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Dropout · Lookahead · Dense Connections · Byte Pair Encoding · Multi-Head Attention · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Adam · Layer Normalization