Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

Yuxuan Liu; Wenyuan Li; Laizhong Cui; Hailiang Yang

arXiv:2410.13344·cs.CL·October 18, 2024

TL;DR

Cerberus is a novel adaptive parallel decoding framework for large language models that balances accuracy and efficiency, achieving significant speedups and improved generation quality over existing methods.

Contribution

It introduces a gating mechanism and a new decoding head paradigm to adaptively select decoding strategies, enhancing inference speed and quality.
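
To make the idea of a per-step gate concrete, here is a minimal sketch of how a decoder might choose between auto-regressive and parallel decoding at each step. It is an illustration under stated assumptions, not the paper's method: this summary does not say how the gate is computed, so the per-head confidence scores, the `threshold` parameter, and the names `gate`, `plan`, and `StepDecision` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StepDecision:
    mode: str        # "parallel" or "autoregressive"
    num_tokens: int  # tokens proposed at this decoding step


def gate(head_confidences: Sequence[float], threshold: float = 0.5) -> StepDecision:
    """Hypothetical gate (an assumption, not the paper's formulation).

    head_confidences[i] is an assumed confidence score for the i-th extra
    decoding head. Parallel decoding is used only when at least the first
    extra head clears the threshold; otherwise the step falls back to
    ordinary auto-regressive decoding and emits a single token.
    """
    accepted = 0
    for confidence in head_confidences:
        if confidence < threshold:
            break  # stop at the first head that is not confident enough
        accepted += 1
    if accepted == 0:
        return StepDecision("autoregressive", 1)
    # One token from the base LM head plus one per accepted extra head.
    return StepDecision("parallel", 1 + accepted)


def plan(steps: List[Sequence[float]], threshold: float = 0.5) -> List[StepDecision]:
    """Apply the gate independently to each decoding step."""
    return [gate(step, threshold) for step in steps]


if __name__ == "__main__":
    for decision in plan([[0.9, 0.8, 0.2], [0.1, 0.9]]):
        print(decision)
```

In this toy version a "challenging" step is simply one where the first extra head is not confident, so the gate skips the extra parallel work for that step.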

Findings

01. Achieves up to 2.12x speedup over auto-regressive decoding.

02. Outperforms Medusa with 10-30% higher acceleration.

03. Provides superior generation quality.

Abstract

Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have identified two key issues with existing parallel decoding frameworks: (1) decoding heads fail to balance prediction accuracy and execution parallelism, and (2) parallel decoding is not a universal solution, as it can introduce unnecessary overhead at some challenging decoding steps. To address these issues, we propose Cerberus, an adaptive parallel decoding framework that introduces a gating mechanism enabling the LLM to adaptively choose an appropriate decoding approach at each decoding step, together with a new paradigm of decoding heads that incorporates sequential knowledge while maintaining execution parallelism. The experiment results…
