Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement
Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang

TL;DR
Cerberus is a novel adaptive parallel decoding framework for large language models that balances accuracy and efficiency, achieving significant speedups and improved generation quality over existing methods.
Contribution
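It introduces a gating mechanism that adaptively selects a decoding strategy at each decoding step, and a new decoding-head paradigm that adds sequential knowledge while keeping execution parallel, improving both inference speed and generation quality.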
Findings
Achieves up to 2.12x speedup over auto-regressive decoding.
Accelerates inference 10-30% more than Medusa.
Provides better generation quality than existing methods.
Abstract
Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have identified two key issues with existing parallel decoding frameworks: (1) decoding heads fail to balance prediction accuracy and execution parallelism, and (2) parallel decoding is not a universal solution, as it can introduce unnecessary overhead at some challenging decoding steps. To address these issues, we propose Cerberus, an adaptive parallel decoding framework that introduces a gating mechanism to let the LLM adaptively choose an appropriate decoding approach at each decoding step, along with a new paradigm of decoding heads that incorporates sequential knowledge while maintaining execution parallelism. The experiment results…
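The per-step gating idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say what signal the gate uses, so the entropy criterion, the threshold value, and the function names below are assumptions made purely for illustration and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_decoding_mode(next_token_probs, threshold=1.0):
    """Hypothetical gate: pick a decoding approach for the current step.

    Assumption (not from the paper): a low-entropy next-token distribution
    marks an "easy" step where parallel decoding heads are likely to be
    accepted, while a high-entropy distribution marks a "challenging" step
    where the extra heads would mostly add overhead.
    """
    if entropy(next_token_probs) < threshold:
        return "parallel"
    return "autoregressive"


# A peaked distribution is treated as an easy step, a flat one as a hard step.
easy_step = [0.97, 0.01, 0.01, 0.01]
hard_step = [0.25, 0.25, 0.25, 0.25]
print(choose_decoding_mode(easy_step))
print(choose_decoding_mode(hard_step))
```

In a real system the gate would sit inside the decoding loop and be evaluated once per step; any cheap confidence signal could stand in for the entropy used here.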
