Amphista: Bi-directional Multi-head Decoding for Accelerating LLM Inference

Zeping Li; Xinlong Yang; Ziheng Gao; Ji Liu; Guanchen Li; Zhuang Liu; Dong Li; Jinzhang Peng; Lu Tian; Emad Barsoum

arXiv:2406.13170·cs.AI·October 21, 2024

TL;DR

Amphista is a novel decoding framework for LLMs that accelerates inference by enabling bi-directional interaction among parallel decoding heads, significantly improving speed while preserving quality.

Contribution

It introduces a bi-directional attention mechanism and staged adaptation layers to enhance parallel decoding in LLMs, surpassing prior methods like Medusa.

Findings

1. Achieves up to 2.75× speedup over autoregressive decoding.
2. Runs inference 1.40× faster than Medusa on Vicuna 33B.
3. Maintains comparable generation quality under accelerated decoding.

Abstract

Large Language Models (LLMs) inherently use autoregressive decoding, which lacks parallelism at inference time and results in slow inference. While methods such as Medusa construct parallelized heads, they lack adequate information interaction across different prediction positions. To overcome this limitation, we introduce Amphista, an enhanced speculative decoding framework that builds upon Medusa. Specifically, Amphista models an Auto-embedding Block capable of parallel inference, incorporating bi-directional attention to enable interaction between different drafting heads. Additionally, Amphista integrates Staged Adaptation Layers, which ensure a seamless transition of semantic information from the target model's autoregressive inference to the drafting heads' non-autoregressive inference, effectively achieving a paradigm shift and feature fusion. Experimental results…
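
To illustrate what bi-directional interaction between drafting heads means in practice, the following is a toy NumPy sketch, not the authors' Auto-embedding Block. It assumes each drafting head is represented by one embedding vector and applies a single self-attention step over those vectors, once without a mask (bi-directional) and once with a causal mask. All names here (`head_attention`, `first_head_shift`, the sizes `K` and `d`, and the random weights) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(h, wq, wk, wv, bidirectional=True):
    """Single-head self-attention over K drafting-head embeddings h of shape (K, d).

    Hypothetical helper: with bidirectional=True every head position attends to
    every other position; otherwise a causal mask hides later positions.
    """
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(h.shape[-1])
    if not bidirectional:
        # Mask out entries above the diagonal (later positions).
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
K, d = 4, 8  # assumed: 4 drafting heads, embedding width 8
h = rng.normal(size=(K, d))
# Scale weights by 1/sqrt(d) so attention scores stay in a moderate range.
wq, wk, wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Perturb only the LAST head's embedding and see whether the FIRST head's output moves.
h_perturbed = h.copy()
h_perturbed[-1] += 1.0

def first_head_shift(bidirectional):
    a = head_attention(h, wq, wk, wv, bidirectional)
    b = head_attention(h_perturbed, wq, wk, wv, bidirectional)
    return float(np.abs(a[0] - b[0]).max())

bi_shift = first_head_shift(True)       # first head sees the later head
causal_shift = first_head_shift(False)  # first head cannot see the later head
print(f"bi-directional shift: {bi_shift:.4f}, causal shift: {causal_shift:.4f}")
```

In this sketch, changing the last head's embedding alters the first head's output only in the unmasked case, which is the kind of cross-position information flow the abstract says Medusa-style independent heads lack.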

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · Balanced Selection