Amphista: Bi-directional Multi-head Decoding for Accelerating LLM Inference
Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Guanchen Li, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad Barsoum

TL;DR
Amphista is a novel decoding framework for LLMs that accelerates inference by enabling bi-directional interaction among parallel decoding heads, significantly improving speed while preserving quality.
Contribution
It introduces a bi-directional attention mechanism and staged adaptation layers to enhance parallel decoding in LLMs, surpassing prior methods like Medusa.
Findings
Achieves up to 2.75× speedup over autoregressive decoding.
Attains 1.40× faster inference than Medusa on Vicuna 33B.
Maintains comparable generation quality with accelerated decoding.
Abstract
Large Language Models (LLMs) inherently use autoregressive decoding, which lacks parallelism in inference and results in significantly slow inference speed. While methods such as Medusa construct parallelized heads, they lack adequate information interaction across different prediction positions. To overcome this limitation, we introduce Amphista, an enhanced speculative decoding framework that builds upon Medusa. Specifically, Amphista models an Auto-embedding Block capable of parallel inference, incorporating bi-directional attention to enable interaction between different drafting heads. Additionally, Amphista integrates Staged Adaptation Layers, which ensure a seamless transition of semantic information from the target model's autoregressive inference to the drafting heads' non-autoregressive inference, effectively achieving paradigm shift and feature fusion. Experimental results…
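To make the idea of bi-directional interaction between drafting heads more concrete, here is a minimal sketch, not the authors' Auto-embedding Block. It assumes each drafting position is represented by a small feature vector and applies plain dot-product self-attention with no learned projections. The function names `softmax` and `self_attention` and the `causal` flag are hypothetical and chosen only for illustration. With `causal=True` a position sees only earlier positions; with `causal=False` every position also attends to later ones, which is the bi-directional case.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(head_states, causal=False):
    """Illustrative self-attention over K drafting positions.

    head_states: list of K feature vectors (lists of floats), one per
    drafting position. Queries, keys and values are all the raw input
    (an assumption made to keep the sketch short).
    """
    k = len(head_states)
    dim = len(head_states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for i in range(k):
        # Causal: position i sees positions 0..i.
        # Bi-directional: position i sees every position.
        visible = range(i + 1) if causal else range(k)
        scores = [
            scale * sum(a * b for a, b in zip(head_states[i], head_states[j]))
            for j in visible
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * head_states[j][d] for w, j in zip(weights, visible))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    altered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]  # only the last position differs

    # Under a causal mask the first position cannot see the change.
    print(self_attention(states, causal=True)[0] == self_attention(altered, causal=True)[0])
    # With bi-directional attention the first position's output shifts.
    print(self_attention(states)[0] == self_attention(altered)[0])
```

In this toy setting, changing the last position leaves the first position's causal output untouched but alters its bi-directional output, which is the kind of cross-position information flow the abstract says independent Medusa-style heads lack.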
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · Balanced Selection
