Mixture of Attentions For Speculative Decoding

Matthieu Zimmer; Milan Gritta; Gerasimos Lampouras; Haitham Bou Ammar; Jun Wang

arXiv:2410.03804·cs.CL·April 4, 2025

TL;DR

This paper introduces a Mixture of Attentions architecture for speculative decoding in large language models, achieving faster decoding speeds and robustness in both single-device and client-server scenarios.

Contribution

It proposes a novel Mixture of Attentions architecture for small models in speculative decoding, improving speed, accuracy, and deployment flexibility over existing methods.

Findings

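01 Achieves a 9.5% decoding speedup over EAGLE-2 in the single-device setting

02 Demonstrates state-of-the-art latency with minimal calls to the server in the client-server setting

03 Maintains higher accuracy than other SD methods when the client is disconnected from the server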

Abstract

The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a…
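
The draft-then-verify loop that the abstract describes can be illustrated with a minimal sketch. This is generic greedy speculative decoding, not the paper's Mixture of Attentions architecture; `target_model` and `draft_model` are hypothetical toy functions standing in for the LLM and the small model, and the verification loop stands in for what a real system would compute in a single parallel forward pass.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # maps a prefix to its greedy next token


def target_model(prefix: List[Token]) -> Token:
    """Hypothetical 'large' model: next token is the prefix sum modulo 7."""
    return sum(prefix) % 7


def draft_model(prefix: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target unless the last token is 3."""
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if prefix[-1] == 3 else guess


def speculative_step(prefix: List[Token], draft: Model, target: Model, k: int) -> List[Token]:
    """Draft k tokens, verify them against the target, return the accepted tokens."""
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Target's greedy token at each of the k + 1 positions. A real LLM would
    # score all positions in one parallel forward pass; here it is a loop.
    verified = [target(prefix + drafted[:i]) for i in range(k + 1)]

    accepted: List[Token] = []
    for i in range(k):
        if drafted[i] != verified[i]:
            break
        accepted.append(drafted[i])

    # Always append the target's own token: the correction at the first
    # mismatch, or a bonus token if every drafted token was accepted.
    accepted.append(verified[len(accepted)])
    return accepted


def speculative_decode(prompt: List[Token], draft: Model, target: Model,
                       k: int, max_new: int) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[:len(prompt) + max_new], rounds


def greedy_decode(prompt: List[Token], target: Model, max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out
```

With greedy verification the output matches target-only decoding exactly; the saving comes from needing fewer verification rounds than generated tokens whenever the draft model's proposals are accepted.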

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The client-server framework with the ability to handle disconnections positions the approach as a practical advancement for deploying LLMs.
- The introduction of LSA and CA layers to mitigate partial observability and improve on-policyness makes sense.

Weaknesses

1. The paper does not thoroughly justify its choice of parameter configurations and training setup in its experiments. As discussed in Yi et al. (2024), the training dataset and the number of parameters can significantly affect SD performance, but this paper does not address this [A].
   [A] Yi et al., 2024. Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters, EMNLP 2024-main.
2. A discussion of the memory-bound nature of LLMs is required in the paper.
3. Th

Reviewer 02 · Rating 8 · Confidence 2

Strengths

The organization and flow of the paper is very good. The background section is particularly thorough and helpful. The problem is well-explained (e.g. partial observability and lack of on-policyness are both detailed when explaining the methodology) so it is made clear what exactly the Mixture of Attentions method is aiming to solve. Additionally, the related work is well-addressed. It is clear exactly how this work is different from prior solutions. It is great that the client-server scenario

Weaknesses

The experimentation is very narrow, especially since it only focuses on one model architecture and the improvements over EAGLE seem relatively small and inconsistent. It is therefore not convincing that this method would be effective more generally. It is not very clear why this problem/contribution is important. The paper would be stronger if the method was motivated by some real-world example where SD may be used, but would lead to significant problems that Mixture of Attentions would mitigat

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The work improves upon EAGLE-2 and seems to achieve state-of-the-art results.
- The work provides a good background on speculative decoding.
- The work proposes an interesting client-server setup that fits well with the speculative decoding technique.

Weaknesses

- The work lacks an overall view and clear statements that can improve readability.
- Method intuition: the method section only lays out the information of each component but does not provide an overall view of the proposed method as well as motivating intuitions for each design. The necessary intuitive descriptions are also not found in the appendix.
- Experiment result: the work only compares to one prior work, EAGLE-2, as a baseline, but did not provide information on how well EAGLE-2

Code & Models

Repositories

huawei-noah/hebo · PyTorch · Official

Taxonomy

Topics: Cognitive Science and Education Research · Neural Networks and Applications · Fractal and DNA sequence analysis