Mixture of Attentions For Speculative Decoding
Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang

TL;DR
This paper introduces a Mixture of Attentions architecture for speculative decoding in large language models, achieving faster decoding speeds and robustness in both single-device and client-server scenarios.
Contribution
It proposes a novel Mixture of Attentions architecture for small models in speculative decoding, improving speed, accuracy, and deployment flexibility over existing methods.
Findings
Achieves a 9.5% speedup over EAGLE-2
Demonstrates state-of-the-art latency with minimal server calls
Maintains higher accuracy during disconnections
Abstract
The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a…
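The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is generic greedy speculative decoding, not the paper's Mixture of Attentions architecture; `target_model`, `draft_model`, and the helper names are hypothetical toy stand-ins, and a real LLM would verify all proposed positions in a single batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    verify_rounds = 0
    while len(tokens) < goal:
        # 1) Draft: the small model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) Verify: the target checks each proposed position.
        #    (Looped here for clarity; conceptually one parallel pass.)
        verify_rounds += 1
        accepted: List[int] = []
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token was accepted: take one extra target token if room remains.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens, verify_rounds


# Hypothetical toy models: the draft agrees with the target except on some prefixes.
def target_model(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 11


def draft_model(prefix: List[int]) -> int:
    t = target_model(prefix)
    return t if len(prefix) % 4 else (t + 1) % 11


if __name__ == "__main__":
    out, rounds = speculative_decode(target_model, draft_model, [1], 12, k=4)
    print(out == greedy_decode(target_model, [1], 12), rounds)
```

Under greedy verification the output matches plain decoding with the target model, while the number of verification rounds is smaller than the number of generated tokens whenever the draft model is often right. That gap is where the speedup comes from.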
Peer Reviews
Decision·ICLR 2025 Poster
- The client-server framework with the ability to handle disconnections positions the approach as a practical advancement for deploying LLMs.
- The introduction of LSA and CA layers to mitigate partial observability and improve on-policyness makes sense.
1. The paper does not thoroughly justify its choice of parameter configurations or its training setup in the experiments. As discussed in Yi et al. (2024) [A], the training dataset and the number of parameters can significantly affect SD performance, but this paper does not address this. [A] Yi et al., 2024. Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters, EMNLP 2024-main.
2. A discussion of the memory-bound nature of LLMs is required in the paper.
3. Th
The organization and flow of the paper is very good. The background section is particularly thorough and helpful. The problem is well-explained (e.g. partial observability and lack of on-policyness are both detailed when explaining the methodology) so it is made clear what exactly the Mixture of Attentions method is aiming to solve. Additionally, the related work is well-addressed. It is clear exactly how this work is different from prior solutions. It is great that the client-server scenario
The experimentation is very narrow, especially since it only focuses on one model architecture and the improvements over EAGLE seem relatively small and inconsistent. It is therefore not convincing that this method would be effective more generally. It is not very clear why this problem/contribution is important. The paper would be stronger if the method was motivated by some real-world example where SD may be used, but would lead to significant problems that Mixture of Attentions would mitigat
- The work improves on EAGLE-2 and seems to achieve state-of-the-art results.
- The work provides a good background on speculative decoding.
- The work proposes an interesting client-server setup that fits well with the speculative decoding technique.
- The work lacks an overall view and clear statements that can improve readability.
- Method intuition: the method section only lays out the information of each component but does not provide an overall view of the proposed method as well as motivating intuitions for each design. The necessary intuitive descriptions are also not found in the appendix.
- Experiment result: the work only compares to one prior work, EAGLE-2, as a baseline, but did not provide information on how well EAGLE-2
Taxonomy
Topics: Cognitive Science and Education Research · Neural Networks and Applications · Fractal and DNA sequence analysis
