Cross-Attention Speculative Decoding

Wei Zhong; Manasa Bharadwaj; Yixiao Wang; Yipeng Ji; Chul Lee

arXiv:2505.24544·cs.CL·February 12, 2026

TL;DR

This paper introduces Budget EAGLE (Beagle), a cross-attention-based Transformer decoder for speculative decoding that simplifies the draft architecture, improves training efficiency, and keeps inference speedups on par with leading self-attention models such as EAGLE-v2.

Contribution

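The paper presents Budget EAGLE (Beagle), which the authors describe as the first cross-attention-based speculative decoding (SD) model to their knowledge. It eliminates pooling and other auxiliary components, and it is trained with Two-Stage Block-Attention Training for stable, efficient training.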
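To make the cross-attention idea concrete, here is a minimal single-head sketch in plain Python. It is a generic illustration and not the authors' implementation: the names `cross_attention`, `draft_queries`, and `target_states` are hypothetical, learned projections and multiple heads are omitted, and the choice that queries come from the draft side while keys and values come from target-model states is an assumption made for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(draft_queries: Matrix, target_states: Matrix,
                    target_values: Matrix) -> Matrix:
    """Single-head scaled dot-product cross-attention.

    Each query row attends over all key rows (target_states) and returns
    a weighted average of the matching value rows (target_values).
    Learned Q/K/V projections are left out to keep the sketch short.
    """
    dim = len(target_states[0])
    outputs: Matrix = []
    for query in draft_queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in target_states
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[col] for w, value in zip(weights, target_values))
            for col in range(len(target_values[0]))
        ])
    return outputs


# With identical keys the attention weights are uniform, so the output
# is the plain average of the value rows.
print(cross_attention([[1.0, 0.0]],
                      [[0.0, 0.0], [0.0, 0.0]],
                      [[2.0, 4.0], [4.0, 8.0]]))  # [[3.0, 6.0]]
```

The point of the sketch is that queries and keys/values come from different sequences, so no pooling or fusion step is needed to combine the two streams before attention.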

Findings

01

Beagle achieves inference speedups comparable to a leading self-attention speculative decoding model (EAGLE-v2).

02

Beagle demonstrates higher training efficiency across multiple LLMs.

03

The architecture simplifies speculative decoding without sacrificing performance.

Abstract

Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), the first, to our knowledge, cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we…
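The draft-then-verify loop behind speculative decoding can be sketched as follows, assuming greedy decoding. This is a toy illustration, not Beagle or EAGLE-v2: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are simple integer functions, and a real system would verify all drafted positions in one batched target forward pass rather than one call per position.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # hypothetical greedy next-token interface


def speculative_step(prefix: List[Token], draft: NextTokenFn,
                     target: NextTokenFn, k: int) -> List[Token]:
    """One draft-then-verify step; returns the tokens accepted in this step."""
    # 1) The draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2) The target accepts the longest prefix of the proposal it agrees with.
    accepted: List[Token] = []
    ctx = list(prefix)
    for token in proposed:
        expected = target(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3) Everything matched, so the target contributes one bonus token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[Token], draft: NextTokenFn, target: NextTokenFn,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Run speculative steps until max_new tokens exist; also count steps."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        steps += 1
    return out[: len(prefix) + max_new], steps


def toy_target(ctx: List[Token]) -> Token:
    """Toy target model: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, steps = generate([0], toy_draft, toy_target, k=3, max_new=8)
print(tokens, steps)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 3
```

Under greedy decoding the output matches what the target alone would produce. The gain comes from needing fewer verification steps than generated tokens, which is why speedups depend on how well the draft and target models are aligned.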

Peer Reviews

Decision·Submitted to NeurIPS 2025

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- **Sound empirical evidence** – Head-to-head comparisons with EAGLE-v2 on three 7B-parameter backbones, plus ablations on window size and simulation steps, support the technical claims.
- **Reduced resource footprint** – Achieves similar ≈3× speculative-decoding speed-ups while lowering peak GPU memory by 10–15% and cutting parameters by removing pooling/fusion layers.
- **Methodological clarity** – Architecture and two-stage training pipeline are clearly illustrated in Figs.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Beagle is the first known cross-attention-only SD model without self-attention or pooling layers.
2. It provides an efficient two-stage training method.
3. The paper offers solid theoretical and empirical grounding.

Weaknesses

1. The method is tested only on 7B models, including two outdated model series, Vicuna and Llama-2.
2. It is tested only with greedy decoding.
3. The architectural contribution seems limited.

Reviewer 03 · Rating 3 · Confidence 4

Strengths

Overall, I like the idea of this paper. The design of the draft model presented here is more concise and elegant compared to EAGLE. Several methodological components are also novel, such as the introduction of test-time training into the cross-attention-based draft model, the design of inverse block attention, and the two-stage training framework. It is evident that the authors have put considerable thought into developing these modules.

Weaknesses

1. **Lack of Superior Ac

Taxonomy

Topics: Error Correcting Code Techniques · Cellular Automata and Applications · Computability, Logic, AI Algorithms

Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding