Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Sukmin Cho; Sangjin Choi; Taeho Hwang; Jeongyeon Seo; Soyeong Jeong; Huije Lee; Hoyun Song; Jong C. Park; Youngjin Kwon

arXiv:2502.05609 · cs.CL · February 11, 2025

TL;DR

This paper introduces Hierarchy Drafting, a lossless speculative decoding method that organizes token sources hierarchically based on temporal locality, significantly improving inference speed and consistency across various large language models and tasks.

Contribution

The paper proposes Hierarchy Drafting, a novel hierarchical framework for speculative decoding that enhances speed and consistency without loss of accuracy.

Findings

01

HD outperforms existing database drafting methods in speed.

02

HD achieves robust inference speedups across different model sizes and tasks.

03

HD maintains consistent acceleration with minimal latency.

Abstract

Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing…
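
The drafting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `draft`, `verify`, `Database`, and `key_len` are hypothetical, each database is assumed to be a plain dictionary from the last few context tokens to a candidate continuation, and the target model is stood in for by a toy `next_token` function. The sketch only shows the two ideas the abstract states: databases are consulted in order from highest to lowest locality, and drafted tokens are kept only when the target model would have produced them anyway, which is what makes the method lossless under greedy decoding.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: a database maps the last `key_len` tokens to a continuation.
Key = Tuple[int, ...]
Database = Dict[Key, List[int]]


def draft(context: Sequence[int],
          databases: Sequence[Database],
          key_len: int = 2) -> List[int]:
    """Return draft tokens from the first database that has a match.

    `databases` is assumed to be ordered from highest to lowest locality,
    so lower-locality databases are consulted only on a miss.
    """
    key = tuple(context[-key_len:])
    for db in databases:
        hit = db.get(key)
        if hit:
            return list(hit)
    return []  # no draft available; decoding falls back to one token per step


def verify(context: Sequence[int],
           draft_tokens: Sequence[int],
           next_token: Callable[[Sequence[int]], int]) -> List[int]:
    """Greedy verification of a draft against the target model.

    Draft tokens are accepted while they equal the target model's own choice.
    At the first mismatch the model's token is used instead, so the output is
    identical to ordinary greedy decoding. A real system would score all draft
    positions in a single forward pass; this toy version calls `next_token`
    once per position for clarity.
    """
    accepted: List[int] = []
    for tok in draft_tokens:
        expected = next_token(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched: the same pass also yields one extra token.
    accepted.append(next_token(list(context) + accepted))
    return accepted
```

With a toy target model that always emits the previous token plus one (mod 10), a three-token draft whose last token is wrong yields two accepted draft tokens plus the model's correction, so three tokens are produced in one verification step instead of one.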

Code & Models

Repositories

zomss/Hierarchy_Drafting (PyTorch, official)

Videos

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings