SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Heming Xia; Yongqi Li; Jun Zhang; Cunxiao Du; Wenjie Li

arXiv:2410.06916 · cs.CL · March 7, 2025

TL;DR

SWIFT is an adaptive self-speculative decoding method that accelerates large language model inference by skipping intermediate layers of the target model on the fly to draft tokens. It needs no extra training or auxiliary model and reports 1.3x-1.6x speedups.

Contribution

It presents a novel, plug-and-play, layer-skipping decoding algorithm that adaptively accelerates LLM inference without auxiliary models or additional training.

Findings

1. Achieves 1.3x-1.6x speedup in inference.
2. Preserves the original output distribution.
3. Works across diverse models and tasks.

Abstract

Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to verify them in parallel. While this technique has achieved notable speedups, most existing approaches necessitate either additional parameters or extensive training to construct effective draft models, thereby restricting their applicability across different LLMs and tasks. To address this limitation, we explore a novel plug-and-play SD solution with layer-skipping, which skips intermediate layers of the target LLM as the compact draft model. Our analysis reveals that LLMs exhibit great potential for self-acceleration through layer sparsity and the task-specific nature of this sparsity. Building on these insights, we introduce SWIFT, an on-the-fly…
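The draft-then-verify loop described in the abstract can be illustrated with a toy sketch. This is not the SWIFT implementation: the "model" is a hypothetical stack of integer-mixing functions, decoding is greedy only, the skipped-layer set is fixed by hand rather than chosen adaptively, and verification runs in a Python loop where a real system would use one parallel forward pass. All names (`make_layers`, `next_token`, `self_speculative_decode`) are assumptions made for illustration.

```python
from typing import AbstractSet, Callable, List, Sequence, Tuple

Layer = Callable[[int], int]
VOCAB = 17  # toy vocabulary size (assumption)
MOD = 1009  # toy hidden-state range (assumption)


def make_layers(n: int) -> List[Layer]:
    """Build n hypothetical toy 'layers' that each mix an integer state."""
    return [lambda h, k=k: (h * (k + 2) + k) % MOD for k in range(n)]


def next_token(layers: Sequence[Layer], context: Sequence[int],
               skip: AbstractSet[int] = frozenset()) -> int:
    """Greedy next token; layers whose index is in `skip` are bypassed."""
    h = sum((i + 1) * t for i, t in enumerate(context)) % MOD
    for idx, layer in enumerate(layers):
        if idx not in skip:
            h = layer(h)
    return h % VOCAB


def greedy_decode(layers: Sequence[Layer], prompt: Sequence[int],
                  n_new: int) -> List[int]:
    """Baseline: one full-model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(layers, out))
    return out


def self_speculative_decode(layers: Sequence[Layer], prompt: Sequence[int],
                            n_new: int, skip: AbstractSet[int],
                            draft_len: int = 4) -> Tuple[List[int], int, int]:
    """Draft with the layer-skipped model, verify with the full model.

    Returns (tokens, accepted_draft_tokens, total_draft_tokens).
    """
    out = list(prompt)
    target_len = len(prompt) + n_new
    accepted = drafted = 0
    while len(out) < target_len:
        # Draft: the same model with some layers skipped proposes k tokens.
        k = min(draft_len, target_len - len(out))
        draft: List[int] = []
        for _ in range(k):
            draft.append(next_token(layers, out + draft, skip))
        drafted += k
        # Verify: keep draft tokens while the full model agrees; on the
        # first disagreement, emit the full model's token and redraft.
        for tok in draft:
            target_tok = next_token(layers, out)
            out.append(target_tok)
            if target_tok != tok:
                break
            accepted += 1
    return out, accepted, drafted


if __name__ == "__main__":
    layers = make_layers(8)
    prompt = [3, 1, 4]
    tokens, accepted, drafted = self_speculative_decode(
        layers, prompt, n_new=20, skip={2, 5})
    assert tokens == greedy_decode(layers, prompt, 20)
    print(f"accepted {accepted} of {drafted} drafted tokens")
```

Because every emitted token is the one the full model would have chosen, the output matches plain greedy decoding regardless of which layers the draft skips. The skip set only changes how many draft tokens are accepted, which is what determines the speedup in practice.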

Code & Models

Repositories

hemingkx/SWIFT (PyTorch, official)

Taxonomy

Topics: Advanced Data Storage Technologies · Magnetic confinement fusion research · Network Packet Processing and Optimization