Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Xianzhen Luo; Yixuan Wang; Qingfu Zhu; Zhiming Zhang; Xuanyu Zhang; Qing Yang; Dongliang Xu

arXiv:2408.08696 · cs.CL · May 27, 2025

TL;DR

This paper introduces Token Recycling, a novel method that accelerates large language model inference by reusing candidate tokens with minimal storage, achieving around 2x speedup and outperforming existing train-free techniques.

Contribution

Token Recycling is a new, storage-efficient approach that leverages candidate token reoccurrence to significantly speed up LLM inference without additional training.

Findings

01. Achieves approximately 2x inference speedup across various LLM sizes.

02. Requires less than 2MB of additional storage.

03. Outperforms existing train-free methods by 30% and a widely recognized training-based method by 25%.

Abstract

Massive parameters of LLMs have made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based training-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. It stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding…
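The storage and drafting steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes an adjacency matrix with one row per vocabulary token and k slots per row, and a plain breadth-first expansion with depth and node limits. The names `new_matrix`, `recycle`, `build_draft_tree`, and `tree_attention_mask`, the `EMPTY` marker, and all sizes are hypothetical choices made for illustration.

```python
from collections import deque

EMPTY = -1  # marker for an unused slot (an assumption of this sketch)


def new_matrix(vocab_size, k):
    """Adjacency matrix: row t holds up to k candidate successors of token t."""
    return [[EMPTY] * k for _ in range(vocab_size)]


def recycle(adj, token, candidates):
    """Overwrite row `token` with the newest candidate tokens seen after it."""
    k = len(adj[token])
    top = list(candidates[:k])
    adj[token] = top + [EMPTY] * (k - len(top))


def build_draft_tree(adj, root_token, max_depth=2, max_nodes=16):
    """Expand the matrix breadth-first from the last accepted token.

    Returns (tokens, parents), where parents[i] is the index of node i's
    parent in the tree and -1 marks the root.
    """
    tokens, parents, depths = [root_token], [-1], [0]
    queue = deque([0])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue
        for cand in adj[tokens[node]]:
            if len(tokens) == max_nodes:
                return tokens, parents
            if cand == EMPTY:
                continue
            tokens.append(cand)
            parents.append(node)
            depths.append(depths[node] + 1)
            queue.append(len(tokens) - 1)
    return tokens, parents


def tree_attention_mask(parents):
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# Toy usage with a 10-token vocabulary and k = 2 (both values are made up).
adj = new_matrix(vocab_size=10, k=2)
recycle(adj, 1, [2, 3])
recycle(adj, 2, [4, 5])
recycle(adj, 3, [6])
tokens, parents = build_draft_tree(adj, root_token=1)
mask = tree_attention_mask(parents)
```

In this sketch the mask lets each draft node attend only to its own path back to the root, which is the usual way a whole tree of drafts is checked in a single forward pass. How the matrix is actually populated and which tree shape is used are details the truncated abstract above does not give.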

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling