HAMburger: Accelerating LLM Inference via Token Smashing

Jingyu Liu; Ce Zhang

arXiv:2505.20438·cs.CL·May 28, 2025

Open Access

TL;DR

HAMburger is a hierarchical approach to LLM inference that generates several tokens per step, stores them in a single KV cache entry, and trusts its self-drafted tokens. The reported gains are up to a 2× reduction in KV cache computation and up to a 2× increase in throughput.

Contribution

The paper presents HAMburger, a hierarchical model that moves beyond uniform per-token computation and storage in LLM inference, generating multiple tokens per step and reducing computation and memory costs.

Findings

01. KV cache growth is sub-linear with output length.

02. Up to 2× reduction in KV cache computation.

03. Up to 2× increase in throughput (TPS).

Abstract

The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache. This can be sub-optimal because we found that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. Additionally, HAMburger…
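The resource argument in the abstract can be illustrated with a toy sketch in Python. This is not the paper's method: it assumes a hypothetical per-token confidence score and a hypothetical greedy rule in which a token joins the current step only when its confidence clears a threshold and the step is not yet full. The names `group_into_steps`, `kv_reduction`, `threshold`, and `max_tokens_per_step` are illustrative and do not come from the paper. The sketch only counts steps, treating each step as one base-model forward pass and one KV cache entry.

```python
from typing import List


def group_into_steps(
    confidences: List[float],
    threshold: float = 0.9,          # hypothetical confidence cutoff
    max_tokens_per_step: int = 4,    # hypothetical cap on tokens per step
) -> List[int]:
    """Greedily group tokens into steps and return the size of each step.

    Assumption for illustration only: a token is appended to the open step
    when its confidence is at least `threshold` and the step has room;
    otherwise it opens a new step.
    """
    steps: List[int] = []
    current = 0  # number of tokens in the step being built
    for conf in confidences:
        needs_new_step = (
            current == 0
            or current >= max_tokens_per_step
            or conf < threshold
        )
        if needs_new_step:
            if current > 0:
                steps.append(current)  # close the previous step
            current = 1
        else:
            current += 1
    if current > 0:
        steps.append(current)  # close the final step
    return steps


def kv_reduction(confidences: List[float], **kwargs) -> float:
    """Ratio of KV entries with one entry per token to one entry per step."""
    steps = group_into_steps(confidences, **kwargs)
    if not steps:
        return 1.0  # nothing generated, so nothing saved
    return len(confidences) / len(steps)


# Made-up confidence scores for nine generated tokens.
example = [0.5, 0.95, 0.97, 0.4, 0.99, 0.99, 0.99, 0.99, 0.3]
example_steps = group_into_steps(example)
example_ratio = kv_reduction(example)
```

With these made-up scores, the nine tokens fall into four steps of sizes 3, 4, 1, and 1, so the toy cache holds four entries instead of nine. The actual savings in the paper depend on how the model itself decides how many tokens to emit per step, which this sketch does not model.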

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Hamburger · Balanced Selection · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings