HAMburger: Accelerating LLM Inference via Token Smashing
Jingyu Liu, Ce Zhang

TL;DR
HAMburger is a hierarchical approach to LLM inference that generates multiple tokens per step, stores them in a single KV cache entry, and trusts its self-drafted tokens. The paper reports up to a 2× reduction in KV cache computation and up to 2× higher throughput.
Contribution
The paper presents HAMburger, a hierarchical model that redefines resource allocation in LLM inference, enabling multi-token generation per step and reducing computation and memory costs.
Findings
The KV cache grows sub-linearly with output length.
Up to 2× reduction in KV cache computation.
Up to 2× increase in throughput (TPS).
Abstract
The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache. This can be sub-optimal because we found that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. Additionally, HAMburger…
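The accounting behind "one forward pass and one KV cache per token" versus several tokens per step can be illustrated with a toy calculation. The sketch below is not the authors' implementation: the function name `decoding_cost` and the fixed `tokens_per_step` ratio are assumptions made purely for illustration. According to the abstract, the real model decides how much to pack into each KV entry, so a constant ratio is a simplification, and the cost of the micro-step decoder and the prompt are ignored.

```python
import math


def decoding_cost(num_output_tokens: int, tokens_per_step: int = 1) -> dict:
    """Count base-model forward passes and KV cache entries for the output.

    Hypothetical simplification: every step emits exactly `tokens_per_step`
    tokens (the last step may emit fewer) and writes one KV cache entry.
    """
    if num_output_tokens < 0 or tokens_per_step < 1:
        raise ValueError("need num_output_tokens >= 0 and tokens_per_step >= 1")
    steps = math.ceil(num_output_tokens / tokens_per_step)
    # One forward pass and one KV entry per step, regardless of how many
    # tokens that step produced.
    return {"forward_passes": steps, "kv_entries": steps}


# Standard decoding: one token per step.
baseline = decoding_cost(1000, tokens_per_step=1)
# Assumed average of two tokens packed into each step.
smashed = decoding_cost(1000, tokens_per_step=2)
print(baseline, smashed)
```

Under this assumed ratio of two tokens per step, both counts are halved, which matches the "up to 2×" figures quoted in the findings only as an upper-bound illustration, not as a reproduction of the paper's measurements.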
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Hamburger · Balanced Selection · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
