Glinthawk: A Two-Tiered Architecture for Offline LLM Inference
Pouya Hamadanian, Sadjad Fouladi

TL;DR
Glinthawk is a two-tiered architecture for offline LLM inference. It raises throughput and lowers cost by offloading attention to a lower-end compute tier, which enables larger batch sizes and better utilization of the high-end accelerators.
Contribution
The paper proposes a two-tiered architecture for offline LLM inference that separates attention computation, and the key-value cache it depends on, from the model weights, so that the memory demand of each can scale independently.
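The separation described above can be pictured as two workers: one holding only weights, the other holding only the key-value cache. The following is a minimal single-head sketch under stated assumptions, not the authors' implementation. The class names `Tier1WeightWorker` and `Tier2AttentionWorker`, the function `decode_step`, and the use of plain Python lists are all hypothetical; a real system would run these on separate machines with a network hop between them, and would also include the output projection and feed-forward layers omitted here.

```python
import math


class Tier1WeightWorker:
    """Hypothetical Tier 1 worker: holds projection weights, no per-sequence state."""

    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv

    @staticmethod
    def _matvec(w, x):
        # Plain matrix-vector product over nested lists.
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

    def project(self, x):
        # Produce the query, key, and value vectors for one token.
        return (self._matvec(self.wq, x),
                self._matvec(self.wk, x),
                self._matvec(self.wv, x))


class Tier2AttentionWorker:
    """Hypothetical Tier 2 worker: holds the key-value cache, no model weights."""

    def __init__(self):
        self.keys = {}    # seq_id -> list of cached key vectors
        self.values = {}  # seq_id -> list of cached value vectors

    def attend(self, seq_id, q, k, v):
        # Append the new key/value to this sequence's cache.
        self.keys.setdefault(seq_id, []).append(k)
        self.values.setdefault(seq_id, []).append(v)
        ks, vs = self.keys[seq_id], self.values[seq_id]

        # Scaled dot-product scores against every cached key.
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in ks]

        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]

        # Weighted sum of the cached value vectors.
        return [sum(w * val[d] for w, val in zip(weights, vs))
                for d in range(len(q))]


def decode_step(tier1, tier2, seq_id, x):
    q, k, v = tier1.project(x)            # runs where the weights live
    return tier2.attend(seq_id, q, k, v)  # runs where the KV cache lives


# Toy usage with 2x2 identity projections (illustrative values only).
identity = [[1, 0], [0, 1]]
tier1 = Tier1WeightWorker(identity, identity, identity)
tier2 = Tier2AttentionWorker()
out1 = decode_step(tier1, tier2, "seq-0", [1, 0])
out2 = decode_step(tier1, tier2, "seq-0", [0, 1])
```

In this sketch only the small per-token vectors `q`, `k`, `v` and the attention output cross the tier boundary, while the cache that grows with sequence length stays entirely on the Tier 2 side.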
Findings
5.9x throughput improvement over paged attention baselines
2.8x reduction in generation cost over the same baselines
16.3x throughput improvement for long sequences, at lower cost
Abstract
We introduce Glinthawk, an architecture for offline Large Language Model (LLM) inference. By leveraging a two-tiered structure, Glinthawk optimizes the utilization of the high-end accelerators ("Tier 1") by offloading the attention mechanism to a lower-end compute tier ("Tier 2"). This separation allows the memory demand of the attention, known as the key-value cache, to scale independently from the model weights, enabling larger batch sizes and more efficient accelerator usage. Prototyped with NVIDIA T4 GPUs and standard CPU VMs, Glinthawk improves throughput by 5.9x and reduces cost of generation by 2.8x, compared to paged attention baselines. For long sequence lengths, it achieves a 16.3x throughput improvement at lower cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it…
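To make the batch-size argument concrete, here is a rough back-of-the-envelope sketch of how the key-value cache competes with model weights for accelerator memory. Every number in it is an assumption chosen for illustration: the `ModelConfig` values describe a hypothetical 7B-parameter decoder in fp16, the 16 GiB figure approximates a T4-class card, and the 256 GiB of Tier 2 memory is an arbitrary aggregate. None of these are the configurations evaluated in the paper, and activation memory is ignored.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelConfig:
    # Hypothetical 7B-class decoder; illustrative values, not from the paper.
    n_layers: int = 32
    hidden_dim: int = 4096
    n_params: int = 7_000_000_000
    bytes_per_value: int = 2  # fp16


def kv_bytes_per_token(cfg: ModelConfig) -> int:
    # One key vector and one value vector of size hidden_dim per layer.
    return 2 * cfg.n_layers * cfg.hidden_dim * cfg.bytes_per_value


def max_batch_single_tier(cfg: ModelConfig, gpu_bytes: int, seq_len: int) -> int:
    # Weights and KV cache share the same accelerator memory.
    weight_bytes = cfg.n_params * cfg.bytes_per_value
    free_bytes = max(gpu_bytes - weight_bytes, 0)
    return free_bytes // (kv_bytes_per_token(cfg) * seq_len)


def max_batch_two_tier(cfg: ModelConfig, tier2_bytes: int, seq_len: int) -> int:
    # The KV cache lives entirely in Tier 2 memory, so the batch size is
    # bounded by Tier 2 capacity rather than by what the weights leave free.
    return tier2_bytes // (kv_bytes_per_token(cfg) * seq_len)


cfg = ModelConfig()
per_token = kv_bytes_per_token(cfg)                          # 524288 bytes (0.5 MiB)
single = max_batch_single_tier(cfg, 16 * GIB, seq_len=2048)  # 2 sequences
two_tier = max_batch_two_tier(cfg, 256 * GIB, seq_len=2048)  # 256 sequences
```

Under these assumptions each 2048-token sequence needs 1 GiB of cache, so a single 16 GiB accelerator that also stores the weights fits only two sequences, while the two-tier batch size grows with however much Tier 2 memory is added.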
Taxonomy
Topics: Natural Language Processing Techniques · Power Transformer Diagnostics and Insulation
Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
