Glinthawk: A Two-Tiered Architecture for Offline LLM Inference

Pouya Hamadanian; Sadjad Fouladi

arXiv:2501.11779·cs.LG·February 12, 2025

TL;DR

Glinthawk is a two-tiered architecture for offline LLM inference. It raises throughput and lowers cost by offloading attention to a lower-end compute tier, which enables larger batch sizes and better utilization of the high-end accelerators.

Contribution

It proposes a two-tiered architecture for offline LLM inference that moves attention computation, together with the key-value cache it depends on, off the accelerators that hold the model weights, so that the two memory demands can scale independently.

Findings

01  5.9x throughput improvement over paged attention baselines

02  2.8x reduction in generation cost

03  16.3x throughput improvement for long sequences, at 2.4x lower cost
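
One way to read the first two findings together is sketched below. It assumes that "generation cost" means cost per generated token, i.e. hourly cluster cost divided by throughput; the page does not state how the paper defines cost, so the function name and the resulting ratio are illustrative only, not a result reported by the authors.

```python
def implied_hourly_cost_ratio(throughput_gain: float, cost_reduction: float) -> float:
    """Ratio of hourly cluster cost (new / baseline) implied by the two gains.

    Assumption (not from the paper): cost_per_token = hourly_cost / throughput.
    Then cost_reduction = throughput_gain / hourly_cost_ratio, which rearranges
    to the expression below.
    """
    return throughput_gain / cost_reduction


# Headline numbers from the findings: 5.9x throughput, 2.8x lower cost.
overall = implied_hourly_cost_ratio(5.9, 2.8)
print(f"implied hourly cost ratio: {overall:.2f}x")
```

Under that assumption, a ratio above 1 would mean the two-tier deployment spends more per hour than the baseline but generates tokens fast enough to come out cheaper per token.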

Abstract

We introduce Glinthawk, an architecture for offline Large Language Model (LLM) inference. By leveraging a two-tiered structure, Glinthawk optimizes the utilization of the high-end accelerators ("Tier 1") by offloading the attention mechanism to a lower-end compute tier ("Tier 2"). This separation allows the memory demand of attention, known as the key-value cache, to scale independently from the model weights, enabling larger batch sizes and more efficient accelerator usage. Prototyped with NVIDIA T4 GPUs and standard CPU VMs, Glinthawk improves throughput by $5.9\times$ and reduces cost of generation by $2.8\times$ compared to paged attention baselines. For long sequence lengths, it achieves a $16.3\times$ throughput improvement at $2.4\times$ less cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it…
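
The batch-size argument in the abstract can be made concrete with a rough sizing sketch. Everything numeric here is an assumption chosen for illustration: the model shape (`ModelShape`), the 13 GiB weight footprint, and the Tier 2 memory pool are hypothetical and are not taken from the paper. Only the 16 GB accelerator size corresponds to the T4 named in the abstract, and it is treated as 16 GiB for simplicity.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions (not taken from the paper)."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # 16-bit keys and values


def kv_bytes_per_token(m: ModelShape) -> int:
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_value


def max_batch_size(kv_memory_bytes: int, seq_len: int, m: ModelShape) -> int:
    # Number of full-length sequences whose KV cache fits in the given memory.
    return kv_memory_bytes // (kv_bytes_per_token(m) * seq_len)


shape = ModelShape(n_layers=32, n_kv_heads=32, head_dim=128)
seq_len = 2048

# Single tier: weights and KV cache compete for one 16 GiB accelerator.
weights_bytes = 13 * GIB  # assumed weight footprint
single_tier = max_batch_size(16 * GIB - weights_bytes, seq_len, shape)

# Two tiers: the KV cache lives in Tier 2 host memory instead
# (assumed: 4 CPU nodes with 64 GiB each).
two_tier = max_batch_size(4 * 64 * GIB, seq_len, shape)

print(f"KV cache per token: {kv_bytes_per_token(shape) / 1024:.0f} KiB")
print(f"max batch, single tier: {single_tier}")
print(f"max batch, two tiers:   {two_tier}")
```

With these assumed numbers the single-tier batch is capped by whatever memory the weights leave free, while the two-tier batch grows with the amount of Tier 2 memory provisioned. This is the independence the abstract describes. The sketch ignores activation memory, fragmentation, and the network transfer between tiers.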

Code & Models

Repositories

microsoft/glinthawk (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Power Transformer Diagnostics and Insulation

Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer