Block Transformer: Global-to-Local Language Modeling for Fast Inference

Namgyu Ho; Sangmin Bae; Taehyeon Kim; Hyunjik Jo; Yireun Kim; Tal Schuster; Adam Fisch; James Thorne; Se-Young Yun

arXiv:2406.02657 · cs.CL · November 4, 2024 · 1 citation

TL;DR

The paper introduces the Block Transformer, a hierarchical global-to-local language model that accelerates inference by reducing the memory bottlenecks of self-attention while maintaining performance comparable to standard transformers.

Contribution

The paper proposes a global-to-local attention architecture that mitigates inference bottlenecks in autoregressive transformers, enabling 10-20x higher inference throughput at equivalent perplexity and zero-shot performance.

Findings

01. Achieves 10-20x inference throughput increase
02. Maintains equivalent perplexity and zero-shot performance
03. Demonstrates effective global-to-local modeling approach

Abstract

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of all previous sequences to be retrieved from memory at every decoding step to retrieve context information, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the information of the entire prompt must first be processed to prefill the KV cache. Second, computation of subsequent tokens is bottlenecked by the high memory I/O demand of fetching the entire KV cache, which grows linearly with sequence length, incurring quadratic memory reads overall. We design the Block Transformer to strategically mitigate these costs, by incorporating coarsity and locality into an integrated…
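The second bottleneck in the abstract (per-step KV cache reads that grow linearly, giving quadratic reads overall) can be checked with simple arithmetic. The sketch below is only an illustration of that counting argument, not code from the paper: the function names are hypothetical, and it assumes that each decoding step reads one cached entry per previous position (prompt tokens plus tokens generated so far), ignoring layers, heads, and batch size.

```python
def kv_reads_at_step(prompt_len: int, step: int) -> int:
    """Cached positions read when decoding the step-th new token (1-indexed).

    Assumption: the cache holds the prompt plus the step - 1 tokens
    generated so far, and every cached position is read once.
    """
    return prompt_len + step - 1


def total_kv_reads(prompt_len: int, new_tokens: int) -> int:
    """Sum of per-step reads over the whole generation."""
    return sum(kv_reads_at_step(prompt_len, s) for s in range(1, new_tokens + 1))


def total_kv_reads_closed_form(prompt_len: int, new_tokens: int) -> int:
    """Same total via the arithmetic series: n*p + n*(n-1)/2."""
    return new_tokens * prompt_len + new_tokens * (new_tokens - 1) // 2


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(n, total_kv_reads(prompt_len=0, new_tokens=n))
```

Under these assumptions, doubling the number of generated tokens roughly quadruples the total number of cache reads, which is the quadratic growth the abstract refers to.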

Code & Models

Repositories

itsnamgyu/block-transformer (PyTorch, official)

Videos

Block Transformer: Global-to-Local Language Modeling for Fast Inference (SlidesLive)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention