Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation

Junyoung Park; Myeonggu Kang; Yunki Han; Yanggon Kim; Jaekang Shin; Lee-Sup Kim

arXiv:2407.15131·cs.AR·July 23, 2024

Open Access

TL;DR

Token-Picker introduces a probability estimation method that prunes low-attention tokens during text generation, reducing off-chip memory accesses and accelerating attention without fine-tuning.

Contribution

It proposes a novel probability estimation technique for token pruning and a hardware design to minimize off-chip memory access in text generation models.

Findings

01

12.1x token pruning ratio without fine-tuning

02

2.6x reduction in memory accesses

03

2.3x average speedup and 2.4x improvement in energy efficiency

Abstract

The attention mechanism in text generation is memory-bound due to its sequential characteristics. Therefore, off-chip memory accesses should be minimized for faster execution. Although previous methods addressed this by pruning unimportant tokens, they fall short in selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates the probability before the softmax function, effectively removing low-probability tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach reduces memory accesses by 2.6x, leading to an average 2.3x speedup and a 2.4x improvement in energy efficiency.
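
The idea of deciding, before the softmax is complete, that a token's probability is too small to matter can be illustrated with a small sketch. This is not the paper's estimation algorithm or its hardware design, whose details are not given on this page. It is a generic conservative bound under stated assumptions: attention scores are already available, tokens are visited newest first, and the names `select_tokens`, `log_add_exp`, and `threshold` are hypothetical. Because a partial sum of exponentials can only underestimate the full softmax denominator, `exp(score) / partial_sum` is an upper bound on the true probability, so a token whose bound falls below the threshold can be skipped without ever dropping a token whose true probability reaches the threshold.

```python
import math


def log_add_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def select_tokens(scores, threshold):
    """Return indices of tokens whose value vectors would still be fetched.

    Hypothetical helper for illustration only. A token is pruned when an
    upper bound on its softmax probability is below `threshold`.
    """
    log_threshold = math.log(threshold)
    log_partial_sum = -math.inf  # log of the denominator seen so far
    kept = []
    # Assumption: visit the newest token first, since recent tokens often
    # carry large scores and tighten the bound early.
    for i in reversed(range(len(scores))):
        log_partial_sum = log_add_exp(log_partial_sum, scores[i])
        # partial_sum <= full denominator, so this bounds log(p_i) from above.
        log_upper_bound = scores[i] - log_partial_sum
        if log_upper_bound >= log_threshold:
            kept.append(i)
    kept.reverse()
    return kept


def softmax(scores):
    """Reference softmax, used only to check the pruning decision."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.0, -8.0, 1.0, -9.0, 5.0, 6.0]
print(select_tokens(scores, 0.01))  # [4, 5]
```

In this toy input, four of the six tokens are skipped, and each of them has a true softmax probability below 0.01. The bound is conservative in the other direction as well: a token visited early may be kept even though its final probability turns out to be below the threshold.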

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques

Methods: Attention Is All You Need · Softmax · Pruning