Self-Selected Attention Span for Accelerating Large Language Model Inference

Tian Jin; Wanzin Yazar; Zifei Xu; Sayeh Sharify; Xin Wang

arXiv:2404.09336 · cs.CL · April 16, 2024 · 1 citation

Open Access

TL;DR

This paper fine-tunes LLMs to identify the minimal attention span each generation step needs. Converting those spans into sparse attention masks during inference yields a 28% increase in throughput, demonstrated on arithmetic expression evaluation and news summarization.

Contribution

The paper fine-tunes an LLM both to solve the task and to determine the minimal attention span required at each step, so that sparse attention masks can be built on the fly during inference.

Findings

01 28% increase in inference throughput

02 The fine-tuned model selects its own minimal attention spans

03 Demonstrated on arithmetic expression evaluation and news summarization

Abstract

Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate with two specific tasks: (a) evaluating complex arithmetic expressions and (b) summarizing news articles. For both tasks, we create custom datasets to fine-tune an LLM. The goal of fine-tuning is twofold: first, to make the LLM learn to solve the evaluation or summarization task, and second, to train it to identify the minimal attention spans required for each step of the task. As a result, the fine-tuned model is able to convert these self-identified minimal attention spans into sparse attention masks on-the-fly during…
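As a rough illustration of the step the abstract describes, turning self-identified attention spans into a sparse attention mask, the sketch below builds a boolean causal mask in plain Python. The span representation (half-open `(start, end)` ranges per token), the rule that each token always attends to itself, and the names `spans_to_mask` and `mask_density` are assumptions made for this example; they are not taken from the paper, which may encode spans differently.

```python
from typing import List, Tuple

# Assumption: a span is a half-open [start, end) range of token positions.
Span = Tuple[int, int]


def spans_to_mask(spans_per_token: List[List[Span]]) -> List[List[bool]]:
    """Build a causal sparse attention mask from per-token attention spans.

    mask[q][k] is True when query position q may attend to key position k.
    """
    n = len(spans_per_token)
    mask = [[False] * n for _ in range(n)]
    for q, spans in enumerate(spans_per_token):
        mask[q][q] = True  # assumption: every token attends to itself
        for start, end in spans:
            # Clamp to the causal range so no token attends to the future.
            for k in range(max(0, start), min(end, q + 1)):
                mask[q][k] = True
    return mask


def mask_density(mask: List[List[bool]]) -> float:
    """Fraction of causal (lower-triangular) entries the mask keeps."""
    n = len(mask)
    kept = sum(mask[q][k] for q in range(n) for k in range(q + 1))
    return kept / (n * (n + 1) // 2)


# Hypothetical spans for a 5-token sequence: later tokens only need
# a short window of earlier tokens instead of the full prefix.
example_spans = [[], [(0, 1)], [(0, 2)], [(2, 3)], [(3, 4)]]
example_mask = spans_to_mask(example_spans)
print(mask_density(example_mask))
```

A density below 1.0 means some query positions skip part of their prefix. A kernel that exploits such a mask has fewer keys to process per generated token, which is the kind of saving the reported throughput gain relies on.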

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis