TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Lijie Yang; Zhihao Zhang; Zhuofu Chen; Zikun Li; Zhihao Jia

arXiv:2410.05076·cs.LG·October 8, 2024

TL;DR

TidalDecode introduces position persistent sparse attention, which exploits the spatial coherence of token selection across consecutive Transformer layers and applies full attention only selectively. It reduces LLM decoding latency by up to 2.1x while keeping generation quality comparable to full attention.

Contribution

The paper proposes position persistent sparse attention, an algorithm that combines sparse attention layers with full attention layers to make LLM decoding faster while keeping generation accurate.

Findings

01 Reduces LLM decoding latency by up to 2.1x

02 Maintains high-quality generation comparable to full attention

03 Effectively leverages spatial coherence in token selection

Abstract

Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention.…
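
To make the idea concrete, here is a minimal sketch of one decoding step. It is not the authors' implementation. It assumes one plausible reading of position persistent sparse attention: designated selection layers run full attention over the KV cache and keep the top-k highest-scoring positions, and the layers that follow attend only to those same positions. The names `attend`, `decode_step`, `selection_layers`, and `top_k` are hypothetical, and the per-layer queries are supplied directly rather than computed by a real model.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    # Single-head scaled dot-product attention of one query over the
    # given cache positions. Returns the output vector and raw scores.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[p]))
              for p in positions]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * values[p][d] for w, p in zip(weights, positions))
           for d in range(dim)]
    return out, scores


def decode_step(queries, kv_cache, selection_layers, top_k):
    # Assumed scheme (illustrative only):
    #   - a selection layer runs full attention and records the top_k positions;
    #   - layers before the first selection layer run plain full attention;
    #   - every other layer reuses the most recently selected positions.
    selected = None
    outputs, used = [], []
    for layer, query in enumerate(queries):
        keys, values = kv_cache[layer]
        all_positions = list(range(len(keys)))
        if layer in selection_layers:
            out, scores = attend(query, keys, values, all_positions)
            ranked = sorted(all_positions, key=lambda p: scores[p], reverse=True)
            selected = sorted(ranked[:top_k])
            used.append(all_positions)
        elif selected is None:
            out, _ = attend(query, keys, values, all_positions)
            used.append(all_positions)
        else:
            out, _ = attend(query, keys, values, selected)
            used.append(selected)
        outputs.append(out)
    return outputs, used


# Toy example: 4 layers sharing the same 6 cached tokens (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0], [-1.0, 0.0], [3.0, 0.0]]
values = [[float(i), 1.0] for i in range(6)]
kv_cache = [(keys, values)] * 4
queries = [[1.0, 0.0]] * 4

outputs, used = decode_step(queries, kv_cache, selection_layers={1}, top_k=2)
print(used[2], used[3])  # sparse layers reuse positions [2, 5]
```

In this sketch the sparse layers never score the full cache: they read only the positions chosen by the last selection layer, which is where the token-selection overhead mentioned in the abstract would be saved.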

Code & Models

Repositories

DerrickYLJ/TidalDecode (PyTorch, official)

Taxonomy

Topics: Algorithms and Data Compression · Blind Source Separation Techniques

Methods: Attention Is All You Need · Sparse Evolutionary Training · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding