Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern

Hongyin Tang; Di Xiu; Lanrui Wang; Xiurui Geng; Jingang Wang; Xunliang Cai

arXiv:2412.04757 · cs.CL · December 9, 2024

Open Access

TL;DR

Ltri-LLM introduces a training-free, streaming inference method for LLMs that leverages local attention patterns to approximate full attention efficiently, enabling long context processing with high accuracy.

Contribution

The paper proposes Ltri-LLM, a novel framework that uses a dynamic triangular attention pattern and offline KV indexing to improve long context inference without additional training.

Findings

01. Achieves near full attention performance on long text benchmarks.

02. Maintains efficient streaming inference with reduced computational complexity.

03. Leverages local attention correlations for effective context chunking.

Abstract

The quadratic computational complexity of the attention mechanism in current Large Language Models (LLMs) renders inference with long contexts prohibitively expensive. To address this challenge, various approaches retain the critical portions of the context to approximate Full Attention (FA) through Key-Value (KV) compression or Sparse Attention (SA), enabling the processing of virtually unlimited text lengths in a streaming manner. However, these methods struggle to achieve performance comparable to FA, particularly in retrieval tasks. In this paper, our analysis of attention head patterns reveals that LLMs' attention distributions show strong local correlations, naturally reflecting a chunking mechanism for the input context. We propose the Ltri-LLM framework, which divides KVs into spans, stores them in an offline index, and retrieves the relevant KVs into memory for…
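The span-index-retrieve idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated and does not say how spans are delimited or scored, so the sketch assumes fixed-length spans, a mean-key representative per span, and dot-product scoring. The names `build_span_index`, `retrieve_positions`, `span_len`, and `top_k` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
SpanEntry = Tuple[int, int, List[float]]  # (start, end, representative key)


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_span_index(keys: List[Vector], span_len: int) -> List[SpanEntry]:
    """Split cached keys into spans and store one representative per span.

    ASSUMPTION: fixed-length spans and a mean-key representative. The paper
    derives spans from attention patterns; that procedure is not shown here.
    """
    index: List[SpanEntry] = []
    for start in range(0, len(keys), span_len):
        end = min(start + span_len, len(keys))
        span = keys[start:end]
        dim = len(span[0])
        mean = [sum(k[d] for k in span) / len(span) for d in range(dim)]
        index.append((start, end, mean))
    return index


def retrieve_positions(query: Vector, index: List[SpanEntry], top_k: int) -> List[int]:
    """Return token positions of the top_k spans most similar to the query.

    ASSUMPTION: similarity is the dot product between the query and the
    span representative; positions are returned in original order.
    """
    ranked = sorted(index, key=lambda entry: dot(query, entry[2]), reverse=True)
    chosen = sorted(ranked[:top_k], key=lambda entry: entry[0])
    return [pos for start, end, _ in chosen for pos in range(start, end)]


# Six cached keys forming three spans of two tokens each.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
index = build_span_index(keys, span_len=2)
selected = retrieve_positions([0.0, 1.0], index, top_k=1)
```

Only the KVs at the `selected` positions would then be loaded for attention, so the per-step cost depends on `top_k * span_len` rather than on the full context length.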

Taxonomy

Topics: Data Stream Mining Techniques · Advanced Data Compression Techniques · Music and Audio Processing

Methods: Softmax · Attention Is All You Need · Feedback Alignment