Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern
Hongyin Tang, Di Xiu, Lanrui Wang, Xiurui Geng, Jingang Wang, Xunliang Cai

TL;DR
Ltri-LLM introduces a training-free, streaming inference method for LLMs that leverages local attention patterns to approximate full attention efficiently, enabling long context processing with high accuracy.
Contribution
The paper proposes Ltri-LLM, a novel framework that uses a dynamic triangular attention pattern and offline KV indexing to improve long context inference without additional training.
Findings
Achieves performance close to Full Attention on long-text benchmarks.
Maintains efficient streaming inference with reduced computational complexity.
Leverages local attention correlations for effective context chunking.
Abstract
The quadratic computational complexity of the attention mechanism in current Large Language Models (LLMs) renders inference with long contexts prohibitively expensive. To address this challenge, various approaches aim to retain critical portions of the context to optimally approximate Full Attention (FA) through Key-Value (KV) compression or Sparse Attention (SA), enabling the processing of virtually unlimited text lengths in a streaming manner. However, these methods struggle to achieve performance levels comparable to FA, particularly in retrieval tasks. In this paper, our analysis of attention head patterns reveals that LLMs' attention distributions show strong local correlations, naturally reflecting a chunking mechanism for input context. We propose the Ltri-LLM framework, which divides KVs into spans, stores them in an offline index, and retrieves the relevant KVs into memory for…
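The span, index, and retrieve steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the fixed span length, the mean-key summary used as the index entry, the dot-product ranking, and every name below (Span, build_span_index, retrieve_spans, attend) are assumptions made for illustration. In particular, the abstract suggests that span boundaries come from attention patterns rather than a fixed length, and the "offline index" here is just an in-memory list.

```python
import math
from dataclasses import dataclass


@dataclass
class Span:
    start: int      # position of the first token in the span
    keys: list      # key vectors of the span
    values: list    # value vectors of the span
    summary: list   # mean key vector (assumed index entry, not from the paper)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_span_index(keys, values, span_len):
    """Split KVs into fixed-length spans (an assumption) and summarize each."""
    index = []
    for start in range(0, len(keys), span_len):
        k = keys[start:start + span_len]
        v = values[start:start + span_len]
        index.append(Span(start, k, v, mean_vector(k)))
    return index


def retrieve_spans(index, query, top_k):
    """Pick the top_k spans whose summary scores highest against the query."""
    ranked = sorted(index, key=lambda s: dot(query, s.summary), reverse=True)
    # Restore positional order so the retrieved KVs stay in sequence.
    return sorted(ranked[:top_k], key=lambda s: s.start)


def attend(query, spans):
    """Scaled dot-product attention over the retrieved spans only."""
    keys = [k for s in spans for k in s.keys]
    values = [v for s in spans for v in s.values]
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


# Toy usage: two spans of four tokens each; the query matches the second span.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
values = [[10.0]] * 4 + [[20.0]] * 4
index = build_span_index(keys, values, span_len=4)
selected = retrieve_spans(index, query=[0.0, 1.0], top_k=1)
output = attend([0.0, 1.0], selected)
```

In this toy setup only one of the two spans is loaded for attention, so the attention cost depends on the number of retrieved tokens rather than on the full context length.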
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Data Compression Techniques · Music and Audio Processing
Methods: Softmax · Attention Is All You Need · Feedback Alignment
