Efficient Low Rank Attention for Long-Context Inference in Large Language Models

Tenghui Li; Guoxu Zhou; Xuyang Zhao; Yuning Qiu; Qibin Zhao

arXiv:2510.23649 · cs.LG · December 24, 2025

TL;DR

This paper introduces LRQK, a low-rank attention method that reduces memory and computational costs for long-context inference in large language models, while maintaining high accuracy.

Contribution

LRQK is a two-stage low-rank decomposition framework that computes attention with less GPU memory and less CPU-GPU data transfer, matching or outperforming existing sparse attention methods.

Findings

01

LRQK achieves comparable or better accuracy than sparse attention methods.

02

LRQK significantly reduces memory usage during long-context inference.

03

LRQK attends over full-precision KV pairs for the selected tokens, so attention on those tokens is exact and the overall accuracy loss is minimal.

Abstract

As the length of input text increases, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. This work introduces Low Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes full-precision query and key matrices into compact rank-\(r\) factors during the prefill stage, and then employs these low-dimensional projections to compute proxy attention scores in \(\mathcal{O}(lr)\) time at each decode step. By selecting only the top-\(k\) tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism where only missing full-precision KV pairs are transferred, thereby…
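
The decode step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`proxy_scores`, `select_tokens`, `split_hits_misses`), the choice to take the top-\(k\) only among non-recent tokens, and the representation of the GPU cache as a set of token indices are all assumptions made for illustration. The rank-\(r\) factors are taken as given, since the prefill-stage decomposition is not detailed here.

```python
from heapq import nlargest

def proxy_scores(q_r, K_r):
    """Proxy attention logits from rank-r factors.

    q_r: length-r projection of the current query (assumed given).
    K_r: l rows, each a length-r projection of one cached key (assumed given).
    Cost is one r-dimensional dot product per cached token, i.e. O(l * r).
    """
    return [sum(qi * ki for qi, ki in zip(q_r, k_row)) for k_row in K_r]

def select_tokens(scores, top_k, n_recent):
    """Pick the n_recent newest tokens plus the top_k best-scoring older ones.

    Restricting the top-k search to older tokens is an assumption of this
    sketch; it simply avoids selecting the same token twice.
    """
    l = len(scores)
    first_recent = max(0, l - n_recent)
    recent = range(first_recent, l)
    top = nlargest(top_k, range(first_recent), key=lambda i: scores[i])
    return sorted(set(top) | set(recent))

def split_hits_misses(selected, gpu_resident):
    """Split selected tokens into cache hits and misses.

    gpu_resident: hypothetical set of token indices whose full-precision
    KV pairs are already on the GPU. Only the misses need a CPU-to-GPU copy.
    """
    hits = [i for i in selected if i in gpu_resident]
    misses = [i for i in selected if i not in gpu_resident]
    return hits, misses

# Toy example: l = 6 cached tokens, rank r = 2.
K_r = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.5], [0.5, 0.5], [1.0, 1.0]]
q_r = [1.0, 0.5]

scores = proxy_scores(q_r, K_r)
selected = select_tokens(scores, top_k=2, n_recent=2)
hits, misses = split_hits_misses(selected, gpu_resident={0, 4, 5})
print(selected, hits, misses)  # [0, 2, 4, 5] [0, 4, 5] [2]
```

In this toy run only token 2 would have to be fetched from CPU memory. Attention over the four selected tokens would then use their full-precision keys and values, while the low-rank factors serve only to rank candidates.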
