VORT: Adaptive Power-Law Memory for NLP Transformers
Nabil Mlaiki

TL;DR
VORT introduces a learnable power-law memory architecture for NLP transformers, better capturing long-range dependencies by approximating fractional kernels with sum-of-exponentials for efficient, adaptive retrieval.
Contribution
The paper proposes a memory mechanism built on fractional power-law kernels with a sum-of-exponentials (SOE) approximation, enabling adaptive, non-Markovian token retention in transformers.
Findings
VORT outperforms prior models on Zipf-distributed retrieval tasks.
The architecture effectively captures long-range dependencies in language.
Synthetic experiments demonstrate the advantage of power-law kernels over prior-matching methods.
Abstract
Standard Transformers impose near-exponential decay on the influence of distant tokens, conflicting with the power-law structure of long-range dependencies in natural language. We introduce the Variable-Order Retention Transformer (VORT), a memory architecture in which each ingested token is assigned a learnable fractional order α_i ∈ [δ, 1] that governs a Grünwald–Letnikov power-law retention kernel. Because the fractional weighted sum is non-Markovian, we approximate it through a sum-of-exponentials (SOE) decomposition computed by Gauss–Laguerre quadrature on a Laplace-type integral representation of the kernel weights. Each exponential component admits a one-step Markovian recurrence at O(S d_v) per step, where S = O(log(T/ε)) terms suffice for ε-uniform accuracy on horizon [1, T]. Retrieval is keyed and associative via a linear-attention…
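The SOE idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation, and it makes several assumptions: it uses the plain power-law kernel k(t) = t^(-α) as a stand-in for the Grünwald–Letnikov weights, a single fixed α rather than a learnable per-token α_i, and unmodified Gauss–Laguerre nodes from NumPy applied to the identity t^(-α) = (1/Γ(α)) ∫ s^(α-1) e^(-st) ds. All function names are hypothetical. The sketch makes no claim about approximation accuracy; it only shows that once the kernel is written as a sum of S exponentials, the history-dependent weighted sum can be carried by S one-step recurrences at O(S d_v) cost per step.

```python
import math
import numpy as np


def soe_coefficients(alpha, num_terms):
    """Rates r_j and weights w_j with t**(-alpha) ~ sum_j w_j * exp(-r_j * (t - 1)), t >= 1.

    Assumption: plain Gauss-Laguerre quadrature of
    (1 / Gamma(alpha)) * integral e^{-s} s^{alpha-1} e^{-s (t-1)} ds.
    """
    nodes, gl_weights = np.polynomial.laguerre.laggauss(num_terms)
    weights = gl_weights * nodes ** (alpha - 1.0) / math.gamma(alpha)
    return nodes, weights


def soe_kernel(rates, weights, t):
    """Evaluate the sum-of-exponentials kernel at lags t (array-like, t >= 1)."""
    t = np.asarray(t, dtype=float)
    return np.exp(-np.outer(t - 1.0, rates)) @ weights


def retain_recurrent(values, rates, weights):
    """Markovian form: one decaying state per exponential, O(S * d_v) per step."""
    num_steps, d_v = values.shape
    decay = np.exp(-rates)[:, None]           # shape (S, 1)
    state = np.zeros((len(rates), d_v))       # shape (S, d_v)
    out = np.empty((num_steps, d_v))
    for n in range(num_steps):
        state = decay * state + values[n]     # each component decays, then ingests
        out[n] = weights @ state              # mix the S components
    return out


def retain_direct(values, rates, weights):
    """Non-Markovian reference: explicit weighted sum over the full history."""
    num_steps = values.shape[0]
    k = soe_kernel(rates, weights, np.arange(1, num_steps + 1))  # k[l-1] is lag l
    out = np.zeros(values.shape)
    for n in range(num_steps):
        for m in range(n + 1):
            out[n] += k[n - m] * values[m]
    return out


if __name__ == "__main__":
    rates, weights = soe_coefficients(alpha=0.6, num_terms=16)
    rng = np.random.default_rng(0)
    v = rng.standard_normal((40, 3))
    gap = np.max(np.abs(retain_recurrent(v, rates, weights) - retain_direct(v, rates, weights)))
    print("max gap between recurrent and direct sums:", gap)
```

The recurrent and direct forms compute the same quantity for the SOE kernel, so their gap is floating-point noise; how closely the SOE kernel tracks the true power law depends on the quadrature, the number of terms, and the horizon, which this sketch does not tune.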
