TL;DR
Gated KalmaNet (GKA) is an efficient state-space layer, grounded in the Kalman filter, that accounts for the full past. It improves long-context recall, remains stable in low-precision settings, and outperforms existing SSM layers on both short- and long-context tasks.
Contribution
GKA retains information from the full past by computing exact Kalman-filter updates, addresses the stability and parallelization issues that arise in low-precision environments, and outperforms existing SSM layers.
Findings
GKA outperforms existing SSM layers on short- and long-context tasks.
GKA achieves over 10% relative improvement on RAG and LongQA up to 128k tokens.
GKA outperforms Mamba in ImageNet classification.
Abstract
Linear State-Space Models (SSMs) offer an efficient alternative to softmax Attention with constant memory and linear compute, but their lossy, fading summary of the past hurts recall-oriented tasks. We propose Gated KalmaNet (GKA, pronounced "gee-ka"), a layer that accounts for the full past while retaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF), and show that several existing SSM layers (DeltaNet, Gated DeltaNet, Kimi Delta Attention) are approximations to the KF recurrence under an identity error covariance assumption, which ignores how past keys and values should optimally influence state updates. In contrast, GKA maintains the full error covariance and computes the exact Kalman gain. Under a steady-state assumption that enables parallelization, this reduces to an online ridge regression with constant memory and linear compute. The standard KF…
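The reduction to online ridge regression mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's GKA layer: it omits gating, the parallel form, and any low-precision handling. The function name `online_ridge_readout`, the regularizer `lam`, and the query-based readout are assumptions made for illustration. The sketch only shows how a ridge-regression state over all past key/value pairs can be carried in constant memory, with two fixed-size accumulators that do not grow with sequence length.

```python
import numpy as np

def online_ridge_readout(keys, values, queries, lam=1.0):
    """Hypothetical illustration: causal ridge-regression readout.

    At step t the implicit state S_t minimizes
        sum_{i<=t} ||S k_i - v_i||^2 + lam * ||S||_F^2,
    whose closed form is S_t = B_t A_t^{-1} with
        A_t = lam * I + sum_{i<=t} k_i k_i^T   (d x d)
        B_t = sum_{i<=t} v_i k_i^T             (m x d).
    Only A and B are stored, so memory is independent of t.
    """
    T, d = keys.shape
    m = values.shape[1]
    A = lam * np.eye(d)      # regularized Gram matrix of past keys
    B = np.zeros((m, d))     # value-key cross-correlation
    outputs = np.empty((T, m))
    for t in range(T):
        k, v, q = keys[t], values[t], queries[t]
        A += np.outer(k, k)  # rank-1 update with the new key
        B += np.outer(v, k)  # rank-1 update with the new pair
        # o_t = S_t q = B A^{-1} q; solve instead of forming the inverse
        outputs[t] = B @ np.linalg.solve(A, q)
    return outputs
```

With a small `lam` and values that are an exact linear function of the keys, querying with a previously seen key returns approximately its stored value, which is the recall behavior the ridge objective is meant to capture. Each step costs a fixed amount of work for fixed dimensions, so total compute is linear in sequence length.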
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
