Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

Liangzu Peng; Aditya Chattopadhyay; Luca Zancato; Elvis Nunez; Wei Xia; Stefano Soatto

arXiv:2511.21016·cs.LG·May 19, 2026

TL;DR

Gated KalmaNet (GKA) is an efficient state-space layer, grounded in the Kalman Filter, that accounts for the full past. It improves long-context recall and stability in low-precision settings, and it outperforms existing SSM layers on short- and long-context tasks.

Contribution

GKA accounts for the full past by maintaining the full error covariance and computing the exact Kalman gain. It addresses stability and parallelization issues in low-precision environments and outperforms existing SSM layers.

Findings

01

GKA outperforms existing SSM layers on short- and long-context tasks.

02

GKA achieves over 10% relative improvement on RAG and LongQA up to 128k tokens.

03

GKA outperforms Mamba in ImageNet classification.

Abstract

Linear State-Space Models (SSMs) offer an efficient alternative to softmax Attention with constant memory and linear compute, but their lossy, fading summary of the past hurts recall-oriented tasks. We propose Gated KalmaNet (GKA, pronounced "gee-ka"), a layer that accounts for the full past while retaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF), and show that several existing SSM layers (DeltaNet, Gated DeltaNet, Kimi Delta Attention) are approximations to the KF recurrence under an identity error covariance assumption, which ignores how past keys and values should optimally influence state updates. In contrast, GKA maintains the full error covariance and computes the exact Kalman gain. Under a steady-state assumption that enables parallelization, this reduces to an online ridge regression with constant memory and linear compute. The standard KF…
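To make the "online ridge regression with constant memory and linear compute" in the abstract concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function name `fading_ridge_readout`, the scalar `decay` and `ridge` parameters, and the direct linear solve are all assumptions made for illustration, and the actual layer may use learned gates and a different solver. The sketch keeps two running statistics (a decayed sum of key outer products and a decayed sum of value-key outer products) and reads out each query through the regularized solution.

```python
import numpy as np

def fading_ridge_readout(keys, values, queries, decay=0.99, ridge=1e-2):
    """Exponentially weighted online ridge regression, one token at a time.

    Hypothetical helper for illustration only (not from the paper).
    keys, queries: shape (T, d_k); values: shape (T, d_v).
    Returns outputs of shape (T, d_v).
    """
    T, d_k = keys.shape
    d_v = values.shape[1]
    H = np.zeros((d_k, d_k))  # decayed sum of k k^T
    M = np.zeros((d_v, d_k))  # decayed sum of v k^T
    outputs = np.empty((T, d_v))
    for t in range(T):
        k, v, q = keys[t], values[t], queries[t]
        # Fade the old statistics, then add the current key/value pair.
        H = decay * H + np.outer(k, k)
        M = decay * M + np.outer(v, k)
        # Solve (H + ridge * I) x = q rather than forming an explicit inverse.
        x = np.linalg.solve(H + ridge * np.eye(d_k), q)
        # Equivalent to applying the ridge solution W = M (H + ridge I)^{-1} to q.
        outputs[t] = M @ x
    return outputs
```

The state (`H` and `M`) has a fixed size that does not depend on the sequence length, and each token costs a fixed amount of work, which is the sense in which memory is constant and compute is linear in this sketch. With `decay=1.0` every past pair is weighted equally; values below 1 give a fading memory.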

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques