Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise
Bo Long, Deepak Agarwal, Jelena Markovic-Voronov, Yi Wang, Liuqing Li

TL;DR
This paper introduces the Bayesian Filtering Transformer (BFT), a Transformer variant that models per-token uncertainty explicitly using Kalman filtering and kriging, and reports improved performance on recommendation and language tasks.
Contribution
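BFT recasts each Transformer component as a Bayesian filtering step: attention as precision-weighted kriging, the residual connection as a Kalman update with adaptive gain, and the FFN as a dynamics model that propagates precision. The authors state that it replaces any Transformer layer with negligible overhead and report significant empirical improvements.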
Findings
BFT improves recommendation accuracy, especially for cold-start users and rare items.
BFT enhances robustness of language models to noisy supervision and context.
BFT achieves significant gains across multiple benchmarks with minimal additional computation.
Abstract
The Transformer is the foundational building block of modern AI, yet offers no principled handling of uncertainty, which is prevalent in real applications: cold-start tokens with sparse histories in sequential recommendation, heterogeneous signal quality in language models, and attention sinks induced by unconstrained softmax. Every token is treated with uniform confidence. We show this uniformity is a degenerate case of our Bayesian Filtering Transformer (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian-plus-process-noise rule. Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior. BFT replaces any Transformer layer with negligible overhead. On…
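To make the three ingredients named in the abstract concrete, here is a minimal scalar-precision sketch in plain Python. It is not the paper's implementation: the function names, the scalar (rather than matrix) precisions, and the textbook Kalman and variance-propagation formulas are all assumptions made for illustration, and the REML precision estimator is not modeled at all.

```python
import math


def precision_weighted_attention(scores, values, precisions):
    """Softmax attention with each weight scaled by a per-token precision.

    Hypothetical helper: when all precisions are equal they cancel in the
    normalization, so this reduces to ordinary softmax attention.
    """
    m = max(scores)  # subtract the max for numerical stability
    raw = [p * math.exp(s - m) for s, p in zip(scores, precisions)]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


def kalman_residual(state, state_precision, obs, obs_precision):
    """Textbook scalar Kalman update standing in for the residual connection.

    gain = obs_precision / (state_precision + obs_precision); precisions add.
    """
    gain = obs_precision / (state_precision + obs_precision)
    new_state = [x + gain * (o - x) for x, o in zip(state, obs)]
    return new_state, state_precision + obs_precision, gain


def propagate_precision(precision, jacobian_gain, process_noise):
    """Push precision through a dynamics step: var' = J^2 * var + Q.

    jacobian_gain is an assumed scalar stand-in for the FFN Jacobian.
    """
    variance = (jacobian_gain ** 2) / precision + process_noise
    return 1.0 / variance


# Usage: a confident token (precision 3) dominates an uncertain one (precision 1).
attended, weights = precision_weighted_attention(
    scores=[0.0, 0.0], values=[[2.0], [0.0]], precisions=[3.0, 1.0]
)
state, precision, gain = kalman_residual([0.0], 1.0, attended, 3.0)
precision = propagate_precision(precision, jacobian_gain=1.0, process_noise=0.25)
```

In this sketch the uniform-confidence case corresponds to passing identical precisions to the attention step, and a positive process noise keeps the tracked precision from growing without bound across layers.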
