LoKO: Low-Rank Kalman Optimizer for Online Fine-Tuning of Large Models
Hossein Abdi, Mingfei Sun, Andi Zhang, Samuel Kaski, Wei Pan

TL;DR
LoKO introduces a low-rank Kalman filter-based optimizer for online fine-tuning of large models, reducing computational costs and improving convergence and performance across vision and language tasks.
Contribution
We formulate PEFT as an optimal filtering problem and develop LoKO, a novel low-rank Kalman optimizer that efficiently estimates parameters with reduced complexity.
Findings
LoKO converges faster than traditional optimizers.
LoKO achieves better performance on vision and language tasks.
The method significantly reduces computational complexity.
Abstract
Training large models with millions or even billions of parameters from scratch incurs substantial computational costs. Parameter Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), address this challenge by adapting only a reduced number of parameters to specific tasks with gradient-based optimizers. In this paper, we cast PEFT as an optimal filtering/state estimation problem and present Low-Rank Kalman Optimizer (LoKO) to estimate the optimal trainable parameters in an online manner. We leverage the low-rank decomposition in LoRA to significantly reduce matrix sizes in Kalman iterations and further capitalize on a diagonal approximation of the covariance matrix to effectively decrease computational complexity from quadratic to linear in the number of trainable parameters. Moreover, we discovered that the initialization of the covariance matrix within the…
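The diagonal-covariance idea in the abstract can be illustrated with a minimal sketch. It assumes a textbook extended Kalman filter measurement update with a constant-parameter state model and no process noise; the function name `diag_ekf_step` and its arguments are hypothetical and this is not the paper's exact algorithm. Storing only the diagonal `p` of the covariance keeps memory linear in the number of trainable parameters, where a full covariance would be quadratic.

```python
import numpy as np

def diag_ekf_step(theta, p, y, y_pred, H, R):
    """One EKF measurement update with a diagonal covariance (illustrative sketch).

    theta  : (n,)   current parameter estimate
    p      : (n,)   diagonal of the parameter covariance
    y      : (m,)   observed target
    y_pred : (m,)   model prediction h(theta)
    H      : (m, n) Jacobian of the prediction with respect to theta
    R      : (m, m) observation noise covariance (assumed known here)
    """
    HP = H * p                        # H @ diag(p) without forming an n x n matrix
    S = HP @ H.T + R                  # innovation covariance, shape (m, m)
    K = np.linalg.solve(S, HP).T      # Kalman gain diag(p) H^T S^{-1}; S is symmetric
    theta_new = theta + K @ (y - y_pred)
    # Keep only the diagonal of (I - K H) diag(p); off-diagonal terms are dropped.
    p_new = p - np.sum(K * H.T, axis=1) * p
    return theta_new, p_new

# Scalar example: the diagonal update coincides with the exact Kalman update.
theta, p = diag_ekf_step(
    theta=np.array([0.0]), p=np.array([1.0]),
    y=np.array([2.0]), y_pred=np.array([0.0]),
    H=np.array([[2.0]]), R=np.array([[1.0]]),
)
print(theta, p)
```

In this sketch the only matrix that is solved against has the size of the output dimension `m`, so the per-step cost grows linearly with the number of parameters `n`.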
Peer Reviews
Decision · Submitted to ICLR 2025
LoKO shows that Kalman optimizers work about as well with LoRA as they do with larger weight matrices in a neural network, offering faster convergence and efficient online learning.
"Novelty": The novelty of the work is hard to gauge and should be stated explicitly in the paper. The use of EKF as an alternative optimization strategy has already been pursued, and its efficacy for faster convergence has been demonstrated in previous works, some of which are cited in the paper. The diagonal approximation of the covariance matrix to reduce computation (among other approximation attempts, as in the cited Chang et al.) has also been utilized in a number of previous works regardi
Casting PEFT as a state estimation problem using the Kalman filter is a novel and insightful approach. The combination of LoRA with EKF addresses the scalability issues traditionally associated with Kalman filters in large-scale models. The diagonal approximation of the covariance matrix and EMA-based estimation significantly reduce computational complexity, making the approach practical for large models. The paper provides a clear and well-founded mathematical formulation of the proposed metho
The paper lacks a formal convergence analysis of the proposed algorithm. Without theoretical guarantees, it's unclear under what conditions LoKO is expected to perform reliably. The justification for the diagonal approximation of the covariance matrix is primarily empirical. A theoretical analysis or conditions under which this approximation holds would strengthen the contribution. The theoretical implications of using EMA for observation noise covariance estimation are not fully explored. Under
LoKO seems to be an interesting approach that successfully combines EKF and LoRA; this provides a good new perspective on online gradient-descent methods. This paper provides well-structured experiments across computer vision and NLP. The results in general are not state-of-the-art, but they are decent enough to show the promise of LoKO. The proof showing that the equation in the proposed approach is equivalent to that in the vanilla EKF algorithm seems to be sound. The p
The evaluation is limited. This paper has done quite a few evaluations of the proposed LoKO method. However, I find that none of those datasets is well suited to LoKO. In general, full fine-tuning can do better than LoRA fine-tuning. LoKO would be suboptimal compared to full fine-tuning as long as one can afford the GPU compute. The proposed experiments do not cover dynamic, streaming-type data, which would be able to show the strength of the proposed LoKO
Taxonomy
Topics: Simulation Techniques and Applications · Computer Graphics and Visualization Techniques · Distributed and Parallel Computing Systems
