SVRG and Beyond via Posterior Correction

Nico Daheim; Thomas Möllenhoff; Ming Liang Ang; Mohammad Emtiyaz Khan

arXiv:2512.01930·cs.LG·December 2, 2025

Open Access · 3 Reviews

TL;DR

This paper shows that SVRG is a special case of a Bayesian method called posterior correction, and uses the connection to derive two new variants: a Newton-like method with Hessian corrections and an Adam-like extension that improves pretraining and finetuning of Transformer language models.

Contribution

The paper establishes a novel connection between SVRG and Bayesian posterior correction, and uses it to derive new SVRG variants for training deep networks.

Findings

01. SVRG is a special case of posterior correction over the isotropic-Gaussian family.

02. The paper introduces a Newton-like SVRG variant with Hessian corrections.

03. The paper develops an Adam-like extension that improves pretraining and finetuning of Transformer language models.

Abstract

Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections, but have seen limited success in deep learning. Here, we show surprising new foundational connections of SVRG to a recently proposed Bayesian method called posterior correction. Specifically, we show that SVRG is recovered as a special case of posterior correction over the isotropic-Gaussian family, while novel extensions are automatically obtained by using more flexible exponential families. We derive two new SVRG variants by using Gaussian families: first, a Newton-like variant that employs novel Hessian corrections, and second, an Adam-like extension that improves pretraining and finetuning of Transformer language models. This is the first work to connect SVRG to Bayes and use it to boost variational training for deep networks.
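For readers unfamiliar with the gradient corrections the abstract refers to, the following is a minimal sketch of plain SVRG, the baseline that the paper recovers as a special case. It is not the paper's posterior-correction method or either of its new variants. The function names (`svrg`, `grad_i`), the toy least-squares data, and the hyperparameters are all assumptions chosen for illustration.

```python
import random


def svrg(grad_i, n, w0, lr, epochs, inner_steps, seed=0):
    """Plain SVRG; grad_i(w, i) returns the gradient of the i-th loss at w."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(epochs):
        w_ref = list(w)  # snapshot (reference point) for this epoch
        # Full gradient at the snapshot, computed once per epoch.
        full = [0.0] * len(w)
        for i in range(n):
            g = grad_i(w_ref, i)
            full = [f + gj / n for f, gj in zip(full, g)]
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_new = grad_i(w, i)
            g_old = grad_i(w_ref, i)
            # Corrected stochastic gradient: g_i(w) - g_i(w_ref) + full gradient.
            w = [wj - lr * (a - b + c)
                 for wj, a, b, c in zip(w, g_new, g_old, full)]
    return w


# Toy problem (hypothetical data): noise-free points on the line y = 2x + 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]


def grad_i(w, i):
    """Gradient of 0.5 * (w[0]*x_i + w[1] - y_i)**2 with respect to w."""
    r = w[0] * xs[i] + w[1] - ys[i]
    return [r * xs[i], r]


w_hat = svrg(grad_i, n=len(xs), w0=[0.0, 0.0], lr=0.05, epochs=60, inner_steps=100)
print(w_hat)  # slope and intercept estimates; the exact minimizer is (2, 1)
```

The correction term `g_old` and the full gradient `full` are both evaluated at the snapshot `w_ref`, so the corrected gradient stays unbiased while its variance shrinks as `w` and `w_ref` approach the minimizer.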

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The connection between SVRG and posterior correction is sound.
- An advantage of the connection is that it generalizes the variance reduction method to higher-order optimizers, leading to a novel IVON-PC.

Weaknesses

- While the theoretical analysis provides a fresh perspective on SVRG, the resulting IVON-PC method does not appear to offer practical benefits.
- Figure 3 shows an initial improvement, but IVON-PC ultimately requires a similar number of gradient computations as SVRG to reach comparable final performance. Likewise, Figure 5 demonstrates that their final performance remains nearly identical.
- The gains reported in Table 5 are also insignificant, especially when considering the error bars.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. An original contribution: the paper establishes a connection between SVRG and posterior correction. The interpretation reframes variance reduction as a form of knowledge transfer, a new perspective that unifies two previously separate research threads.
2. The authors successfully extend this connection to derive two new SVRG variants: One is a Newton-like method incorporating stochastic variance reduction for both gradients and Hessians, introducing Hessian corrections rarely explored in prior work

Weaknesses

1. Compared with well-established optimization methods, e.g., AdamW, posterior correction does not yield clear improvements in deep learning tasks.
2. Can the variance reduction method be applied to reinforcement learning? The authors could run experiments on either RLVR for LLM post-training or other traditional RL tasks. The method could also be compared with TRPO (Trust Region Policy Optimization).

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The mathematical connection between SVRG and PC is novel.
- A strong and wide-ranging set of experiments together with a comprehensive set of ablations demonstrating the power of combining IVON with PC.

Weaknesses

- While the equivalence of SVRG as a special case of PC is interesting, the paper reads like a follow-up to Khan (2025) that demonstrates how PC and IVON can be combined to markedly improve the performance of the latter, a comparison that was missing in Khan (2025). The SVRG connection feels like an afterthought that does not provide deeper insights, e.g., the extent to which convergence properties from SVRG still hold or can be extended to non-isotropic Gaussians, or the extent to which smoothn

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Natural Language Processing Techniques