Variational Deep Learning via Implicit Regularization

Jonathan Wenger; Beau Coker; Juraj Marusic; John P. Cunningham

arXiv:2505.20235·cs.LG·March 17, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a novel regularization method for variational neural networks that leverages the implicit bias of gradient descent, leading to improved in- and out-of-distribution performance with minimal tuning.

Contribution

It provides a theoretical characterization of the implicit regularization effect in overparametrized models and empirically demonstrates its effectiveness without extra hyperparameter tuning.

Findings

01. Strong in-distribution performance

02. Robust out-of-distribution generalization

03. Minimal computational overhead

Abstract

Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Instead, current theory credits implicit regularization imposed by the choice of architecture, hyperparameters, and optimization procedure. However, deep neural networks can be surprisingly non-robust, resulting in overconfident predictions and poor out-of-distribution generalization. Bayesian deep learning addresses this via model averaging, but typically requires significant computational resources as well as carefully elicited priors to avoid overriding the benefits of implicit regularization. Instead, in this work, we propose to regularize variational neural networks solely by relying on the implicit bias of (stochastic) gradient descent. We theoretically characterize this inductive bias in overparametrized linear models as…
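The implicit bias of gradient descent that the abstract refers to has a well-known instance in overparametrized linear regression: with squared loss, gradient descent converges to the interpolating solution closest to its initialization in Euclidean norm. The following is a minimal NumPy sketch of that classical fact only, not of the paper's variational method; the problem sizes, the random data, and the name `gradient_descent_least_squares` are assumptions made for illustration.

```python
import numpy as np

def gradient_descent_least_squares(X, y, w0, lr, steps):
    """Plain gradient descent on 0.5 * ||X w - y||^2, starting from w0."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

# Hypothetical toy problem: fewer observations than parameters.
rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w0 = rng.normal(size=d)  # arbitrary initialization

# Step size 1 / (largest eigenvalue of X^T X) keeps the iteration stable.
lr = 1.0 / np.linalg.norm(X, 2) ** 2
w_gd = gradient_descent_least_squares(X, y, w0, lr, steps=5000)

# Closed form: the interpolating solution closest to w0 in Euclidean norm.
w_closest = w0 + np.linalg.pinv(X) @ (y - X @ w0)

print(np.allclose(X @ w_gd, y), np.allclose(w_gd, w_closest))
```

Gradient updates only ever move the weights within the row space of `X`, so the component of `w0` orthogonal to the data is left untouched. No explicit penalty is needed to select this particular solution.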

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The ideas of the paper are central, yet deep, and well articulated by the authors.
- To my knowledge, the contributions are novel in the literature and interestingly meld together the ideas of Bayesian ensembles of models with variational inference and optimization. I find the takeaway message of the paper to be quite “unifying”.
- Extensibility to two different problem settings (regression and classification) augments the impact of the work.
- The authors provide sufficient background to make

Weaknesses

- The selling point of the paper falls short of the results: deep learning is not included in the main results of S4, which assume overparameterized linear models. I still find the contribution interesting even in this context, but I’m unsure how much the results can be said to explain the success of “deep learning”. I understand the discussion on page 5 and the CIFAR experiment tie it more closely to practical deep learning, but the theoretical results do not explain the success of deep learnin

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The paper is well written and structured.
- The authors propose training a Bayesian neural network using only the expected loss, removing the divergence term from the variational objective. They then show that among the variational parameters (assumed to follow a Gaussian distribution) that minimize the expected loss, SGD converges to the one closest to the prior in terms of Wasserstein-2 distance, analogous to how SGD converges to the minimum-norm solution in overparameterised linear network
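For reference, the Wasserstein-2 distance mentioned in this review has a simple closed form when both distributions are Gaussians with diagonal covariance: the squared distance is the squared difference of the means plus the squared difference of the standard deviations. The sketch below implements that standard formula; the function name `w2_diag_gaussians` and the example values are hypothetical, and the code does not reproduce the paper's theorem or parametrization.

```python
import numpy as np

def w2_diag_gaussians(mean_q, std_q, mean_p, std_p):
    """Wasserstein-2 distance between N(mean_q, diag(std_q^2)) and N(mean_p, diag(std_p^2))."""
    mean_q, std_q, mean_p, std_p = (
        np.asarray(a, dtype=float) for a in (mean_q, std_q, mean_p, std_p)
    )
    squared = np.sum((mean_q - mean_p) ** 2) + np.sum((std_q - std_p) ** 2)
    return float(np.sqrt(squared))

# Hypothetical example: equal standard deviations, means that differ by (3, 4).
distance = w2_diag_gaussians([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
print(distance)
```

When the standard deviations match, the distance reduces to the Euclidean distance between the means, which is what makes the analogy to minimum-norm solutions natural.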

Weaknesses

- The theoretical analysis is limited to an overparameterized linear model (one layer). Extending the proof to more than 2 layers might not be straightforward.
- The proof assumes a Gaussian variational distribution, and the theorems do not generalize to other distributions, which might be more effective for other problem settings or datasets.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. Very sound approach and presentation. I very much like the simplicity of not having an extra divergence term, with its numerical issues, to deal with.
2. Theorem 1 is very nice and gives an intuition as to why this could work for much larger models.
3. Experiments are convincing and comprehensive across diverse tasks and datasets.
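To make the "no extra divergence term" point concrete, the sketch below separates the two pieces of a typical variational objective for a toy linear model with a mean-field Gaussian over the weights: a Monte Carlo estimate of the expected loss, and a KL term to a prior. Training on the expected loss alone, as the reviews describe, would simply omit the second function. Everything here is an assumption for illustration (the linear model, the squared loss, the standard normal prior, and both function names); it is not the authors' implementation.

```python
import numpy as np

def expected_squared_loss(mean, std, X, y, n_samples=256, seed=0):
    """Monte Carlo estimate of E[mean((X w - y)^2)] for w ~ N(mean, diag(std^2))."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, mean.shape[0]))
    W = mean + std * eps         # reparameterized weight samples, shape (n_samples, d)
    residuals = W @ X.T - y      # shape (n_samples, n)
    return float(np.mean(residuals ** 2))

def kl_to_standard_normal(mean, std):
    """KL(N(mean, diag(std^2)) || N(0, I)): the kind of term that is left out."""
    return float(0.5 * np.sum(std ** 2 + mean ** 2 - 1.0 - 2.0 * np.log(std)))

# Hypothetical toy data and variational parameters.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
mean, std = np.zeros(3), np.ones(3)

objective = expected_squared_loss(mean, std, X, y)  # expected loss only, no KL added
print(objective, kl_to_standard_normal(mean, std))
```

The KL term involves `log(std)`, which diverges as a standard deviation approaches zero; the expected-loss term has no such singularity.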

Weaknesses

1. This method seems to be possibly limited to mean-field variational posteriors, and only considers the Wasserstein distance as the divergence term.
2. Missing baseline comparison: it would be nice if you could compare against a variational method that uses Wasserstein distance explicitly in the "ELBO", possibly https://proceedings.neurips.cc/paper_files/paper/2022/file/18210aa6209b9adfc97b8c17c3741d95-Paper-Conference.pdf or any other paper that proposes Wasserstein distance explicitly as a dive

Taxonomy

Topics: Face and Expression Recognition

Methods: Variational Inference