Debiasing Mini-Batch Quadratics for Applications in Deep Learning

Lukas Tatzel; Bálint Mucsányi; Osane Hackel; Philipp Hennig

arXiv:2410.14325·cs.LG·January 29, 2025



TL;DR

This paper investigates the bias introduced by mini-batch approximations of quadratic functions in deep learning, providing theoretical insights and proposing strategies to correct this bias for improved optimization and uncertainty estimation.

Contribution

It identifies and explains the bias in mini-batch quadratic approximations, and develops debiasing methods to enhance second-order optimization and uncertainty quantification.

Findings

01

Bias causes systematic errors in quadratic approximations.

02

Debiasing strategies improve optimization accuracy.

03

Enhanced uncertainty quantification in deep learning models.

Abstract

Quadratic approximations form a fundamental building block of machine learning methods. E.g., second-order optimizers try to find the Newton step into the minimum of a local quadratic proxy to the objective function; and the second-order approximation of a network's loss function can be used to quantify the uncertainty of its outputs via the Laplace approximation. When computations on the entire training set are intractable - typical for deep learning - the relevant quantities are computed on mini-batches. This, however, distorts and biases the shape of the associated stochastic quadratic approximations in an intricate way with detrimental effects on applications. In this paper, we (i) show that this bias introduces a systematic error, (ii) provide a theoretical explanation for it, (iii) explain its relevance for second-order optimization and uncertainty quantification via the Laplace…
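The bias described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' code: it assumes a synthetic least-squares objective whose Hessian is $A^\top A / n$, and the function name `curvature_bias_demo` and all of its parameters are hypothetical. It picks the top eigenvector of a mini-batch Hessian as the direction and compares the curvature along that direction measured on the same mini-batch, on an independently drawn second mini-batch, and on the full data set.

```python
import numpy as np


def curvature_bias_demo(n=2000, dim=50, m=20, trials=300, seed=0):
    """Toy illustration (an assumption, not the paper's experiment).

    Returns the average directional curvature d^T H d, where d is the
    top eigenvector of a mini-batch Hessian, measured with
    (same) the same mini-batch, (two) an independent mini-batch,
    (true) the full-batch Hessian.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, dim))   # synthetic data matrix
    H_full = A.T @ A / n                # full-batch Hessian of 0.5 * ||Ax - b||^2 / n

    same, two, true = [], [], []
    for _ in range(trials):
        idx1 = rng.choice(n, size=m, replace=False)  # batch used to pick the direction
        idx2 = rng.choice(n, size=m, replace=False)  # independent batch for evaluation
        H1 = A[idx1].T @ A[idx1] / m
        H2 = A[idx2].T @ A[idx2] / m

        # eigh returns eigenvalues in ascending order; take the top eigenvector.
        _, vecs = np.linalg.eigh(H1)
        d = vecs[:, -1]

        same.append(d @ H1 @ d)      # direction and curvature from the same batch
        two.append(d @ H2 @ d)       # curvature from an independent batch
        true.append(d @ H_full @ d)  # reference: full-batch curvature along d
    return np.mean(same), np.mean(two), np.mean(true)


if __name__ == "__main__":
    same, two, true = curvature_bias_demo()
    print(f"same-batch: {same:.3f}  two-batch: {two:.3f}  full-batch: {true:.3f}")
```

In this toy setting the same-batch curvature comes out far larger than the full-batch value along the same direction, while the independent-batch estimate stays close to it on average. Whether this matches the specific debiasing procedure of the paper is not established by the text here.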

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The authors study a well-known but often overlooked issue in mini-batch approximations within second-order optimisation and uncertainty quantification for deep learning: the bias in curvature estimates due to mini-batch sampling. While the issue is not new, the two-batch debiasing strategy is an effective and novel (to my knowledge) way to address it and improve the accuracy of the approximations without incurring much computational overhead. This strategy could generalise well across v

Weaknesses

The experimental scope is fairly limited, with ablations only on CNN architectures and the CIFAR-10/100 datasets. Testing on a wider variety of architectures, such as transformers, or on larger datasets would considerably strengthen the paper’s claims of general applicability. Another limitation lies in the lack of ablations investigating the sensitivity of the two-batch approach to mini-batch size, as well as to the composition of the mini-batches used for debiasing. Without these, it’s unclear how the two-batch ba

Reviewer 02 · Rating 8 · Confidence 3

Strengths

This is an interesting theoretical investigation corroborated by numerical justifications. It teaches me something that I would expect, with a nice and convincing narrative. The overall clarity and language of this paper are good. The insights are well delivered. The theoretical analyses are mostly backed up by numerical experiments.

Weaknesses

I did not identify any obvious weaknesses. I did not check the proofs.

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- A computationally cheap estimate of second-order information is an important topic for optimization
- Ample experimental observations
- Design of computationally cheap experiments to study the second-order derivatives of the training loss of neural networks

Weaknesses

I believe the notion of unbiased estimate is not well defined and studied. Define the Hessian associated with the full-batch training loss as $$H(x) = \frac{1}{n} \sum_{i=1}^n \nabla^2 f_i(x)$$. Similarly, we define the unbiased estimate of this matrix as $$H'(x) = \frac{1}{m} \sum_{k=1}^m \nabla^2 f_{i_k}(x)$$ where $i_k$ are uniformly drawn from $\{1,\dots, n\}$. For each fixed vector $d$, we have $$\mathbb{E}[d^\top H' d] = d^\top H d$$. Thus, the **directional curvature**, defined in the paper, is u
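The identity in this review can be spelled out in the review's own notation, assuming the direction $d$ is fixed independently of the sampled indices $i_k$. The second line is a standard Jensen-type observation, not part of the review, sketching what changes when the direction is instead chosen from the same mini-batch (here, as the top eigenvector of $H'(x)$, with $H'(x)$ symmetric):

$$\mathbb{E}\left[d^\top H'(x)\, d\right] = d^\top\, \mathbb{E}\left[H'(x)\right] d = d^\top H(x)\, d \qquad \text{(fixed } d\text{)}$$

$$\mathbb{E}\left[\lambda_{\max}(H'(x))\right] = \mathbb{E}\left[\max_{\|d\|=1} d^\top H'(x)\, d\right] \;\ge\; \max_{\|d\|=1} \mathbb{E}\left[d^\top H'(x)\, d\right] = \lambda_{\max}(H(x))$$

The first line uses only linearity of expectation. The inequality in the second line holds because the expectation of a maximum is at least the maximum of the expectations, so unbiasedness for every fixed direction does not carry over to a direction selected from the same sample.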

Videos

Debiasing Mini-Batch Quadratics for Applications in Deep Learning · SlidesLive

Taxonomy

Topics: Electromagnetic Scattering and Analysis · Matrix Theory and Algorithms

Methods: Sparse Evolutionary Training