ADAM Optimization with Adaptive Batch Selection

Gyu Yeol Kim; Min-hwan Oh

arXiv:2512.06795·stat.ML·December 9, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces AdamCB, an improved adaptive optimizer that uses combinatorial bandit sampling to enhance convergence and practical performance in neural network training.

Contribution

It integrates combinatorial bandit techniques into Adam, providing better theoretical guarantees and empirical performance over previous adaptive sampling methods.

Findings

1. AdamCB achieves faster convergence than previous Adam variants.
2. Numerical experiments show AdamCB consistently outperforms existing methods.
3. Theoretical analysis confirms improved regret bounds for AdamCB.

Abstract

Adam is a widely used optimizer in neural network training due to its adaptive learning rate. However, because different data samples influence model updates to varying degrees, treating them equally can lead to inefficient convergence. To address this, a prior work proposed adapting the sampling distribution using a bandit framework to select samples adaptively. While promising, the bandit-based variant of Adam suffers from limited theoretical guarantees. In this paper, we introduce Adam with Combinatorial Bandit Sampling (AdamCB), which integrates combinatorial bandit techniques into Adam to resolve these issues. AdamCB is able to fully utilize feedback from multiple samples at once, enhancing both theoretical guarantees and practical performance. Our regret analysis shows that AdamCB achieves faster convergence than Adam-based methods including the previous bandit-based variant.…
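The idea in the abstract, sampling each mini-batch from a learned distribution and feeding back information from every sample in the batch, can be sketched as follows. This is a minimal illustration and not the paper's AdamCB algorithm: it assumes an EXP3-style multiplicative-weights update, sampling with replacement, a gradient-norm reward, and a toy least-squares loss. The function name `adam_with_bandit_batches` and the parameters `gamma` and `eta` are hypothetical names chosen for this example.

```python
import numpy as np

def adam_with_bandit_batches(X, y, steps=300, batch_size=8, lr=0.05,
                             gamma=0.2, eta=0.01, seed=0):
    """Toy sketch (assumption, not AdamCB itself): Adam on least squares
    with an EXP3-style sampling distribution over training samples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    m, v = np.zeros(d), np.zeros(d)      # Adam first and second moments
    log_w = np.zeros(n)                  # one bandit arm per sample
    beta1, beta2, eps = 0.9, 0.999, 1e-8

    for t in range(1, steps + 1):
        # Sampling distribution: softmax of weights mixed with uniform
        # exploration, so every probability is at least gamma / n.
        w = np.exp(log_w - log_w.max())
        p = (1.0 - gamma) * w / w.sum() + gamma / n

        # Draw the batch (with replacement, a simplifying assumption).
        idx = rng.choice(n, size=batch_size, replace=True, p=p)

        # Per-sample gradients of 0.5 * (x @ theta - y)**2.
        resid = X[idx] @ theta - y[idx]
        grads = resid[:, None] * X[idx]

        # Importance weighting keeps the estimate of the mean gradient
        # unbiased under non-uniform sampling.
        g = (grads / (n * p[idx, None])).mean(axis=0)

        # Standard Adam step with bias correction.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

        # Bandit feedback from all sampled points at once: samples with
        # larger gradient norms receive larger (importance-weighted) reward.
        norms = np.linalg.norm(grads, axis=1)
        reward = norms / (norms.max() + 1e-12)
        np.add.at(log_w, idx, eta * reward / (n * p[idx]))

    return theta, p
```

Calling `adam_with_bandit_batches(X, y)` on a feature matrix `X` of shape `(n, d)` and targets `y` of shape `(n,)` returns the fitted parameters and the last sampling distribution. The uniform mixing term bounds the importance weights by `1 / gamma`, which keeps the gradient estimate from blowing up when a sample's probability becomes small.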

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. This paper rigorously addresses and corrects the theoretical flaws in convergence guarantees for both Adam and AdamBS, presenting refined proofs that offer independent value to the community.
2. The proposed AdamCB algorithm is shown to be both theoretically robust and empirically effective, with rigorous analysis and extensive experimental validation.
3. The paper presents the method and supporting claims clearly, facilitating reader comprehension and enhancing accessibility of the technical

Weaknesses

1. The paper's motivation could benefit from further clarification and depth. In the abstract and introduction, the authors state that uniform sampling leads to inefficiencies in Adam, but they should specify the type of inefficiency (e.g., memory, computational, time, or convergence efficiency; presumably the latter). Additionally, the authors should provide evidence or discussion showing that alternative sampling methods indeed improve Adam's efficiency, strengthening the case for the proposed

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The primary strengths of this paper are as follows:
- The writing is clear and easy to understand.
- The assumptions required for the convergence analysis are more general than in previous work, and the theory also covers Adam and AdamBS.
- The paper identifies and addresses incorrect assumptions made in the analysis of AdamBS.

Weaknesses

- The experiments are fairly limited in scale and not very reflective of practical settings that we are in these days with large models and large datasets.
- The benefits of adaptive selection will likely be limited in settings with large model dimensionality since then the $d\sqrt{T}$ term will dominate the $\sqrt{d}/n^{3/4}T^{3/4}$ term controlled by adaptive selection. The potentially marginal improvement of adaptive selection is not discussed as far as I can tell in the main paper (it is di

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. This paper clearly points out the flaw in the proofs of previous papers.
2. This paper provides a new convergence rate for Adam/AdamBS from a new perspective.
3. The results on some simple datasets are better than the baselines.

Weaknesses

I appreciate the theoretical contribution, and my major concern is the applicability of this method in real-world applications. Since the optimizer is one of the most fundamental components of the entire machine learning process, I wonder if this method can be directly integrated into existing practices. Additionally, I question whether the performance remains robust given the extra effort required to assign sampling bias towards certain samples during batch construction. Please see the question

Taxonomy

Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification