CAdam: Confidence-Based Optimization for Online Learning

Shaowen Wang; Anan Liu; Jian Xiao; Huan Liu; Yuekui Yang; Cong Xu; Qianqian Pu; Suncong Zheng; Wei Zhang; Di Wang; Jie Jiang; Jian Li

arXiv:2411.19647·cs.LG·June 5, 2025

Open Access·3 Reviews

TL;DR

CAdam is a confidence-based optimizer designed for online learning that improves adaptation to distribution shifts and noise, outperforming Adam in recommendation systems and live A/B testing.

Contribution

The paper introduces a per-parameter confidence mechanism that selectively applies updates, improving on Adam under distribution shifts and noisy data in online learning.

Findings

01. CAdam outperforms Adam in distribution shift scenarios.

02. CAdam improves recommendation system metrics in live A/B tests.

03. CAdam increases gross merchandise volume in real-world applications.

Abstract

Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which integrates momentum ( $m_{t}$ ) and adaptive learning rate ( $v_{t}$ ). However, the volatile nature of online learning data, characterized by its frequent distribution shifts and presence of noise, poses significant challenges to Adam's standard optimization process: (1) Adam may use outdated momentum and the average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam's performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on…
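For reference, the two quantities named in the abstract can be written out using the standard Adam update. This is the textbook formulation, not a reproduction of the paper's Algorithm 1; the symbols $\beta_1$, $\beta_2$, $\eta$, $\epsilon$, and $\theta_t$ are the usual hyperparameters and parameter vector and are assumptions here, since the page itself does not define them.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

Because $m_t$ and $v_t$ are exponential moving averages, both carry information from earlier gradients, which is the sense in which the abstract calls them potentially outdated after a distribution shift.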

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 3

Strengths

Originality: The paper introduces a per‑coordinate confidence gate that updates a parameter only when the sign of the momentum and the current gradient agree, implemented as a one‑line mask $\hat{m}_t \leftarrow \hat{m}_t \odot \mathbb{1}[m_t \odot g_t > 0]$ in Algorithm 1 (line 14). This reframes gradient–momentum alignment as a practical proxy for confidence—an elegant, minimal change that preserves Adam's structure while directly targeting online non‑stationarity and noisy feedback. The intu
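The mask quoted in this review can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `masked_adam_step`, the default hyperparameters, and the choice to keep updating $m_t$ and $v_t$ on masked coordinates are all assumptions, since only the single masking line is given on this page.

```python
import numpy as np

def masked_adam_step(theta, grad, m, v, t,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with a sign-agreement mask (hypothetical sketch)."""
    # Standard Adam moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Standard bias correction (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Mask from the review: keep a coordinate only where m_t * g_t > 0.
    mask = (m * grad > 0).astype(theta.dtype)
    m_hat = m_hat * mask
    # Masked coordinates receive a zero update.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Coordinate 0: momentum and gradient agree. Coordinate 1: they disagree.
theta0 = np.zeros(2)
grad = np.array([1.0, 1.0])
m_prev = np.array([1.0, -1.0])
v_prev = np.zeros(2)
theta1, m1, v1 = masked_adam_step(theta0, grad, m_prev, v_prev, t=1)
```

In this toy call only the first coordinate moves; the second is left unchanged because its updated momentum still points against the current gradient.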

Weaknesses

Limited Theoretical Novelty: While the paper provides a sound convergence analysis, it largely adapts existing frameworks (e.g., Li et al., 2023) with minimal theoretical innovation. The confidence-based masking mechanism, though intuitively appealing, is described as a binary gating function based on the sign of gradient–momentum alignment, which may oversimplify real-world uncertainty. A more rigorous analysis (e.g., probabilistic confidence modeling or adaptive thresholds) could strengthen t

Reviewer 02·Rating 2·Confidence 3

Strengths

- The paper addresses a realistic problem in online learning: Adam often struggles when data distributions shift or when noisy labels appear. This is an important setting in recommender systems and streaming learning, and the motivation is easy to follow.
- The authors report long-term deployment in production, with multiple online scenarios and stable performance improvements. This kind of evidence is rare in optimizer papers and adds credibility to the practical value of the method.

Weaknesses

- The core change is to skip updates when momentum and gradient disagree in sign. This is a small modification that resembles earlier ideas such as AdaBelief or Cautious Optimizers, which also adjust step sizes based on confidence in the current gradient. The paper presents a clear engineering improvement, but not a conceptual breakthrough.
- The convergence proof is almost a direct adaptation of *Li et al. (2023)*, which already established convergence under relaxed smoothness assumptions. The

Reviewer 03·Rating 2·Confidence 4

Strengths

$\textbf{Clear motivation and simple implementation.}$ The idea of incorporating a “confidence” check based on the agreement between momentum and gradient is intuitive, easy to code, and compatible with existing Adam variants (AMSGrad, AdamW).

$\textbf{Readable presentation.}$ The paper is well structured, includes algorithm pseudocode, and connects clearly to known Adam literature.

Weaknesses

$\textbf{Heuristic motivation and lack of rigorous insight.}$ The proposed “confidence mask” is largely an intuitive rule rather than a principled optimization mechanism. Adam’s momentum disagreement with the current gradient is common and often necessary for escaping noise or curvature effects. The paper treats this as an error signal, but in nonconvex stochastic regimes, sign disagreement is expected and not inherently harmful.

$\textbf{Potential non-convergence under noise.}$ The meth

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Adam