CAdam: Confidence-Based Optimization for Online Learning
Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Di Wang, Jie Jiang, Jian Li

TL;DR
CAdam is a confidence-based optimizer designed for online learning that improves adaptation to distribution shifts and noise, outperforming Adam in recommendation systems and live A/B testing.
Contribution
It introduces a confidence mechanism to selectively update parameters, enhancing Adam's performance under distribution shifts and noisy data in online learning.
Findings
CAdam outperforms Adam in distribution shift scenarios.
CAdam improves recommendation system metrics in live A/B tests.
CAdam increases gross merchandise volume in real-world applications.
Abstract
Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which integrates momentum ($m_t$) and adaptive learning rate ($v_t$). However, the volatile nature of online learning data, characterized by its frequent distribution shifts and presence of noise, poses significant challenges to Adam's standard optimization process: (1) Adam may use outdated momentum and the average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam's performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on…
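The per-dimension consistency check described in the abstract can be sketched as a single masked Adam step. This is a minimal NumPy illustration, not the authors' implementation: the function name `cadam_step`, the default hyperparameters, and the choice to test the sign of the freshly updated momentum against the current gradient are assumptions made here, following the mask $\hat{m}_t \leftarrow \hat{m}_t \odot \mathbb{1}[m_t \odot g_t > 0]$ quoted in the reviews.

```python
import numpy as np


def cadam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with a sign-agreement mask (hypothetical sketch)."""
    # Standard Adam moment estimates.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    # Standard Adam bias correction (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Confidence mask: keep a coordinate only where momentum and gradient
    # share a sign (assumption: the updated momentum m is the one compared).
    mask = (m * grad > 0).astype(theta.dtype)
    m_hat = m_hat * mask

    # Masked coordinates receive a zero update; m and v are still refreshed.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Toy example: coordinate 0 agrees with its momentum, coordinate 1 does not.
theta = np.array([1.0, 1.0])
m = np.array([0.5, 0.5])
v = np.array([0.25, 0.25])
grad = np.array([1.0, -1.0])
new_theta, new_m, new_v = cadam_step(theta, grad, m, v, t=10)
```

In the toy example the second coordinate's gradient opposes its momentum, so that parameter is left unchanged for this step while its moment estimates continue to track the new gradient.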
Peer Reviews
Decision: Submitted to ICLR 2026
Originality: The paper introduces a per‑coordinate confidence gate that updates a parameter only when the sign of the momentum and the current gradient agree, implemented as a one‑line mask $\hat{m}_t \leftarrow \hat{m}_t \odot \mathbb{1}[m_t \odot g_t > 0]$ in Algorithm 1 (line 14). This reframes gradient–momentum alignment as a practical proxy for confidence—an elegant, minimal change that preserves Adam's structure while directly targeting online non‑stationarity and noisy feedback. The intu
Limited Theoretical Novelty: While the paper provides a sound convergence analysis, it largely adapts existing frameworks (e.g., Li et al., 2023) with minimal theoretical innovation. The confidence-based masking mechanism, though intuitively appealing, is described as a binary gating function based on the sign of gradient–momentum alignment, which may oversimplify real-world uncertainty. A more rigorous analysis (e.g., probabilistic confidence modeling or adaptive thresholds) could strengthen t
- The paper addresses a realistic problem in online learning: Adam often struggles when data distributions shift or when noisy labels appear. This is an important setting in recommender systems and streaming learning, and the motivation is easy to follow.
- The authors report long-term deployment in production, with multiple online scenarios and stable performance improvements. This kind of evidence is rare in optimizer papers and adds credibility to the practical value of the method.
- The core change is to skip updates when momentum and gradient disagree in sign. This is a small modification that resembles earlier ideas such as AdaBelief or Cautious Optimizers, which also adjust step sizes based on confidence in the current gradient. The paper presents a clear engineering improvement, but not a conceptual breakthrough.
- The convergence proof is almost a direct adaptation of Li et al. (2023), which already established convergence under relaxed smoothness assumptions. The
Clear motivation and simple implementation. The idea of incorporating a “confidence” check based on the agreement between momentum and gradient is intuitive, easy to code, and compatible with existing Adam variants (AMSGrad, AdamW).
Readable presentation. The paper is well structured, includes algorithm pseudocode, and connects clearly to known Adam literature.
Heuristic motivation and lack of rigorous insight. The proposed “confidence mask” is largely an intuitive rule rather than a principled optimization mechanism. Adam’s momentum disagreement with the current gradient is common and often necessary for escaping noise or curvature effects. The paper treats this as an error signal, but in nonconvex stochastic regimes, sign disagreement is expected and not inherently harmful.
Potential non-convergence under noise. The meth
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Adam
