AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization
Wendyam Eric Lionel Ilboudo, Taisuke Kobayashi, Takamitsu Matsubara

TL;DR
AdaTerm introduces a unified, noise-adaptive stochastic gradient optimizer based on the Student's t-distribution, improving robustness and performance in noisy deep learning scenarios.
Contribution
AdaTerm is presented as the first optimizer to incorporate all associated statistics of the Student's t-distribution into the optimization process, which enhances robustness and reduces the number of hyperparameters.
Findings
Demonstrates superior robustness across various noisy datasets.
Achieves better learning performance compared to existing optimizers.
Provides a new theoretical regret bound that does not rely on AMSGrad.
Abstract
With the increasing practicality of deep learning applications, practitioners are inevitably faced with datasets corrupted by noise from various sources such as measurement errors, mislabeling, and estimated surrogate inputs/outputs that can adversely impact the optimization results. It is a common practice to improve the optimization algorithm's robustness to noise, since this algorithm is ultimately in charge of updating the network parameters. Previous studies revealed that the first-order moment used in Adam-like stochastic gradient descent optimizers can be modified based on the Student's t-distribution. While this modification led to noise-resistant updates, the other associated statistics remained unchanged, resulting in inconsistencies in the assumed models. In this paper, we propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the…
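The idea of a Student's t-based first-order moment mentioned in the abstract can be illustrated with a minimal sketch. This is not AdaTerm's update rule, which the truncated abstract does not spell out. It is a simplified scalar example of a t-weighted running mean, in which a gradient far from the current estimate receives a small weight. The function names (`t_weight`, `robust_first_moment`, `ema_first_moment`), the default values, the plain moving average used for the variance, and the decay applied to the accumulated weight are all assumptions made for illustration.

```python
def t_weight(grad, mean, var, dof, eps=1e-8):
    """Student's-t style weight for a scalar: large for typical samples, small for outliers."""
    # Squared deviation from the running mean, scaled by the running variance.
    dist = (grad - mean) ** 2 / (var + eps)
    return (dof + 1.0) / (dof + dist)


def robust_first_moment(grads, dof=1.0, beta=0.9):
    """Weighted running mean where each gradient's influence follows its t-weight.

    Illustrative only: the variance update and the weight decay are
    simplifications chosen for this sketch, not taken from the paper.
    """
    mean, var = 0.0, 1.0
    weight_sum = beta / (1.0 - beta)  # pseudo-count standing in for past samples
    for g in grads:
        w = t_weight(g, mean, var, dof)
        new_mean = (weight_sum * mean + w * g) / (weight_sum + w)
        # Ordinary exponential moving average of squared deviations.
        var = beta * var + (1.0 - beta) * (g - new_mean) ** 2
        mean = new_mean
        weight_sum = beta * weight_sum + w
    return mean


def ema_first_moment(grads, beta=0.9):
    """Standard Adam-style exponential moving average, for comparison."""
    mean = 0.0
    for g in grads:
        mean = beta * mean + (1.0 - beta) * g
    return mean


if __name__ == "__main__":
    # One hundred clean gradients followed by a single corrupted one.
    grads = [1.0] * 100 + [1000.0]
    print("t-weighted mean:", robust_first_moment(grads))
    print("plain EMA:      ", ema_first_moment(grads))
```

In this toy setting the single corrupted gradient shifts the plain moving average substantially, while the t-weighted estimate stays close to the clean value because the outlier's weight is nearly zero.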
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Advanced Neural Network Applications
Methods: AMSGrad · Stochastic Gradient Descent
