PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learning
Arnulf Jentzen, Julian Kranz, Adrian Riekert

TL;DR
PADAM is a parallel averaging method for ADAM that dynamically selects the best averaged variant during training. It reduces the optimization error without extra gradient evaluations and is especially effective in scientific machine learning tasks.
Contribution
Introduces PADAM, a parallel averaged ADAM approach that adaptively chooses the optimal averaging variant during training without additional gradient computations.
Findings
PADAM achieves the lowest optimization error in most tested problems.
PADAM is effective in scientific machine learning applications such as PDE and control problems.
Requires no more gradient evaluations than standard ADAM.
Abstract
Averaging techniques such as Ruppert--Polyak averaging and exponential moving averaging (EMA) are powerful approaches to accelerate optimization procedures of stochastic gradient descent (SGD) optimization methods such as the popular ADAM optimizer. However, depending on the specific optimization problem under consideration, the type and the parameters for the averaging need to be adjusted to achieve the smallest optimization error. In this work we propose an averaging approach, which we refer to as parallel averaged ADAM (PADAM), in which we compute different averaged variants of ADAM in parallel and during the training process dynamically select the variant with the smallest optimization error. A central feature of this approach is that this procedure requires no more gradient evaluations than the usual ADAM optimizer as each of the averaged trajectories relies on the same underlying…
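The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `padam_sketch`, the choice of EMA decay rates, the evaluation interval `eval_every`, and the use of a user-supplied `loss_fn` as the estimate of the optimization error are all assumptions made for illustration, since the abstract above is truncated and does not state how the error is measured. The sketch runs a single ADAM trajectory, keeps several EMA-averaged copies of its iterates (each updated as `avg = d * avg + (1 - d) * theta`), and periodically picks the copy with the smallest loss. Only the ADAM step calls `grad_fn`, so the number of gradient evaluations equals the number of steps.

```python
import numpy as np

def padam_sketch(grad_fn, loss_fn, theta0, steps=2000, lr=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8,
                 ema_decays=(0.0, 0.9, 0.99, 0.999), eval_every=100):
    """Hypothetical sketch: one Adam trajectory, several averaged copies."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # first-moment estimate
    v = np.zeros_like(theta)  # second-moment estimate
    # One averaged copy per decay rate; decay 0.0 reproduces the plain Adam iterate.
    averages = [theta.copy() for _ in ema_decays]
    grad_evals = 0
    best_idx = 0
    for t in range(1, steps + 1):
        g = grad_fn(theta)  # the only gradient evaluation in this step
        grad_evals += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        # All averaged variants reuse the same Adam iterate: no extra gradients.
        for i, d in enumerate(ema_decays):
            averages[i] = d * averages[i] + (1 - d) * theta
        # Assumed selection rule: smallest loss among the averaged copies.
        if t % eval_every == 0:
            losses = [loss_fn(a) for a in averages]
            best_idx = int(np.argmin(losses))
    return averages[best_idx], ema_decays[best_idx], grad_evals

# Usage on a toy noisy quadratic (illustrative problem, not from the paper).
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
grad_fn = lambda th: (th - target) + 0.5 * rng.standard_normal(th.shape)
loss_fn = lambda th: 0.5 * float(np.sum((th - target) ** 2))

best, chosen_decay, grad_evals = padam_sketch(grad_fn, loss_fn, np.zeros(2))
print("selected decay:", chosen_decay, "gradient evaluations:", grad_evals)
```

In this sketch the selection step costs additional loss evaluations but no additional gradient evaluations, which matches the property highlighted in the abstract. A Ruppert--Polyak (uniform) average could be added as a further variant in the same loop.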
Taxonomy
Topics: Machine Learning and Data Classification
Methods: Stochastic Gradient Descent · Adam
