Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization
Corrado Coppola, Lorenzo Papa, Irene Amerini, Laura Palagi

TL;DR
This paper introduces F-CMA, a novel mini-batch optimization algorithm with line-search and convergence guarantees, significantly improving training speed and accuracy for deep learning models.
Contribution
F-CMA is a new optimization method that overcomes adaptive gradient limitations with a line-search approach and proven convergence, enhancing deep learning training efficiency.
Findings
Training time reduced by up to 68%
Per-epoch efficiency increased by up to 20%
Model accuracy improved by up to 5%
Abstract
Adaptive gradient methods have been increasingly adopted by the deep learning community due to their fast convergence and reduced sensitivity to hyper-parameters. However, these methods come with limitations, such as increased memory requirements for elements like moving averages and a poorly understood convergence theory. To overcome these challenges, we introduce F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method featuring a sufficient decrease condition and a line-search procedure to ensure loss reduction per epoch, along with its deterministic proof of global convergence to a stationary point. To evaluate F-CMA, we integrate it into conventional training protocols for classification tasks involving both convolutional neural networks and vision transformer models, allowing for a direct comparison with popular optimizers. Computational tests show…
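The three ingredients named in the abstract (random reshuffling, a sufficient decrease condition, and a line-search that enforces loss reduction per epoch) can be illustrated with a minimal sketch. This is not the authors' F-CMA: the abstract does not state the exact decrease test or step-size rules, so the condition `f_new <= f_old - gamma * alpha * ||d||^2`, the backtracking factor `shrink`, the step-halving rule, and every function name below are assumptions chosen for illustration, applied to a toy least-squares problem.

```python
import random

def predict(w, x):
    """Linear model output for one sample."""
    return sum(wj * xj for wj, xj in zip(w, x))

def full_loss(w, data):
    """Mean of 0.5 * squared residuals over the whole dataset."""
    return sum(0.5 * (predict(w, x) - y) ** 2 for x, y in data) / len(data)

def batch_gradient(w, batch):
    """Gradient of the mean loss restricted to one mini-batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        residual = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += residual * xj / len(batch)
    return grad

def controlled_epoch(w, data, step, batch_size, rng,
                     gamma=1e-4, shrink=0.5, max_backtracks=20):
    """One reshuffled epoch followed by an assumed sufficient-decrease check."""
    f_old = full_loss(w, data)
    order = list(range(len(data)))
    rng.shuffle(order)  # random reshuffling: each sample is used once per epoch
    trial = list(w)
    for start in range(0, len(order), batch_size):
        batch = [data[i] for i in order[start:start + batch_size]]
        grad = batch_gradient(trial, batch)
        trial = [t - step * g for t, g in zip(trial, grad)]
    # Treat the net movement over the epoch as a search direction.
    direction = [t - wj for t, wj in zip(trial, w)]
    dir_norm_sq = sum(d * d for d in direction)
    alpha = 1.0
    for _ in range(max_backtracks):
        candidate = [wj + alpha * d for wj, d in zip(w, direction)]
        f_new = full_loss(candidate, data)
        # Assumed sufficient-decrease test (hypothetical form).
        if f_new <= f_old - gamma * alpha * dir_norm_sq:
            next_step = step if alpha == 1.0 else step * shrink
            return candidate, f_new, next_step
        alpha *= shrink  # backtrack along the epoch direction
    # No acceptable point found: keep the old iterate and reduce the step.
    return list(w), f_old, step * shrink

def train(data, dim, epochs=30, step=0.1, batch_size=2, seed=0):
    """Run several controlled epochs and record the full loss after each."""
    rng = random.Random(seed)
    w = [0.0] * dim
    losses = [full_loss(w, data)]
    for _ in range(epochs):
        w, f, step = controlled_epoch(w, data, step, batch_size, rng)
        losses.append(f)
    return w, losses

# Toy data from y = 2x + 1, with a constant feature for the intercept.
xs = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 1.5]
data = [([x, 1.0], 2.0 * x + 1.0) for x in xs]
weights, losses = train(data, dim=2)
```

Because a candidate is accepted only when it passes the decrease test, and the previous iterate is kept otherwise, the recorded full loss never increases from one epoch to the next in this sketch. The price is at least one extra full-dataset loss evaluation per epoch, which a practical implementation would need to keep cheap.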
Taxonomy
Topics: Metaheuristic Optimization Algorithms Research · Neural Networks and Applications
Methods: Linear Layer · Residual Connection · Softmax · Attention Is All You Need · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer
