Any-stepsize Gradient Descent for Separable Data under Fenchel-Young Losses
Han Bao, Shinsaku Sakaue, Yuki Takezawa

TL;DR
This paper investigates the convergence of gradient descent with arbitrary stepsizes on separable data using Fenchel-Young losses, revealing the importance of separation margin over self-bounding properties for convergence rates.
Contribution
It extends understanding of GD convergence beyond self-bounding losses by establishing arbitrary-stepsize convergence for Fenchel-Young losses, highlighting the role of separation margin.
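For reference, the standard definition of a Fenchel-Young loss generated by a convex regularizer Ω can be written as below. The notation (θ for the score vector, y for the target, Ω* for the convex conjugate) is an assumption taken from the general Fenchel-Young literature and may differ from the symbols used in the paper itself.

```latex
L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle,
\qquad
\Omega^*(\theta) = \sup_{\mu \in \mathrm{dom}\,\Omega} \; \langle \theta, \mu \rangle - \Omega(\mu).
```

Taking Ω to be the negative Shannon entropy on the probability simplex gives Ω*(θ) = log Σ_j exp(θ_j), which recovers the logistic (cross-entropy) loss. Other entropies, such as Tsallis or Rényi, generate other members of the family.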
Findings
Tsallis entropy achieves a GD convergence rate of T = Ω(ε^{-1/2})
Rényi entropy achieves the far better rate of T = Ω(ε^{-1/3})
Separation margin, not self-bounding property, influences convergence rates
Abstract
Gradient descent (GD) has been one of the most common optimizers in machine learning. In particular, the loss landscape of a neural network is typically sharpened during the initial phase of training, making the training dynamics hover on the edge of stability. This is beyond our standard understanding of GD convergence in the stable regime, where an arbitrarily chosen stepsize is sufficiently smaller than the edge of stability. Recently, Wu et al. (COLT 2024) have shown that GD converges with arbitrary stepsize under linearly separable logistic regression. Although their analysis hinges on the self-bounding property of the logistic loss, which seems to be a cornerstone for establishing a modified descent lemma, our pilot study shows that other loss functions without the self-bounding property can make GD converge with arbitrary stepsize. To further understand what property of a loss…
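The setting described in the abstract, plain GD on linearly separable logistic regression with a freely chosen stepsize, can be explored with a minimal sketch like the one below. It is not the authors' code: the function names (`logistic_loss`, `logistic_grad`, `run_gd`), the toy dataset, and the stepsize are all hypothetical choices made for illustration.

```python
import numpy as np


def logistic_loss(w, X, y):
    """Mean logistic loss log(1 + exp(-y * <x, w>)) for labels y in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))


def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss with respect to w."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -sigmoid(-m); the tanh form avoids overflow.
    coef = -0.5 * (1.0 - np.tanh(0.5 * margins))
    return (X * (coef * y)[:, None]).mean(axis=0)


def run_gd(X, y, stepsize, num_steps):
    """Run constant-stepsize GD from w = 0 and record the loss at every step."""
    w = np.zeros(X.shape[1])
    losses = [logistic_loss(w, X, y)]
    for _ in range(num_steps):
        w = w - stepsize * logistic_grad(w, X, y)
        losses.append(logistic_loss(w, X, y))
    return w, np.array(losses)


if __name__ == "__main__":
    # Hypothetical toy data, separable by w = (1, 0) since y * x[0] > 0 for all points.
    X = np.array([[1.0, 2.0], [1.0, -1.5], [-1.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0])
    # A stepsize far above the usual 2 / smoothness threshold (assumed value).
    w, losses = run_gd(X, y, stepsize=20.0, num_steps=200)
    for t in (0, 1, 2, 5, 10, 50, 200):
        print(f"step {t:3d}  loss {losses[t]:.6f}")
```

Printing the recorded losses for several stepsizes lets one check whether the early iterates are non-monotone before the loss settles into a decreasing phase. Swapping `logistic_loss` and `logistic_grad` for another loss would be the natural way to probe the paper's question about losses without the self-bounding property.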
Taxonomy
Topics: Statistical Methods and Inference · Reservoir Engineering and Simulation Methods · Stochastic Gradient Optimization Techniques
