Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles

Mohammed Djameleddine Belgoumri; Mohamed Reda Bouadjenek; Hakim Hacid; Imran Razzak; Sunil Aryal

arXiv:2505.19527·cs.LG·October 27, 2025

Open Access · 3 Reviews

TL;DR

The paper introduces the Rolling Ball Optimizer (RBO), a novel method that smooths the loss landscape by simulating a sphere rolling over it, improving optimization and generalization in neural network training.
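The smoothing intuition can be illustrated in one dimension with the classical rolling-ball envelope. The sketch below is not the paper's update rule; it only computes how low a ball of radius `rho` can sit above a horizontal position `x` without crossing the graph of `f`. The function names, the test function `narrow_well`, and the sampling resolution `n` are assumptions chosen for illustration.

```python
import math


def ball_bottom(f, x, rho, n=2001):
    """Height of the lowest point of a ball of radius rho that rests on the
    graph of f with its centre above horizontal position x (1D sketch).

    The centre must clear every graph point under the ball, so its height is
    max over u in [x - rho, x + rho] of f(u) + sqrt(rho^2 - (x - u)^2).
    The interval is sampled at n evenly spaced points (an approximation).
    """
    centre = -math.inf
    for i in range(n):
        u = x - rho + 2.0 * rho * i / (n - 1)
        cap = math.sqrt(max(0.0, rho * rho - (x - u) ** 2))
        centre = max(centre, f(u) + cap)
    return centre - rho


def narrow_well(x, width=0.05):
    """Hypothetical 'wrinkle': a sharp well of depth 1 on a flat plateau."""
    return -math.exp(-((x / width) ** 2))


if __name__ == "__main__":
    for rho in (0.01, 0.1, 1.0):
        depth = ball_bottom(narrow_well, 0.0, rho)
        print(f"rho={rho:<5} ball bottom above x=0: {depth:.4f}")
```

A small ball sinks almost to the bottom of the well, while a ball much wider than the well stays near the plateau, so the well is effectively invisible at that scale.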

Contribution

It proposes a new optimization algorithm that incorporates large-scale landscape information, providing a smoothing effect and better handling of complex loss geometries.

Findings

01

RBO demonstrates faster convergence compared to SGD, SAM, and Entropy-SGD.

02

RBO achieves improved training accuracy on MNIST and CIFAR datasets.

03

RBO enhances generalization performance in neural network training.

Abstract

Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions. The optimization landscape of these functions is often highly complex and textured, even fractal-like, with many spurious local minima, ill-conditioned valleys, degenerate points, and saddle points. Complicating things further is the fact that these landscape characteristics are a function of the data, meaning that noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry. This poses a difficulty for gradient-based optimization methods, which rely on local geometry to compute updates and are, therefore, vulnerable to being derailed by noisy data. In practice, this translates to a strong dependence of the optimization dynamics on the noise in the data, i.e., poor generalization performance. To remediate this problem, we propose a new…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The idea, or especially its implementation, seems novel and yet intuitive. The explanations for why it might work also seem to pass muster (learning rate and the radius phase transition).

Weaknesses

Some of the motivation in the abstract and intro feels like overselling the problem, e.g., for a while it was believed that local minima might simply not exist in neural networks; see https://arxiv.org/pdf/1910.00359. I would like to know _how_ much more computationally expensive this is; my intuition about doing the projections says "much more than SAM", which doesn't bode too well given they were trading blows in Table 1. In any case, that concern makes me want for some compute-matched experi

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Originality (concept): Replaces point‑particle dynamics with finite‑radius body dynamics; non‑locality emerges from a projection onto the graph’s offset. This is a clean, physically motivated design space distinct from SAM/Entropy‑SGD. Fig. 2 (p. 4) compellingly visualizes multi‑scale smoothing as $\rho$ increases.
- Quality (math framing): The offset‑manifold viewpoint and the weak/linear ironing results formalize the smoothing intuition; the unreachability proposition links sharpness to c

Weaknesses

1. **Metric & scaling are not specified or analyzed.** The projection minimizes Euclidean distance in $\mathbb{R}^{d+1}$ between $\tilde c_{t+1}$ and points on the graph $\{(\theta,f(\theta))\}$ (Eq. (3)), which implicitly equates horizontal parameter units and the vertical loss scale. Without a scaling parameter $\lambda$ to balance $\|\theta-\theta_e\|^2 + \lambda^2 (f(\theta)-y_e)^2$, behavior can change drastically under simple transformations (e.g., multiplying the loss by a constant) or par
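The scale dependence the reviewer describes can be seen in a toy one-dimensional case. The sketch below is an illustration under stated assumptions, not the paper's projection routine: it finds the nearest point on the graph of a hypothetical loss $f(\theta)=\theta^2$ to a query point $(\theta_e, y_e)$ by brute-force grid search, using the weighted distance $\|\theta-\theta_e\|^2 + \lambda^2 (f(\theta)-y_e)^2$ from the review. The names `project_onto_graph` and `quadratic`, the search interval, and the grid size are all assumptions.

```python
import math


def project_onto_graph(f, theta_e, y_e, lam=1.0, lo=-3.0, hi=3.0, n=60001):
    """Return the theta on a uniform grid over [lo, hi] that minimises
    (theta - theta_e)^2 + lam^2 * (f(theta) - y_e)^2.

    Brute-force grid search, used only to make the metric's effect visible.
    """
    best_theta, best_dist = lo, math.inf
    for i in range(n):
        theta = lo + (hi - lo) * i / (n - 1)
        dist = (theta - theta_e) ** 2 + (lam * (f(theta) - y_e)) ** 2
        if dist < best_dist:
            best_theta, best_dist = theta, dist
    return best_theta


def quadratic(theta):
    """Hypothetical toy loss."""
    return theta * theta


if __name__ == "__main__":
    # Same query point, three vertical scalings of the loss axis.
    for lam in (0.1, 1.0, 10.0):
        theta_star = project_onto_graph(quadratic, theta_e=2.0, y_e=0.0, lam=lam)
        print(f"lambda={lam:<5} projected theta = {theta_star:.4f}")
```

With a small `lam` the projection stays close to `theta_e`; with a large `lam` it is pulled toward points whose loss value matches `y_e`. Since rescaling the loss (and `y_e` with it) by a constant has the same effect as changing `lam`, the projected point is not invariant to the loss scale in this toy setting.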

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- This paper proposes a novel optimization method built on the intuitive idea of a "rolling ball", which can be beneficial not only for optimization but also for generalization.

Weaknesses

- The crucial weakness is the experiments.
  - ResNet-6 and VGG-9 are too small. It would be much better if the method scaled to larger neural networks (e.g., WRN-28-10).
  - The accuracy reported in Table 1 is very far from the state of the art. It doesn't need to be state of the art, but CIFAR-10 performance should at least be around or above 90% (SAM achieved >97% according to the SAM paper). It's unclear whether the hyperparameters of SAM are well-tuned for the small neu

Taxonomy

Topics · Robotic Locomotion and Control