SOREL: A Stochastic Algorithm for Spectral Risks Minimization
Yuze Ge, Rujun Jiang

TL;DR
SOREL is a novel stochastic gradient algorithm with convergence guarantees for spectral risk minimization, effectively balancing average and worst-case performance in machine learning models.
Contribution
It introduces the first stochastic gradient-based method with proven convergence for spectral risk minimization, improving upon previous methods lacking such guarantees.
Findings
Achieves near-optimal convergence rate of $\tilde{O}(1/\sqrt{\epsilon})$
Outperforms existing algorithms in runtime and sample complexity
Demonstrates effectiveness on real datasets
Abstract
The spectral risk has wide applications in machine learning, especially in real-world decision-making, where people are not only concerned with models' average performance. By assigning different weights to the losses of different sample points, rather than the same weights as in the empirical risk, it allows the model's performance to lie between the average performance and the worst-case performance. In this paper, we propose SOREL, the first stochastic gradient-based algorithm with convergence guarantees for spectral risk minimization. Previous algorithms often consider adding a strongly concave function to smooth the spectral risk, thus lacking convergence guarantees for the original spectral risk. We theoretically prove that our algorithm achieves a near-optimal rate of $\tilde{O}(1/\sqrt{\epsilon})$ in terms of $\epsilon$. Experiments on real datasets show that our…
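To make the weighting idea in the abstract concrete, here is a minimal sketch of how a spectral risk can be evaluated. It assumes the standard definition, in which nondecreasing weights that sum to one are applied to the losses sorted in ascending order. The function names `spectral_risk` and `top_k_weights` are illustrative and are not taken from the paper; this is not the SOREL algorithm itself, only the objective it targets.

```python
import numpy as np


def spectral_risk(losses, sigma):
    """Weighted sum of sorted losses: sum_i sigma[i] * losses_sorted[i].

    Assumption (standard definition, not quoted from the paper): sigma is
    nonnegative, nondecreasing, and sums to one, so larger losses never
    receive smaller weights.
    """
    losses = np.asarray(losses, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    if losses.shape != sigma.shape:
        raise ValueError("losses and sigma must have the same shape")
    if np.any(sigma < 0) or not np.isclose(sigma.sum(), 1.0):
        raise ValueError("sigma must be nonnegative and sum to one")
    if np.any(np.diff(sigma) < 0):
        raise ValueError("sigma must be nondecreasing")
    # Sort ascending so the largest weights pair with the largest losses.
    return float(np.dot(sigma, np.sort(losses)))


def top_k_weights(n, k):
    """Hypothetical helper: uniform weight on the k largest losses only."""
    sigma = np.zeros(n)
    sigma[n - k:] = 1.0 / k
    return sigma


losses = [3.0, 1.0, 2.0, 4.0]
n = len(losses)

average = spectral_risk(losses, np.full(n, 1.0 / n))   # empirical risk
worst = spectral_risk(losses, top_k_weights(n, 1))     # worst-case loss
top_half = spectral_risk(losses, top_k_weights(n, 2))  # mean of 2 largest
```

Uniform weights recover the empirical risk, and putting all weight on the last position recovers the worst-case loss. Intermediate choices such as `top_k_weights(n, 2)` fall between the two, which is the trade-off the abstract describes.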
Peer Reviews
Decision: ICLR 2025 Poster
This paper is well-written and easy to follow. The claimed contribution is interesting to the community. However, I have a major concern regarding its correctness which I elaborate on in the next section.
I didn't read the proofs closely but the main theorem seems to be incorrect by a sanity check. To be more specific, $\mu$ is in unit loss/par^2 and $\delta_k$ is in unit loss, but $\delta_k \sim \mu$ in Theorem 1. Moreover, $G$ is in unit loss/par and $\epsilon$ is in unit par^2, so $G/(\mu \sqrt{\epsilon})$ is unitless. However, there is $\log{(G/(\mu^2 \sqrt{\epsilon}))}$ in the sample complexity given in Cor. 1 which makes it not unitless. Other comments: 1. I do not see error bars (standard
This paper sets out an interesting problem that has some (at least distant) applications to relevant areas in ML including risk sensitivity, fairness, and generalization OOD. The proposed solution of reducing to a minimax game and then running a primal-dual algorithm is interesting and the empirical validation uses a reasonable diversity of datasets and loss functions that suffices to convince me that the proposed algorithm is superior to alternatives, at least when considering ridge regression
There are several main weaknesses with the paper: 1. With respect to the theory, I think more discussion of the strongly convex regularizer is necessary. I understand that it is necessary in order to ensure identifiability, but I find the discussion under Corollary 1 confusing. For example, in the comparison to related work, the authors cite several works that set the regularization term to be $O(\epsilon)$ in order to be $\epsilon$-suboptimal with respect to the unregularized solution and no
- The paper points out an important fact about the analysis of algorithms for "smoothed" spectral risk measures, which is that when the smoothing parameter $\nu = O(\epsilon)$, it should be included in the complexity guarantee, making "linearly convergent" methods sublinear for the original non-smooth problem. - By using the experimental benchmark of [Mehta et al. (2024)](https://arxiv.org/pdf/2310.13863), the authors are able to perform head-to-head comparisons to recent algorithms designed fo
Upon viewing Corollary 1, we see that the complexity contains the term $\log(\frac{\sqrt{n}G}{\mu^2\sqrt{\epsilon}})$, i.e. a logarithm is taken for a quantity that is *not* unitless. This does not pass a basic sanity check for convergence analyses in optimization theory: that the complexity does not depend on the units of the input. The error comes from the fact that the precision parameter $\delta_k$ in the inner loop is dependent on the strong convexity parameter $\mu$ (units of loss/inputs$^
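The unit check raised in the reviews can be written out explicitly. The sketch below uses only the units stated there ($\mu$ in loss/par$^2$, $G$ in loss/par, $\epsilon$ in par$^2$) and assumes that $n$ is a dimensionless count; square brackets denote "units of".

```latex
\begin{align*}
\left[\frac{G}{\mu\sqrt{\epsilon}}\right]
  &= \frac{\text{loss}/\text{par}}{(\text{loss}/\text{par}^2)\cdot\text{par}}
   = 1, \\
\left[\frac{\sqrt{n}\,G}{\mu^2\sqrt{\epsilon}}\right]
  &= \frac{\text{loss}/\text{par}}{(\text{loss}^2/\text{par}^4)\cdot\text{par}}
   = \frac{\text{par}^2}{\text{loss}}.
\end{align*}
```

The first ratio is unitless, while the argument of the logarithm quoted from Corollary 1 is not. Under these unit assignments, rescaling the loss by a constant $c$ would multiply that argument by $1/c$.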
Taxonomy
Topics: Fault Detection and Control Systems · Neural Networks and Applications
