SOREL: A Stochastic Algorithm for Spectral Risks Minimization

Yuze Ge; Rujun Jiang

arXiv:2407.14618·math.OC·July 23, 2024

Open Access 3 Reviews

TL;DR

SOREL is a novel stochastic gradient algorithm with convergence guarantees for spectral risk minimization, effectively balancing average and worst-case performance in machine learning models.

Contribution

It introduces the first stochastic gradient-based method with proven convergence for spectral risk minimization, improving upon previous methods lacking such guarantees.

Findings

01

Achieves a near-optimal convergence rate of $\tilde{O}(1/\sqrt{\epsilon})$

02

Outperforms existing algorithms in runtime and sample complexity

03

Demonstrates effectiveness on real datasets

Abstract

The spectral risk has wide applications in machine learning, especially in real-world decision-making, where people are not only concerned with models' average performance. By assigning different weights to the losses of different sample points, rather than the same weights as in the empirical risk, it allows the model's performance to lie between the average performance and the worst-case performance. In this paper, we propose SOREL, the first stochastic gradient-based algorithm with convergence guarantees for spectral risk minimization. Previous algorithms often consider adding a strongly concave function to smooth the spectral risk, thus lacking convergence guarantees for the original spectral risk. We theoretically prove that our algorithm achieves a near-optimal rate of $\tilde{O}(1/\sqrt{\epsilon})$ in terms of $\epsilon$. Experiments on real datasets show that our…
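The weighting idea in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of spectral risk as a weighted sum of the sorted losses, with nonnegative, nondecreasing weights that sum to one. The function name `spectral_risk`, the sample losses, and the three weight vectors are illustrative and are not taken from the paper.

```python
import math


def spectral_risk(losses, sigma):
    """Weighted sum of sorted losses: sum_i sigma[i] * loss_(i).

    Assumes sigma is nonnegative, nondecreasing, and sums to one, so that
    larger losses never receive smaller weights.
    """
    if len(losses) != len(sigma):
        raise ValueError("losses and sigma must have the same length")
    if any(s < 0 for s in sigma):
        raise ValueError("weights must be nonnegative")
    if any(b < a for a, b in zip(sigma, sigma[1:])):
        raise ValueError("weights must be nondecreasing")
    if not math.isclose(sum(sigma), 1.0):
        raise ValueError("weights must sum to one")
    # Sort losses in ascending order so the largest weight meets the largest loss.
    return sum(s * l for s, l in zip(sigma, sorted(losses)))


losses = [0.2, 1.5, 0.7, 3.0]
n = len(losses)

uniform = [1.0 / n] * n            # recovers the empirical (average) risk
worst_case = [0.0] * (n - 1) + [1.0]  # recovers the maximum loss
tail = [0.0, 0.0, 0.5, 0.5]        # averages the two largest losses

mean_risk = spectral_risk(losses, uniform)
max_risk = spectral_risk(losses, worst_case)
tail_risk = spectral_risk(losses, tail)
```

With uniform weights the value equals the plain average of the losses, and with all weight on the last position it equals the largest loss. Any other admissible weight vector, such as `tail` here, gives a value between those two.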

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

This paper is well-written and easy to follow. The claimed contribution is interesting to the community. However, I have a major concern regarding its correctness which I elaborate on in the next section.

Weaknesses

I didn't read the proofs closely but the main theorem seems to be incorrect by a sanity check. To be more specific, $\mu$ is in unit loss/par^2 and $\delta_k$ is in unit loss, but $\delta_k \sim \mu$ in Theorem 1. Moreover, $G$ is in unit loss/par and $\epsilon$ is in unit par^2, so $G/(\mu \sqrt{\epsilon})$ is unitless. However, there is $\log{(G/(\mu^2 \sqrt{\epsilon}))}$ in the sample complexity given in Cor. 1 which makes it not unitless. Other comments: 1. I do not see error bars (standard
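The reviewer's unit check can be reproduced mechanically. The sketch below tracks exponents of two base units, "loss" and "par", using the unit assignments stated in the review ($\mu$ in loss/par^2, $G$ in loss/par, $\epsilon$ in par^2). The helper names (`unit`, `mul`, `div`, `power`, `is_unitless`) are hypothetical and only serve this illustration.

```python
from fractions import Fraction


def unit(loss=0, par=0):
    """Represent a unit as exponents of the base units 'loss' and 'par'."""
    return {"loss": Fraction(loss), "par": Fraction(par)}


def mul(a, b):
    # Multiplying quantities adds their unit exponents.
    return {k: a[k] + b[k] for k in a}


def div(a, b):
    # Dividing quantities subtracts their unit exponents.
    return {k: a[k] - b[k] for k in a}


def power(a, p):
    # Raising a quantity to a power scales its unit exponents.
    return {k: a[k] * Fraction(p) for k in a}


def is_unitless(a):
    return all(v == 0 for v in a.values())


# Unit assignments as stated by the reviewer.
MU = unit(loss=1, par=-2)   # strong convexity parameter: loss / par^2
G = unit(loss=1, par=-1)    # Lipschitz-type constant: loss / par
EPS = unit(par=2)           # target accuracy: par^2

sqrt_eps = power(EPS, Fraction(1, 2))

# G / (mu * sqrt(eps)): the ratio the reviewer identifies as unitless.
ratio_mu = div(G, mul(MU, sqrt_eps))

# G / (mu^2 * sqrt(eps)): the argument of the logarithm in Cor. 1.
ratio_mu_squared = div(G, mul(power(MU, 2), sqrt_eps))
```

Under these assignments `ratio_mu` has zero exponents for both base units, while `ratio_mu_squared` carries units of par^2/loss, which matches the reviewer's observation that the logarithm's argument is not unitless.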

Reviewer 02 · Rating 6 · Confidence 3

Strengths

This paper sets out an interesting problem that has some (at least distant) applications to relevant areas in ML including risk sensitivity, fairness, and generalization OOD. The proposed solution of reducing to a minimax game and then running a primal-dual algorithm is interesting and the empirical validation uses a reasonable diversity of datasets and loss functions that suffices to convince me that the proposed algorithm is superior to alternatives, at least when considering ridge regression

Weaknesses

There are several main weaknesses with the paper: 1. With respect to the theory, I think more discussion of the strongly convex regularizer is necessary. I understand that it is necessary in order to ensure identifiability, but I find the discussion under Corollary 1 confusing. For example, in the comparison to related work, the authors cite several works that set the regularization term to be $O(\epsilon)$ in order to be $\epsilon$-suboptimal with respect to the unregularized solution and no

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- The paper points out an important fact about the analysis of algorithms for "smoothed" spectral risk measures, which is that when the smoothing parameter $\nu = O(\epsilon)$, it should be included in the complexity guarantee, making "linearly convergent" methods sublinear for the original non-smooth problem.
- By using the experimental benchmark of [Mehta et al. (2024)](https://arxiv.org/pdf/2310.13863), the authors are able to perform head-to-head comparisons to recent algorithms designed fo

Weaknesses

Upon viewing Corollary 1, we see that the complexity contains the term $\log(\frac{\sqrt{n}G}{\mu^2\sqrt{\epsilon}})$, i.e. a logarithm is taken for a quantity that is *not* unitless. This does not pass a basic sanity check for convergence analyses in optimization theory: that the complexity does not depend on the units of the input. The error comes from the fact that the precision parameter $\delta_k$ in the inner loop is dependent on the strong convexity parameter $\mu$ (units of loss/inputs$^

Taxonomy

Topics: Fault Detection and Control Systems · Neural Networks and Applications