Distributionally Robust Optimization with Bias and Variance Reduction

Ronak Mehta; Vincent Roulet; Krishna Pillutla; Zaid Harchaoui

arXiv:2310.13863 · stat.ML · October 24, 2023 · 1 citation

TL;DR

This paper introduces Prospect, a stochastic gradient algorithm for distributionally robust optimization that converges faster and with less hyperparameter tuning, improving performance on diverse benchmarks.

Contribution

The paper proposes Prospect, a new DRO algorithm that converges linearly for smooth regularized losses, requires tuning only a single learning rate, and outperforms existing methods across tabular, vision, and language benchmarks.

Findings

01

Prospect can converge 2-3 times faster than baselines such as stochastic gradient and stochastic saddle-point methods.

02

It requires tuning only a single learning rate hyperparameter.

03

The method performs well on distribution shift and fairness benchmarks.

Abstract

We consider the distributionally robust optimization (DRO) problem with spectral risk-based uncertainty set and $f$-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and average top-$k$ loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language…
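To make the objectives named in the abstract concrete, here is a minimal sketch of an unregularized spectral risk: a weighted sum of the sorted losses, where the weights are non-negative, non-decreasing, and sum to one. Uniform weights give the ordinary average loss, and placing weight $1/k$ on the $k$ largest losses gives the average top-$k$ loss. The function names `top_k_spectrum` and `spectral_risk` are hypothetical, the $f$-divergence penalty is omitted, and this is not the authors' implementation or the Prospect algorithm itself.

```python
def top_k_spectrum(n, k):
    """Weights for the average top-k loss: 1/k on the k largest losses, 0 elsewhere."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return [0.0] * (n - k) + [1.0 / k] * k


def spectral_risk(losses, sigma):
    """Weighted sum of losses sorted in increasing order.

    sigma is assumed to be non-decreasing and to sum to 1, so larger
    losses never receive less weight than smaller ones.
    """
    if len(losses) != len(sigma):
        raise ValueError("losses and sigma must have the same length")
    if any(later < earlier for earlier, later in zip(sigma, sigma[1:])):
        raise ValueError("sigma must be non-decreasing")
    if abs(sum(sigma) - 1.0) > 1e-9:
        raise ValueError("sigma must sum to 1")
    return sum(w * l for w, l in zip(sigma, sorted(losses)))


# Illustrative per-example losses (made up for this sketch).
losses = [0.2, 1.5, 0.7, 3.0]

# Uniform weights recover the plain average loss.
erm = spectral_risk(losses, [0.25] * 4)

# Average of the 2 largest losses (1.5 and 3.0).
top2 = spectral_risk(losses, top_k_spectrum(4, 2))
```

Because the weights depend on the sort order of the losses, a gradient estimate built from a small minibatch is in general biased for the full objective, which is the kind of difficulty the abstract attributes to earlier methods.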

Peer Reviews

Decision · ICLR 2024 spotlight

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

The paper has multiple strengths in terms of originality, quality and significance. The clarity component is lacking in the ways that I will explain in the weaknesses section.

- The paper has solid contributions compared to prior work.
- The theoretical analysis is sound and well supported.
- The authors also discuss the case where the hypotheses of this work are violated and argue that their algorithm still converges in that case.
- The empirical evaluation considers 3 important problems a

Weaknesses

I enumerate below the weaknesses of this work, which to me are important to address but do not undermine the overall quality of this work. I hope the authors will be able to address them during the rebuttal.

- Presentation and clarity: Although the authors clearly attempt to make the paper as clear as possible, some key notions are never introduced. For instance, CVaR was never formally introduced. It is also unclear to me what 0.5-CVaR, 2-extremile, and 1-ESRM really mean mathematically speaki

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- It seems the strong duality of $f$-divergence DRO when considering spectral risk measure in Proposition 3 is new in the literature. Appendix B shows a range of nice properties regarding the formulation (2).
- The designed Prospect Algorithm is novel and operates by reducing the bias and variance of gradient estimators while maintaining small computational costs. Nice convergence guarantees are established in Theorem 1 provided that the regularization for $f$-divergence is lower bounded and th

Weaknesses

See Questions part.

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- **originality**: I think this work is original. This paper develops a new iterative algorithm for DRO problems with ambiguity sets connected to spectral risk measures.
- **quality**: The work is sound and justifies the algorithm with both theoretical and experimental results.
- **clarity**: The work is well presented and the numerical results are clearly explained.
- **significance**: I believe the work is significant since it expands the type of DRO problems that can be solved with iterat

Weaknesses

I feel the discussion on the connection to DRO is quite limited. The ambiguity set for equation (2) is the entire space of distributions, which is quite large and not very useful. I believe the presence of the penalty term shows that this problem can be equivalent to one with a tighter ambiguity set (maybe restricted by the divergence metric used in the objective), and it would be good if the authors could discuss this. $f$-divergences are quite a broad type of divergence, as discussed in the appendix. The 3

Videos

Distributionally Robust Optimization with Bias and Variance Reduction · SlidesLive

Taxonomy

Topics: Risk and Portfolio Optimization · Auction Theory and Applications · Stochastic Gradient Optimization Techniques

Methods: Sparse Evolutionary Training