Aligning Distributionally Robust Optimization with Practical Deep Learning Needs

Dmitrii Feoktistov; Igor Ignashin; Andrey Veprikov; Nikita Borovko; Alexander Bogdanov; Savelii Chezhegov; Aleksandr Beznosikov

arXiv:2508.16734·cs.LG·September 26, 2025

3 Reviews

TL;DR

This paper introduces ALSO, an adaptive optimizer that aligns distributionally robust optimization with practical deep learning needs, allowing group-based weight assignment and demonstrating superior performance across various tasks.

Contribution

The paper proposes ALSO, a novel adaptive optimizer for a modified DRO objective that supports group-based weighting and proves its convergence for non-convex deep learning models.

Findings

01

ALSO outperforms traditional optimizers in diverse DL tasks.

02

The optimizer effectively handles group-based weight assignment.

03

Convergence is proven for non-convex objectives.

Abstract

While traditional Deep Learning (DL) optimization methods treat all training samples equally, Distributionally Robust Optimization (DRO) adaptively assigns importance weights to different samples. However, a significant gap exists between DRO and current DL practices. Modern DL optimizers require adaptivity and the ability to handle stochastic gradients, as these methods demonstrate superior performance. Additionally, for practical applications, a method should allow weight assignment not only to individual samples, but also to groups of objects (for example, all samples of the same class). This paper aims to bridge this gap by introducing ALSO (Adaptive Loss Scaling Optimizer), an adaptive algorithm for a modified DRO objective that can handle weight assignment to sample groups. We prove the convergence of our proposed algorithm for non-convex…
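To make the idea of group-level weighting concrete, the following is a minimal sketch of a loss that averages per-sample losses within each group and then combines the group means with a weight vector. It is an illustration only: the function name `group_weighted_loss`, the example groups, and the choice of per-group means are assumptions, not the objective defined in the paper.

```python
from collections import defaultdict


def group_weighted_loss(sample_losses, group_ids, group_weights):
    """Combine per-sample losses into one scalar using per-group weights.

    Hypothetical helper: averages the losses inside each group, then takes
    a weighted sum of the group means. Groups absent from the batch simply
    contribute nothing (weights are not renormalized here).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, group in zip(sample_losses, group_ids):
        sums[group] += loss
        counts[group] += 1
    group_means = {g: sums[g] / counts[g] for g in sums}
    total = sum(group_weights[g] * mean for g, mean in group_means.items())
    return total, group_means


# Toy batch: two classes used as groups (assumed example data).
losses = [0.2, 0.4, 1.0, 3.0]
groups = ["cat", "cat", "dog", "dog"]
weights = {"cat": 0.25, "dog": 0.75}  # more weight on the harder group
total, means = group_weighted_loss(losses, groups, weights)
```

With uniform weights this reduces to a class-balanced average; a DRO-style method would instead adapt `weights` during training.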

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 6·Confidence 2

Strengths

- Well-motivated bridge from theory to practice. The problem framing correctly diagnoses the friction between existing DRO methods and DL practice (non-convexity, Adam-style training, batching, grouping).
- Simple, drop-in algorithm with practical details. Algorithm 1 is easy to implement; the $\pi$ update (Option I) is a one-line mirror step (softmax on a shifted log-$\pi$), and the method slots into standard mini-batch training.
- Broad, convincing empirical coverage. Five diverse heterogen
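The "softmax on a shifted log-$\pi$" description matches the standard entropic mirror-ascent (exponentiated-gradient) step on the probability simplex. A minimal sketch of that generic step follows. It is not a reproduction of Algorithm 1: the function name, the step size, and the absence of any regularization term are assumptions, and the paper's Option I may differ in these details.

```python
import math


def mirror_ascent_step(pi, group_losses, step_size):
    """One exponentiated-gradient ascent step on the simplex.

    Computes softmax(log(pi) + step_size * group_losses), which raises the
    weight of groups with larger loss. Assumes every entry of pi is > 0.
    """
    logits = [math.log(p) + step_size * loss
              for p, loss in zip(pi, group_losses)]
    shift = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(z - shift) for z in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Three groups starting from uniform weights (assumed toy values).
pi = [1 / 3, 1 / 3, 1 / 3]
losses = [0.2, 1.5, 0.7]
new_pi = mirror_ascent_step(pi, losses, step_size=0.5)
```

After the step the weights still sum to one, and the group with the largest loss receives the largest weight; a step size of zero leaves `pi` unchanged.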

Weaknesses

- Mismatch between DRO objective and evaluation metrics. Across settings, evaluation often reports mean metrics (e.g., average accuracy over attacks in Section 5.3/Figure 2; overall F1 in Section 5.1/Figure 1), while DRO is about worst-case (or tail) risk.
- Assumptions 4.1–4.3 (L-smoothness, Lipschitzness, unbiased variance-bounded oracles) and Theorem 4.5 adopt $\beta_2$ and batch size scalings tied to $\epsilon$ (e.g., $\beta_2=1-\epsilon^2$). It’s not clear how these map to default hyperpa
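As a rough illustration of the reviewer's concern, one can invert the quoted example scaling $\beta_2 = 1-\epsilon^2$ at Adam's common default $\beta_2 = 0.999$. This is only arithmetic on the quoted relation, assuming it holds with no extra constants; the exact constants in Theorem 4.5 are not reproduced here.

$$
\beta_2 = 1 - \epsilon^2 = 0.999 \;\Longrightarrow\; \epsilon = \sqrt{1 - 0.999} = \sqrt{10^{-3}} \approx 0.0316, \qquad \epsilon = 10^{-2} \;\Longrightarrow\; \beta_2 = 1 - 10^{-4} = 0.9999 .
$$

Under this reading, tightening the target $\epsilon$ pushes $\beta_2$ closer to 1 than the usual default.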

Reviewer 02·Rating 6·Confidence 3

Strengths

- The algorithm ends up being rather simple. It is also quite intuitive from previous work. Adam can easily be replaced by other algorithms.
- The theoretical proof may not reflect the actual practice (assumptions probably are not right for deep networks) but it still demonstrates the overall viability of the approach.
- The authors present 5 sets of experiments illustrating the relevance of the approach.
- The appendix presents numerous additional ablation studies.

Weaknesses

- On several experiments the gains of the method are quite small compared to Adam.
- The theoretical analysis may not help guide the practice since the assumptions may not match the reality of the tasks.

Reviewer 03·Rating 0·Confidence 5

Strengths

1. The effectiveness of DRO has been verified on different tasks, including unbalanced data, tabular data, robust training under adversarial attacks, distributed training, and split learning.
2. The convergence results match some literature in non-adaptive minimax optimization, though I will discuss some concerns later.

Weaknesses

1. The contribution may be limited, since the problem of concern seems to have already been well studied in the literature. In particular, the following highly related works have been missed:
[1] Guo, Zhishuai, et al. "Unified convergence analysis for adaptive optimization with moving average estimator." arXiv preprint arXiv:2104.14840 (2021).
[2] Guo, Zhishuai, and Tianbao Yang. "Communication-efficient federated group distributionally robust optimization." Advances in Neural Informat
