Less is More: Convergence Benefits of Fewer Data Weight Updates over Longer Horizon

Rudrajit Das, Neel Patel, Meisam Razaviyayn, and Vahab Mirrokni

arXiv:2602.19510 · cs.LG · February 24, 2026


Open Access

TL;DR

This paper analyzes the convergence of data mixing, formulated as a bilevel optimization problem, when only a finite number of inner steps $T$ is taken per weight update. It shows that a single inner update ($T=1$) can fail even in simple cases and that the optimal number of inner steps scales logarithmically with the total update budget.

Contribution

The paper provides a rigorous theoretical analysis of the convergence behavior of data mixing with finite inner steps, establishing optimal scaling laws for the number of inner updates.

Findings

01

Using a single inner update ($T=1$) can fail in simple cases.

02

Optimal number of inner steps scales as $\Theta(\log N)$ with the total update budget $N$.

03

Theoretical results are supported by proof-of-concept experiments.
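
To make the roles of $T$ and the total update budget concrete, here is a minimal Python sketch of a bilevel data-mixing loop. It is not the paper's algorithm or experimental setup: the scalar quadratic losses, the function name `run_data_mixing`, the learning rates, the exponentiated-gradient weight update, and the choice to differentiate only through the $T$ inner steps of the current round are all assumptions made for illustration.

```python
import math


def run_data_mixing(domain_targets, val_target, total_budget, inner_steps,
                    inner_lr=0.1, outer_lr=0.5):
    """Toy bilevel loop: `inner_steps` model updates per domain-weight update.

    Hypothetical setup: domain k has loss 0.5 * (theta - domain_targets[k])**2
    and the validation loss is 0.5 * (theta - val_target)**2.
    """
    if inner_steps < 1:
        raise ValueError("inner_steps must be at least 1")
    num_domains = len(domain_targets)
    weights = [1.0 / num_domains] * num_domains   # uniform initial mixture
    theta = 0.0                                   # scalar model parameter
    # A budget of N model updates allows N // T weight updates.
    outer_rounds = total_budget // inner_steps

    for _ in range(outer_rounds):
        # Sensitivity of theta to each weight, accumulated over this round only
        # (the round's starting theta is treated as a constant).
        dtheta_dw = [0.0] * num_domains
        for _ in range(inner_steps):
            weight_sum = sum(weights)
            grad = sum(w * (theta - a) for w, a in zip(weights, domain_targets))
            # Differentiate the step theta - inner_lr * grad with respect to w_j.
            dtheta_dw = [d - inner_lr * ((theta - a) + weight_sum * d)
                         for d, a in zip(dtheta_dw, domain_targets)]
            theta -= inner_lr * grad
        # Outer step: exponentiated-gradient update keeps weights on the simplex.
        val_grad = theta - val_target
        scaled = [w * math.exp(-outer_lr * val_grad * d)
                  for w, d in zip(weights, dtheta_dw)]
        total = sum(scaled)
        weights = [s / total for s in scaled]

    val_loss = 0.5 * (theta - val_target) ** 2
    return weights, theta, val_loss


if __name__ == "__main__":
    # Same total budget, different numbers of inner steps per weight update.
    for steps in (1, 5, 20):
        w, th, loss = run_data_mixing([0.0, 1.0], 1.0,
                                      total_budget=200, inner_steps=steps)
        print(steps, [round(x, 3) for x in w], round(th, 3), round(loss, 5))
```

Varying `inner_steps` at a fixed `total_budget` exposes the trade-off the findings describe: larger $T$ gives each weight update a better-informed gradient but leaves fewer weight updates overall. The toy problem is only meant to show the loop structure, not to reproduce the paper's results.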

Abstract

Data mixing, the strategic reweighting of training domains, is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T = 1$ can…
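
The bilevel structure described in the abstract can be written out as follows. The notation is an assumption chosen for illustration and may differ from the paper's: $K$ training domains with losses $L_k$, mixing weights $w$ on the probability simplex $\Delta^{K}$, validation loss $L_{\mathrm{val}}$, and inner step size $\eta$.

$$
\min_{w \in \Delta^{K}} \; L_{\mathrm{val}}\bigl(\theta^{*}(w)\bigr)
\quad \text{subject to} \quad
\theta^{*}(w) \in \arg\min_{\theta} \; \sum_{k=1}^{K} w_k \, L_k(\theta).
$$

With a finite number of inner steps, $\theta^{*}(w)$ is replaced by the iterate $\theta_T$ obtained from $T$ gradient steps on the weighted training loss,

$$
\theta_{t+1} = \theta_t - \eta \sum_{k=1}^{K} w_k \, \nabla L_k(\theta_t), \qquad t = 0, 1, \dots, T-1,
$$

after which the weights $w$ are updated. The case $T = 1$ corresponds to the "greedy" approach mentioned above.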


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Advanced Bandit Algorithms Research