Efficient Continual Finite-Sum Minimization

Ioannis Mavrothalassitis; Stratis Skoulakis; Leello Tadesse Dadi; Volkan Cevher

arXiv:2406.04731 · math.OC · June 10, 2024

TL;DR

This paper introduces the continual finite-sum minimization problem and proposes an efficient first-order stochastic variance reduction method whose gradient complexity is nearly tight and improves on that of existing algorithms.

Contribution

It formulates the continual finite-sum minimization problem and develops a new algorithm with significantly improved gradient complexity bounds.

Findings

01

The proposed CSVRG method achieves $\tilde{O}(n/\epsilon^{1/3} + 1/\sqrt{\epsilon})$ gradient complexity.

02

Its gradient complexity improves on that of traditional SGD and of state-of-the-art variance reduction methods such as Katyusha.

03

The method's complexity is nearly tight, with lower bounds established for first-order methods.
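
To get a feel for the scale of the bound in the first finding, the sketch below evaluates its leading-order term $n/\epsilon^{1/3} + 1/\sqrt{\epsilon}$ with logarithmic factors and constants dropped. The comparison rates for SGD ($n/\epsilon$) and for Katyusha-style variance reduction ($n^2 \log(1/\epsilon)$) are not stated on this page; they are assumptions taken from the paper's full abstract and should be checked against it. The function names are hypothetical.

```python
import math


def csvrg_fo(n: int, eps: float) -> float:
    """Leading-order term n / eps^(1/3) + 1 / sqrt(eps); logs and constants dropped."""
    return n / eps ** (1.0 / 3.0) + 1.0 / math.sqrt(eps)


def sgd_fo(n: int, eps: float) -> float:
    """ASSUMED comparison rate n / eps for SGD (not stated on this page)."""
    return n / eps


def vr_fo(n: int, eps: float) -> float:
    """ASSUMED comparison rate n^2 * log(1 / eps) for Katyusha-style methods."""
    return n ** 2 * math.log(1.0 / eps)


# Order-of-magnitude comparison only; these are not measured gradient counts.
for n, eps in [(1000, 1e-3), (1000, 1e-6)]:
    print(f"n={n}, eps={eps:g}: "
          f"CSVRG ~ {csvrg_fo(n, eps):.3g}, "
          f"SGD ~ {sgd_fo(n, eps):.3g}, "
          f"VR ~ {vr_fo(n, eps):.3g}")
```

These figures only compare growth rates; the hidden constants and logarithmic factors can change the picture for small $n$ or loose accuracy targets.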

Abstract

Given a sequence of functions $f_{1}, \dots, f_{n}$ with $f_{i} : D \mapsto \mathbb{R}$, finite-sum minimization seeks a point $x^{\star} \in D$ minimizing $\sum_{j = 1}^{n} f_{j}(x) / n$. In this work, we propose a key twist on finite-sum minimization, dubbed continual finite-sum minimization, that asks for a sequence of points $x_{1}^{\star}, \dots, x_{n}^{\star} \in D$ such that each $x_{i}^{\star} \in D$ minimizes the prefix-sum $\sum_{j = 1}^{i} f_{j}(x) / i$. Assuming that each prefix-sum is strongly convex, we develop a first-order continual stochastic variance reduction gradient method (CSVRG) producing an $\epsilon$-optimal sequence with $\tilde{O}(n / \epsilon^{1/3} + 1/\sqrt{\epsilon})$ overall first-order oracles (FO). An FO corresponds to the computation of a single gradient $\nabla f_{j}(x)$ at a given $x \in D$ for some $j \in…
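
The problem statement and the FO cost model can be illustrated on a toy instance. The sketch below is not CSVRG; it is a naive baseline under assumptions chosen for illustration: scalar functions $f_j(x) = (x - b_j)^2 / 2$, so that each prefix-sum is strongly convex and its minimizer is the running mean of $b_1, \dots, b_i$. The baseline takes one full-gradient step per prefix and counts every call to $\nabla f_j$ as one FO. The names `make_grad_oracle` and `naive_continual` are hypothetical.

```python
from typing import Callable, List, Tuple


def make_grad_oracle(b: List[float]) -> Tuple[Callable[[int, float], float], Callable[[], int]]:
    """Return a counting gradient oracle for the toy functions f_j(x) = (x - b[j])^2 / 2."""
    calls = 0

    def grad(j: int, x: float) -> float:
        nonlocal calls
        calls += 1          # one first-order oracle (FO) per gradient evaluation
        return x - b[j]

    return grad, lambda: calls


def naive_continual(b: List[float]) -> Tuple[List[float], int]:
    """Naive baseline: one full-gradient step on each prefix-sum, warm-started."""
    grad, num_calls = make_grad_oracle(b)
    x = 0.0
    points = []
    for i in range(1, len(b) + 1):
        # Full gradient of the i-th prefix-sum costs i FOs.
        g = sum(grad(j, x) for j in range(i)) / i
        # Step size 1 is exact for this toy quadratic (its smoothness constant is 1).
        x = x - g
        points.append(x)
    return points, num_calls()


b = [2.0, 4.0, 9.0, 1.0]
points, fo = naive_continual(b)
print(points)  # prefix means: [2.0, 3.0, 5.0, 4.0]
print(fo)      # 1 + 2 + 3 + 4 = 10
```

Even in this best case, where a single step solves each prefix exactly, the baseline spends $n(n+1)/2$ FOs in total. This quadratic growth in $n$ is the cost that a method with a bound linear in $n$ is meant to avoid.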

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

A clearly written paper with sound results.

Weaknesses

I have the following concerns about the paper:
- I can’t connect Problem 2 with its motivation. For example, you say that “it is important that a model is constantly updated so as to perform equally well both on the past and the new data”, but: 1) Problem 1 achieves precisely that; 2) in Problem 2, you train *multiple* models, with later models performing well on all data, and with older models not taking into account new data at all. To conclude, I don’t see a motivation for the problem.
- …

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

The introduction of the problem is a nice conceptual contribution. The new algorithm they proposed could also have other applications. The notion of a "natural algorithm" is also a nice contribution, since it allows the analysis of algorithms and lower bounds.

Weaknesses

A weakness is the lack of intuition about their algorithm. I mostly follow the math; however, conceptually I do not know exactly why they can get an improvement in the power of epsilon. It seems that the high-level idea is only to compute a gradient update if we have not had an update for a long time. Establishing that the gradient is unbiased seems fairly straightforward: it just uses the fact that the gradient at the next step is a linear combination of the new function and the previous gradient
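
The linear-combination fact mentioned in this review can be written out explicitly. Writing $g_i(x) = \frac{1}{i} \sum_{j=1}^{i} f_j(x)$ for the $i$-th prefix-sum objective (the symbol $g_i$ is introduced here for illustration only), the identity below follows directly from the definition. It is the elementary fact alone, not the paper's gradient estimator.

```latex
\begin{align*}
\nabla g_{i+1}(x)
  &= \frac{1}{i+1} \sum_{j=1}^{i+1} \nabla f_j(x) \\
  &= \frac{1}{i+1} \Bigl( i \, \nabla g_i(x) + \nabla f_{i+1}(x) \Bigr) \\
  &= \frac{i}{i+1} \, \nabla g_i(x) + \frac{1}{i+1} \, \nabla f_{i+1}(x).
\end{align*}
```

The weights $\frac{i}{i+1}$ and $\frac{1}{i+1}$ sum to one, so an unbiased estimate of $\nabla g_i(x)$ combined with the exact $\nabla f_{i+1}(x)$ gives an unbiased estimate of $\nabla g_{i+1}(x)$ at the same point $x$.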

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

- The studied problem is well-motivated, both by the literature review on incremental learning and by empirical estimation.
- The theoretical analysis is complete: an upper bound and a lower bound are provided for this problem, and the results are compared to the state of the art in Table 1.
- The logic of this paper is easy to follow, and the assumptions and notation are presented in a clear way.
- The paper has additional experiments on the ridge regression problem.

Weaknesses

- The upper bound provided by the algorithm is not tight compared to the lower bound.
- The algorithm only works in the strongly convex case.

Minor issues:
- There is no input in Algorithm 2.
- The value of $\alpha$ needs to be in line 2 of Algorithm 1.
- There is an additional "the" in the second line of the first paragraph of Section 3.1.
- VR is never defined. Suggestion: write "variance reduction (VR)" once and use VR afterwards.

Videos

Efficient Continual Finite-Sum Minimization · SlidesLive

Taxonomy

Topics: Digital Filter Design and Implementation · Advanced Numerical Analysis Techniques · Numerical Methods and Algorithms