Online Convex Optimization with Perturbed Constraints
Víctor Valls, George Iosifidis, Douglas J. Leith, Leandros Tassiulas

Víctor Valls
Department of Electrical Engineering and Institute for Network Science, Yale University
George Iosifidis
School of Computer Science and Statistics, Trinity College Dublin
Douglas J. Leith
School of Computer Science and Statistics, Trinity College Dublin
Leandros Tassiulas
Department of Electrical Engineering and Institute for Network Science, Yale University
Abstract
This paper addresses Online Convex Optimization (OCO) problems where the constraints have additive perturbations that (i) vary over time and (ii) are not known at the time of making a decision. Perturbations may not be i.i.d. generated and can be used to model a time-varying budget or commodity in resource allocation problems. The problem is to design a policy that obtains sublinear regret while ensuring that the constraints are satisfied on average. To solve this problem, we present a primal-dual proximal gradient algorithm that has $O(T^{\epsilon} \vee T^{1-\epsilon})$ regret and $O(T^{\epsilon})$ constraint violation, where $\epsilon \in [0,1)$ is a parameter in the learning rate. Our results match the bounds of previous work on OCO with time-varying constraints when $\epsilon = 1/2$; however, we (i) define the regret using a time-varying set of best fixed decisions; (ii) can balance between regret and constraint violation; and (iii) use an adaptive learning rate that allows us to run the algorithm for any time horizon.
1 Introduction
The Online Convex Optimization (OCO) framework was introduced in [Zin03] and is widely used to model applications such as spam filtering, portfolio selection, and recommendation systems, among many others [Haz16]. In short, OCO consists of a sequence of games where in each round $t$ an agent selects an action $x_t$ from a convex set $X$ and suffers a cost $f_t(x_t)$, where $f_t$ is convex. Crucially, the cost function $f_t$ is not known at the time of making a decision, and it may even be selected by an adversary after the action has been played. The goal is to design a policy or algorithm that selects a sequence of actions $x_t$, $t = 1, \dots, T$, from $X$ so that the regret
$$R_T := \sum_{t=1}^{T} f_t(x_t) - \min_{x \in X} \sum_{t=1}^{T} f_t(x) \tag{1}$$
increases sublinearly, i.e., $R_T = o(T)$. Hence, the incurred cost is asymptotically as good as that of the best fixed decision in hindsight.
Footnote 1: The regret captures the difference between the incurred cost and the cost obtained by an “offline” algorithm that has knowledge of all the cost functions $f_t$, $t = 1, \dots, T$. The decision of the offline algorithm is, however, more restricted, as it can only choose one vector from $X$.
[Zin03] showed that the online gradient descent (OGD) algorithm can obtain sublinear regret when the action set $X$ is bounded and the convex cost functions $f_t$, $t = 1, \dots, T$, have bounded subgradients. The algorithm consists of the update
$$x_{t+1} = \Pi_X\big(x_t - \rho\, f_t'(x_t)\big) \tag{2}$$
where $\rho > 0$ is the learning rate (or step size), $f_t'(x_t)$ a subgradient of the previous cost function at $x_t$, and $\Pi_X$ the Euclidean projection onto the convex set $X$. Note from Eq. (2) that action $x_{t+1}$ is selected using only information available at time $t$.
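As a minimal sketch of the OGD update described above, the following assumes that the action set is a Euclidean ball (so the projection has a closed form) and that the costs are linear. The helper names `project_ball` and `ogd`, the constant step size, and the example data are illustrative assumptions and do not come from the paper.

```python
import math


def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def ogd(subgradients, x0, step, radius=1.0):
    """Online gradient descent with a constant step size.

    subgradients[t](x) returns a subgradient of the t-th cost at x.
    Returns the list of actions that were played.
    """
    x = project_ball(x0, radius)
    actions = []
    for grad in subgradients:
        actions.append(x)      # the action is played before the cost is revealed
        g = grad(x)            # subgradient of the revealed cost at the played action
        x = project_ball([xi - step * gi for xi, gi in zip(x, g)], radius)
    return actions


# Linear costs f_t(x) = <c_t, x> with a constant cost vector (hypothetical data).
costs = [[1.0, 0.0]] * 50
actions = ogd([lambda x, c=c: c for c in costs], x0=[0.0, 0.0], step=0.1)
```

With this data the iterates move along the first coordinate until they reach the boundary of the ball and then stay there, which is the minimizer of the (fixed) linear cost over the ball.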
1.1 OCO with long-term (LT) constraints
In the standard OCO setting, the set $X$ encompasses all the constraints that an online policy and a fixed decision in hindsight must satisfy. However, sometimes it is useful to treat constraints differently depending on whether they are (i) instantaneous constraints that have to be satisfied in every iteration, or (ii) long-term constraints that must be satisfied only in the long-term or average sense (the formal definition is given in Sec. 2).
The two most prominent reasons for considering long-term constraints are the following. First, more freedom of choice. By not requiring that the long-term constraints are satisfied in every iteration, it is possible to devise policies with specific properties (e.g., lower-complexity per iteration; see Sec. 1.2) and to model new classes of OCO problems. For instance, in wireless communications systems, the power needed to transmit a message is not known a priori (as it depends on the channel conditions, the behavior of other users, etc.) and the goal is to adjust the transmission power to maximize the rate while keeping the average power consumption below a predefined threshold [MTYY09]. That is, it is possible to transmit a message using more power than what it is allowed to use on average as long as the average power consumption constraint is met.
The second reason is that they allow us to handle online constraints, that is, constraints that (i) change over time and (ii) are not known by the decision maker at the time of making a decision. For example, in online network flow problems with “offline” constraints (e.g., online shortest path routing [Haz16, p. 7]), the amount of flow $x_t$ allocated to each of the links has to satisfy the equation $Ax_t = b$ in each time $t$, where $A$ is the routing matrix and $b$ a request vector that indicates the supply/demand of flow (of material, traffic, information, etc.) at each of the nodes. With online constraints, we have $b_t$ instead of $b$ (i.e., $Ax_t = b_t$) and must select $x_t$ without knowledge of $b_t$. Hence, it is not possible to design an algorithm that guarantees that $Ax_t$ is equal to $b_t$ in every iteration. An example of this type of constraint is in power allocation problems in data centers, where the number of machines that run at a given time ($x_t$) has to be decided before the real workload ($b_t$) can be observed [GIN10]. It is important to emphasize that, like the cost functions in standard OCO, the perturbation $b_t$ can be a function of the past actions. For instance, in network flow problems [GNT06], the previous resource allocation decisions ($x_1, \dots, x_{t-1}$) affect the quality of experience of the users and, therefore, how those generate resource requests $b_t$. Another example is online display advertising [AD15], where the perturbations represent a budget that varies with time and depends on the past rewards. More precisely, past actions affect the rewards and, therefore, the future advertising budget.
Footnote 2: See [Roc84, GNT06] for an introduction to modeling different types of network flow problems.
Footnote 3: See [MTYY09, Sec. 8] for a similar example with CPUs.
1.2 Contributions and related work
In this paper, we consider OCO problems where the long-term constraints have additive perturbations that (i) change over time and (ii) are not known by the decision maker at the time of making a decision. We do not require the perturbations to be i.i.d. or to have any other statistical property (see assumptions in Sec. 3.2). The problem is to design an algorithm that obtains sublinear regret and ensures that the constraints are satisfied on average. The problem addressed is important because non-i.i.d. perturbations can be used to model more accurately a time-varying budget or commodity in online resource allocation problems; for instance, when the actions that the agent made in the past affect the future constraints (i.e., the perturbations).
We solve the OCO problem with perturbed constraints using a novel primal-dual proximal gradient algorithm (Algorithm 1) that obtains $O(T^{\epsilon} \vee T^{1-\epsilon})$ regret and $O(T^{\epsilon})$ accumulated constraint violation, where $\epsilon \in [0,1)$ is a parameter in the learning rate. The best decision in hindsight is selected from a feasible set that changes over time depending on the perturbations. Our algorithm allows us to balance between regret and constraint violation by choosing $\epsilon$ accordingly. When $\epsilon = 1/2$, our bounds match the best known rates [NY17], but we can also obtain a faster violation rate than $O(\sqrt{T})$. For example, with , we have regret and constraint violation. Another key characteristic of our algorithm is that the learning rate is adaptive. Hence, we do not need to fix in advance the time the algorithm will run (which is not known in many resource allocation problems). Furthermore, adaptive learning rates are preferable to extending the horizon with the “doubling trick” [CBL06, Sec. 2.3]. Table 1 shows a summary of how our technical approach compares to previous approaches. However, we must emphasize that the problem we address in this paper is fundamentally different from the one considered in previous work; especially, [MJY12, JHA16]. Our motivation for relaxing the constraints is not the complexity of the projection onto the feasible set, but the fact that the feasible set is not known and changes over time (because of the perturbations).
Footnote 4: That is, a set where the best fixed decisions satisfy the constraints on average.
Related work: The first works on OCO with long-term constraints were motivated by the complexity of the projection step in OGD. In brief, when set $X$ is composed of general convex constraints, the projection step involves solving a convex program that can be computationally burdensome; for example, projections onto the semidefinite cone. Expensive projections are dealt with in offline convex optimization by carrying them out only in the last iteration [MYJ+12] or less often [CGP16]. However, such approaches cannot be used in OCO problems as every action played incurs an instantaneous cost. The latter was noted in [MJY12], which formalized the OCO problem with static long-term constraints and proposed two algorithms based on variational inequalities [Nem94]. First, a gradient-based algorithm that obtains $O(\sqrt{T})$ regret and $O(T^{3/4})$ constraint violation for general OCO problems; and second, a mirror-prox algorithm that obtains $O(T^{2/3})$ regret and $O(T^{2/3})$ constraint violation when the constraints are polyhedral. The paper in [JHA16] extends the work in [MJY12] by proposing an algorithm for general OCO problems with long-term constraints that can balance regret and constraint violation. In particular, the algorithm obtains $O(T^{\max\{\beta, 1-\beta\}})$ regret and $O(T^{1-\beta/2})$ constraint violation, where $\beta \in (0,1)$ is a design parameter. Furthermore, and unlike in [MJY12], the learning rate is adaptive, and so the algorithm can run for any time horizon.
Regarding online constraints, the work in [MTYY09] considers online learning problems with constraints that can vary in an arbitrary and possibly adversarial manner. The paper shows that the highest reward-in-hindsight while satisfying the constraints is not attainable in general, except for the case where the feasible set (i.e., the set from which the best fixed decision in hindsight is selected) is convex. The latter result motivated the work in [NY17] to consider OCO problems with convex time-varying constraints (not only with additive perturbations) and to propose an algorithm that obtains $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violation. However, the performance of the proposed algorithm is compared to the best fixed decision in hindsight that satisfies every time-varying constraint. This is in stark contrast to our work, where the feasible set changes over time depending on the perturbations. Furthermore, our feasible set always contains the feasible set in [NY17] (see Fig. 1 and Sec. 2), which means that we compare the cost of our algorithm with a larger set of (possibly better) fixed decisions in hindsight. Hence, the cost of the best fixed decision in hindsight in our work can be smaller than in [NY17]. Finally, [YNW17] considers online constraints that are i.i.d. generated where the feasible set is defined in expectation. The proposed algorithm obtains $O(\sqrt{T})$ regret and constraint violation in expectation, and $O(\sqrt{T} \log T)$ regret and constraint violation bounds that hold for every sample path with high probability. The algorithms in [NY17, YNW17] are based on Lyapunov techniques from stochastic network optimization [Mey08, GNT06]; the learning rate is selected based on the time the algorithm will run; and the iterations have the same complexity as the works in [MJY12, JHA16]. Hence, [NY17, YNW17] generalize and improve the bounds of previous OCO works with static long-term constraints that were motivated by the computational complexity of the projections. However, unlike [JHA16], the learning rate is not adaptive.
The rest of the paper is organized as follows. Sec. 2 presents the problem model and Sec. 3 the main technical results. In Sec. 4, we present a numerical example where we show the performance of the proposed algorithm depending on $\epsilon$, and compare it to the algorithm in [NY17]. The proof of the main result, Theorem 1, is in the supplementary material.
2 Problem Model
The standard OCO framework can be extended to encompass long-term constraints with additive perturbations as follows. Let be a convex set that contains the admissible or implementable actions, and , be a collection of convex constraints that need to be satisfied on average. Each constraint has an associated perturbation that varies with time. There is no need for the perturbations to be i.i.d., have zero mean, or any other statistical property. The only assumption we will make is that they satisfy a mild Slater condition (see Sec. 3.2). In each round , an agent must select an action without knowledge of the cost function or the perturbation .
We proceed to define the feasible set, the regret, and the constraint violation measure in our problem. To keep notation short, we let and . We define the time-varying feasible set (i.e., the set from which we select the best fixed decision in hindsight) as
[TABLE]
with , and . This is a key difference from previous work where the feasible set is fixed [MJY12] (i.e., ) or satisfies all the constraints [NY17] (i.e., ). Note that by construction we have that
[TABLE]
where is the set of all the fixed decisions that satisfy the constraints on average at time . The exact value used in the definition of will be specified in Theorem 1, and will depend on the sequence of perturbations .
To avoid confusion with the definition of the regret where the feasible set does not change, we define
[TABLE]
Each action contributes to the aggregate constraint violation
[TABLE]
where is the projection of each of the components of vector onto the non-negative orthant. Similar to the regret, we would like that grows at most sublinearly with so that . There is no requirement that for any particular or on the rate at which can grow. Finally, note that if the sum of the penalties inflicted by a constraint is non-positive (i.e., ), then that constraint does not contribute to the aggregate constraint violation .
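The aggregate constraint violation described above can be sketched as follows. The sketch assumes that the per-round penalties (constraint value plus perturbation) have already been evaluated and that the violation is measured with the Euclidean norm; the function name `constraint_violation`, the norm choice, and the numbers are assumptions made for illustration.

```python
import math


def constraint_violation(penalties):
    """Norm of the positive part of the summed penalties.

    penalties[t][j] is the penalty of constraint j in round t,
    i.e. the constraint value plus the perturbation of that round.
    """
    m = len(penalties[0])
    totals = [sum(row[j] for row in penalties) for j in range(m)]
    # Projection of each component onto the nonnegative orthant.
    clipped = [max(0.0, s) for s in totals]
    return math.sqrt(sum(v * v for v in clipped))


# Two constraints over three rounds (hypothetical numbers).
penalties = [[1.0, -2.0], [2.0, 0.5], [0.0, 0.5]]
violation = constraint_violation(penalties)
```

In this example the second constraint has a non-positive total penalty, so only the first constraint contributes to the result.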
3 Main Results
3.1 Proposed algorithm and interpretation
The main technical contribution of the paper is Algorithm 1, which allows us to solve the problem presented in Sec. 2 with $O(T^{\epsilon} \vee T^{1-\epsilon})$ regret and $O(T^{\epsilon})$ constraint violation. In short, to handle long-term constraints, we define a Lagrangian-type function
[TABLE]
where is a (sub)gradient of the cost function in the previous round, and a vector of dual variables. To streamline exposition, in the rest of the paper we will refer to simply as Lagrangian. Note that the Lagrangian is convex in , concave in , and that it depends on as the objective function and constraints change in each round. The second term of the Lagrangian can be regarded as a penalty or as an adaptive regularizer that allows us to steer the decisions towards the feasible set .
Algorithm 1 is based on a regularized primal-dual proximal gradient method, where we use the general Bregman divergence as the proximal term instead of the usual squared Euclidean distance; see, for example, [BT03]. Recall the Bregman divergence associated with function is defined as
[TABLE]
where is usually assumed to be -strongly convex function and differentiable. The primal update is equivalent to carrying out an (unconstrained) proximal gradient update with the regularization term . The regularization or penalty term is updated via the dual update (), which can be regarded as applying a standard proximal gradient ascent since for a fixed vector . Interestingly, observe that
[TABLE]
and therefore the primal update is oblivious to perturbation . Hence, the perturbation is only relevant in our algorithm in the update of the dual variables.
Footnote 5: This is due to the fact that is the dual variable of the additive perturbation on the constraints. See Sec. D in the supplementary material for more details.
Finally, observe that we use step size equal to with for both updates, so there is no need to fix in advance the time horizon the algorithm will run. Note that when , then the algorithm corresponds to using constant step size . The algorithm’s complexity depends on the structure of the constraints and the functions associated with the Bregman divergence terms in the primal and dual updates. When is linear (e.g., ) and equal to , Algorithm 1 has the same complexity as previous work on OCO with long-term constraints [MJY12, JHA16, YNW17]. In particular, the primal and dual updates can be written as and .
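To make the structure of the two updates concrete, here is a minimal sketch of one possible Euclidean instance. It assumes linear constraints of the form `A x + b_t <= 0`, a box action set, the squared Euclidean distance as proximal term, and a step size of the form `t ** (-eps)`; none of these specifics, nor the names `primal_dual_step` and `run`, are taken from Algorithm 1 itself, whose exact statement is not reproduced in this text.

```python
def primal_dual_step(x, lam, grad, A, b_next, step, lo=0.0, hi=1.0):
    """One Euclidean primal-dual step for g(x) = A x on the box [lo, hi]^n."""
    n, m = len(x), len(lam)
    # Primal: projected gradient step on <grad, x> + <lam, A x>.
    # The perturbation does not appear in this update.
    direction = [grad[i] + sum(A[j][i] * lam[j] for j in range(m)) for i in range(n)]
    x_new = [min(hi, max(lo, x[i] - step * direction[i])) for i in range(n)]
    # Dual: projected gradient ascent on <lam, A x_new + b_next>.
    slack = [sum(A[j][i] * x_new[i] for i in range(n)) + b_next[j] for j in range(m)]
    lam_new = [max(0.0, lam[j] + step * slack[j]) for j in range(m)]
    return x_new, lam_new


def run(grads, perturbations, A, eps=0.5):
    """Iterate the step above; grads[t] is the cost (sub)gradient seen in round t."""
    n, m = len(A[0]), len(A)
    x, lam = [0.0] * n, [0.0] * m
    total = [0.0] * m  # running sum of the penalties A x + b
    for t, (grad, b) in enumerate(zip(grads, perturbations), start=1):
        x, lam = primal_dual_step(x, lam, grad, A, b, step=t ** (-eps))
        total = [total[j] + sum(A[j][i] * x[i] for i in range(n)) + b[j]
                 for j in range(m)]
    return x, lam, total


# Hypothetical example: minimize x subject to x >= 0.5 on average,
# written as -x + 0.5 <= 0 with a constant perturbation of 0.5.
T = 200
x, lam, total = run([[1.0]] * T, [[0.5]] * T, A=[[-1.0]], eps=0.5)
```

When the constraint is strictly satisfied in every round (for instance `A=[[1.0]]` with perturbation `-5.0` on the unit box), the dual variable in this sketch never leaves zero and the primal step reduces to projected gradient descent.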
3.2 Assumptions
The following are the necessary assumptions to establish the convergence of the proposed algorithm.
Bounded set.
Set is bounded. There exists a constant such that .
Bounded perturbation.
for all .
Bounded subgradients.
Fix a norm and let denote its dual. There exist constants , , such that , , for all , .
Slater condition.
There exists a such that for an and all .
Bregman functions.
and are -strongly convex and , -smooth. Also, is strictly increasing.
The first assumption is standard in OCO. The second and third assumptions are also standard in OCO and ensure that the subgradients of the primal and dual updates are bounded. The Slater condition says that there is a set of actions that satisfy the constraints strictly for all $t$, and is key to ensure that the constraint violation is sublinear. Importantly, the Slater condition assumption is mild in many problems. For example, when the perturbation represents the budget available at time $t$, that budget always has to be positive, independently of whether we decide to spend more (i.e., violate the constraint). Finally, the assumption that functions and are strongly convex is also standard in the definition of the Bregman divergence. The additional assumption that and are smooth (hence, and are upper and lower bounded by a quadratic function) is to streamline exposition in the proofs. The assumption that is strictly increasing is necessary to obtain the faster rates on the constraint violation when $\epsilon < 1/2$; note that this is satisfied, for example, by the squared Euclidean distance.
Footnote 6: Technically, all we need is that is uniformly upper bounded for all . Such an assumption is also made in previous work and elsewhere to streamline exposition; see, e.g., [MJY12, Lemma 10], [DHS11].
3.3 Bounds and discussion
Theorem 1.
Consider Algorithm 1 and suppose the assumptions in Sec. 3.2 are satisfied. For any with , we have
[TABLE]
where is a constant that does not depend on and captures the diameter of the set in which the dual variables are contained. Specifically, where .
Feasible set: The parameter determines the feasible set used in the definition of the regret in Eq. (4). Observe that the condition always holds for , since then and for all (this case corresponds to as in [NY17]). If did not vary over time ( for all ), then and therefore (i.e., for all ). Similarly, if were an i.i.d. random variable with expected value , then and therefore for any . It is difficult to characterize for other cases as this does not depend only on the sequences of perturbations but also on the learning rates (through the dual variables ).
Interpretation of the regret bound: The bound has the same structure as the usual OCO bounds. For example, for the case where and are the squared Euclidean distance (i.e., ) we have . The first term is related to the size of the sets where the primal and dual variables are contained (i.e., and respectively) and is inversely proportional to the learning rate at time (i.e., ). The second term consists of the bounds on the subgradients of the cost functions () and the constraints () multiplied by . The bounds in Theorem 1 are of course useful if the constants are bounded, which is the case for , and by standard OCO assumptions (see Sec. 3.2). However, for constant we need more work. Showing that this constant exists is the main technical challenge of the paper; we will discuss this in detail later in the section.
Footnote 7: See, for example, [Zin03, Theorem 1]. Or more specifically, the regret bound in the second column on page 4, i.e., .
Footnote 8: We write instead of in Theorem 1 as the interesting range is when .
Our analysis also allows us to recover the standard OCO bound when the constraints are always satisfied. We have the following corollary to Theorem 1.
Corollary 1.
Suppose that (i.e., for all and ). The bound on the regret becomes .
That is, when the constraints are always satisfied the dual variables will always be equal to zero and therefore and . Hence, by considering perturbed constraints in the learning problem we are adding to the bound of the standard regret in Corollary 1. Such symmetry is not available in previous works [MJY12, JHA16, NY17, YNW17], and it appears in our work as Algorithm 1 can be regarded, informally, as applying OGD twice (see Lemma 3 in the supplementary material for the technical details). As a result, the constants in the usual OCO bound appear “duplicated”.
Footnote 9: The fact that follows by adding a slack variable to change the inequality constraint to equality, i.e., .
Interpretation of the constraint violation bound: The bound on the accumulated constraint violation consists of two terms. The first term is a constant related to the constraints, and the second term depends on constants and , and is divided by the learning rate at time $T$. Hence, if $\epsilon = 0$ the constraint violation is $O(1)$; however, constant constraint violation comes at the price of the regret not being sublinear. Also, observe that for any $\epsilon$ in the range $[0, 1/2)$, the constraint violation has a better rate than the regret.
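The growth rates discussed in the two interpretations can be checked with a short calculation. The following is a sketch under the assumption that the step size has the form $\rho_t = t^{-\epsilon}$ with $\epsilon \in [0,1)$; the symbol $\rho_t$ and the generic constants $C_1, \dots, C_4$ are placeholders introduced here and are not the constants of Theorem 1.

```latex
\rho_t = t^{-\epsilon}, \quad \epsilon \in [0,1):
\qquad
\frac{1}{\rho_T} = T^{\epsilon},
\qquad
\sum_{t=1}^{T} \rho_t
  \;\le\; 1 + \int_{1}^{T} s^{-\epsilon}\, ds
  \;=\; 1 + \frac{T^{1-\epsilon} - 1}{1-\epsilon}
  \;\le\; \frac{T^{1-\epsilon}}{1-\epsilon}.
```

Under this assumption, a bound of the form $C_1/\rho_T + C_2 \sum_{t=1}^{T} \rho_t$ grows as $T^{\epsilon} + T^{1-\epsilon}$, while a bound of the form $C_3 + C_4/\rho_T$ grows as $T^{\epsilon}$. The two exponents coincide at $\epsilon = 1/2$, and for $\epsilon < 1/2$ the second expression grows more slowly than the first.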
Constant : This constant is analogous to constant , which measures the maximum distance between any two vectors in the bounded set of primal variables; see Sec. 3.2. However, we cannot define in the same way as the dual variables exist in the nonnegative orthant (which is an unbounded set). Instead, we show that the difference between the vectors generated by the dual update in Algorithm 1 is bounded (not any two vectors in ). Or equivalently, that the sequence of dual variables obtained with Algorithm 1 remains bounded for all ; see Lemma 7 in the supplementary material.
To ensure that is bounded for any , we rely on the Slater condition. In brief, this condition requires that there exists an such that for some scalar , and ensures that the dual variables in Algorithm 1 are attracted to a bounded set within . And since at is bounded, the sequence of dual variables will remain bounded. The technical challenge is to characterize the diameter of the set to which these dual variables are attracted since, unlike standard optimization with a fixed objective function, in OCO the cost functions vary over time and so, indirectly, do the (bounded) sets to which the dual variables are attracted. See Proposition 2 and the discussion in Section B.2 in the supplementary material.
Footnote 10: This type of behavior is typical in dual subgradient methods. See, for example, Figure 8.2.6 in [BNO03]. This is also discussed in detail in [NO09]; see Lemma 1 in [NO09].
Finally, we note that . Observe that when and are the squared Euclidean distance, gets simplified to where . This last observation implies that .
Constrained convex optimization: Our results can also be applied to constrained optimization problems. The following corollary to Theorem 1 establishes the convergence of a constrained convex program with relaxed constraints and primal averaging.
Corollary 2.
Consider the setup of Theorem 1 where the objective function and constraints are constant (i.e., and for all ) and step size with . We have that
[TABLE]
where with and .
The result recovers the upper bound on the objective and constraint violation in Proposition 1 in [NO09] when (fixed step size), but also ensures that and converges to a vector in asymptotically as for any .
4 Numerical Example
We present a variation of the example in [YNW17, Sec. 5] where the actions made by the decision maker affect the constraints. In short, consider a geo-distributed datacenter that consists of a front-end router and clusters distributed in different geographical zones. Jobs arrive in the front-end router and must be scheduled to one of the clusters. The cost of executing jobs depends on the electricity cost of running each of the clusters—which varies across sites as each cluster buys power from its local market. The goal is to schedule jobs to clusters to minimize the total electricity cost while ensuring that all the jobs are served. Importantly, the cost of electricity is not known at the time to schedule jobs. We model the problem above as an OCO as follows. Divide time in slots of equal duration and let be the fraction that each of the clusters is utilized at time . The cost functions are assumed to be linear (i.e., with ) and capture the price of electricity. The constraints are given by , where captures the efficiency of each cluster (i.e., the number of jobs it can serve per time slot) and the jobs that arrive in the front-end router at time .
We run a simulation with clusters, , and . Hence, the jobs that arrive in a time slot depend on the cost in the previous iteration. The simulation results are shown in Fig. 2, where we evaluate the performance of our algorithm with and compare it with the algorithm in [NY17] (indicated in blue). First, observe that by selecting $\epsilon$, we can trade regret for constraint violation. In particular, with small $\epsilon$, we are sacrificing regret for lower constraint violation; in this example, lower constraint violation means that the jobs have to wait less time at the front-end router before they can be served. When $\epsilon$ is larger than , our algorithm has a surprising behavior that was not observed in previous work: the regret and constraint violation have a sine-wave form where the period of the wave increases with time. The latter is shown in Fig. 2 with , but we observe the same behavior for other values of $\epsilon$ larger than . Another interesting observation is the temporal tradeoff between the regret and constraint violation when ; observe from the figure that the peaks of the regret correspond to the lowest values of the constraint violation and vice versa. The lines in light blue show the performance of the algorithm in [NY17], which compares to our algorithm when $\epsilon = 1/2$ since then both algorithms have $O(\sqrt{T})$ regret and constraint violation. Observe from the figure that our algorithm has significantly smaller regret and constraint violation. This result matches the improvements observed in [JHA16] with respect to previous approaches that used a constant learning rate [MYJ+12]. Finally, in Fig. 2c, we show the cost in hindsight depending on whether the best fixed decision is selected from , , or with . Recall that and therefore . Observe from the figure that when , we obtain that the costs with and are exactly the same (i.e., the pink dots are exactly on top of the yellow line), whereas when , the costs do not coincide exactly (). Importantly, notice the large difference between the costs when the best fixed decisions in hindsight are selected from and . Finally, we note that the sets change over time and are affected by the actions made by the decision maker. Observe that when , the cost of the best fixed decision in hindsight is larger with than with (i.e., ); however, we have the opposite when (i.e., ).
Footnote 11: To compare the results fairly, we select the best policy in hindsight from (the time-varying feasible set in Eq. (3)) instead of the fixed policy in hindsight that satisfies each constraint individually, i.e., .
Acknowledgements
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 795244.
Appendices
The supplementary material is divided into four sections. Sec. A contains the preliminaries and background material. In Sec. B, we present the proof of Theorem 1, and in Sec. C the proofs of the lemmas and propositions in Sec. A and B. Finally, in Sec. D, we provide additional background material on Lagrange duality and explain how this connects to our approach.
Appendix A Preliminaries
A.1 Notation
We use $\mathbb{N}$, $\mathbb{R}_+$ and $\mathbb{R}^n$ to denote the set of natural numbers, nonnegative real numbers and $n$-dimensional real vectors. The symbol $\mathbf{1}$ indicates the all ones column vector; the dimension of the vector will be implied by the context. We use to indicate a subgradient in the subdifferential of at , and to indicate the ’th element in a sequence . For two vectors , we write to indicate that is element-wise larger than or equal to . The inner product between vectors is indicated with .
A.2 Bregman divergence
Let be the Bregman divergence as defined in Eq. (7). We recall the following well-known identity [CT93],
[TABLE]
which is a generalization of the quadratic identity for the Euclidean norm. It is easy to derive Eq. (8) from the definition of the Bregman divergence in Eq. (7). Observe that
[TABLE]
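As a numerical sanity check of the kind of identity recalled in this section, the sketch below uses the standard definition of the Bregman divergence, $B_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle$, and the standard three-point identity $B_\phi(x, y) + B_\phi(y, z) - B_\phi(x, z) = \langle \nabla\phi(z) - \nabla\phi(y), x - y \rangle$. The exact arrangement of terms in Eq. (8) is not recoverable from this text, so the form used here is an assumption, as are all function names and test vectors.

```python
import math


def bregman(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


def half_sq(x):
    """Half squared Euclidean norm; its Bregman divergence is 0.5 * ||x - y||^2."""
    return 0.5 * sum(v * v for v in x)


def half_sq_grad(x):
    return list(x)


def neg_entropy(x):
    """Negative entropy, defined for strictly positive vectors."""
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


def three_point_gap(phi, grad_phi, x, y, z):
    """Left side minus right side of the three-point identity."""
    lhs = (bregman(phi, grad_phi, x, y) + bregman(phi, grad_phi, y, z)
           - bregman(phi, grad_phi, x, z))
    rhs = sum((gz - gy) * (xi - yi)
              for gz, gy, xi, yi in zip(grad_phi(z), grad_phi(y), x, y))
    return lhs - rhs


# Arbitrary strictly positive test vectors (hypothetical).
x, y, z = [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]
gaps = [three_point_gap(half_sq, half_sq_grad, x, y, z),
        three_point_gap(neg_entropy, neg_entropy_grad, x, y, z)]
```

Both entries of `gaps` should be zero up to floating-point rounding, for the squared Euclidean distance as well as for the negative entropy.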
A.3 Online proximal gradient method
Let be a convex function. The following update
[TABLE]
corresponds to the standard proximal method where we have replaced the squared Euclidean distance with the Bregman distance. The following two lemmas are variations of well-known results and correspond to the proximal method and proximal gradient method (or, mirror-descent); see [Van16]. We state them to measure the progress in one iteration.
Lemma 1 (One iteration proximal method).
Consider the proximal update in Eq. (9) where is a convex function, a convex set (not necessarily bounded), and . For any , the following bound holds
[TABLE]
Obviously, since the bound holds for any , it also holds for the that minimizes . Next, we apply the result in Lemma 1 to the online setting. Let and be two convex functions from and let , where . Function can be regarded as a penalty function or regularizer, and we can ignore it if for all . We will relate with the second term of the Lagrangian in Eq. (6) in the next section.
Lemma 2 (One iteration proximal gradient method).
Consider the setup of Lemma 1 where in Eq. (9). For any , the following bound holds
[TABLE]
The key point from the last lemma is that by only using a subgradient of we can recover a bound on itself. Also, observe that if we let for all ; index the objective and the step size with (i.e., let and ); and sum from , we can follow the rationale of the proof of Theorem 1 in [Zin03] to recover the standard OCO bound (bound given in Corollary 1).
Footnote 12: Shown also in the proof of Lemma 3.
Appendix B Proof of Theorem 1
This section contains the technical results that support the claims in Sec. 3. It is divided into two parts. In Sec. B.1, we prove a bound on the regret and constraint violation and show how these depend on the boundedness of the dual variables. Sec. B.2 shows that the dual variables remain uniformly bounded with Algorithm 1, which is the main technical challenge of the paper.
B.1 Regret and constraint violation
As explained in Sec. 3, the update
[TABLE]
is a generalization of Zinkevich’s online gradient descent, and it can be regarded as a proximal gradient update (or, mirror descent) with a regularizer. We can use Lemma 2 to obtain the following result.
Lemma 3.
Consider the update in Eq. (11) and let be an arbitrary sequence of vectors from . The following bound holds
[TABLE]
From Lemma 3, we obtain the usual bound on the regret (see the discussion after Theorem 1) with the additional term due to the regularizer we have added in the update. To make the term vanish, we can apply a proximal gradient update
[TABLE]
since is concave in for a fixed . The following result is also an application of Lemma 2.
Lemma 4.
Consider the update in Eq. (12) and suppose there exists a constant such that for all . The following bound holds
[TABLE]
Combining Lemmas 3 and 4, we obtain the bound on the regret in Theorem 1. It only remains to show that constant exists and does not depend on . Before we proceed to do that, we establish the bound on the constraint violation.
Proposition 1 (Constraint violation).
Select . The updates in Eq. (11) and (12) have constraint violation
[TABLE]
where for all and .
The proof of the proposition is also based on OGD arguments. Observe from Eq. (13) that if for all (as assumed to obtain the result in Lemma 4), then we obtain the claimed bound on the constraint violation in Theorem 1. We show that constant exists in the next section.
B.2 Bounded dual variables
We take as a starting point a classic result from Lagrange duality in constrained convex optimization, which says that the set of optimal dual variables is bounded when the Slater condition holds [Uza58]. This result is important because when we solve the dual problem with an iterative method, such as the subgradient method (see Sec. 8.2 in [BNO03] for a detailed explanation of the convergence of the dual subgradient method, and also Lemmas 1 and 3 in [NO09]), we obtain a sequence of dual variables that is attracted to a bounded set. Hence, the dual variables remain bounded for all .
In our problem, since the Slater condition holds for every constraint (see Sec. 3.2), we could in principle use the same methodology as in offline constrained convex optimization by defining the time-varying Lagrange dual function
[TABLE]
However, that is not possible because each dual function depends on the previous one (through the objective ), which correlates the set of dual solutions. As a result, we cannot establish that each set , is uniformly bounded. We show this formally in the following proposition.
Proposition 2 (Bounded optimal dual variables).
For every and any , we have
[TABLE]
Proposition 2 gives an upper bound on the set of optimal dual variables for each . The bound consists of two parts: a fixed part and a variable part that depends on the previous primal point and step size . The bound is illustrated schematically in Fig. 3. Note that if we upper bound the term in Proposition 2 by a constant, then the variable term in the upper bound increases with since as (unless and so ). Hence, we cannot claim that each set of optimal dual variables is uniformly bounded and, therefore, cannot ensure that the sequence of dual variables is attracted to a bounded set, which is key to ensuring that constant exists.
To deal with the issue mentioned above, we adopt another strategy and show that the dual variables remain bounded over a span of iterations. Namely, we do not measure the behavior (or, progress) of the dual variables in one step, but over multiple steps. The intuition behind our strategy is that the variable terms in Proposition 2 “cancel out” when we consider the average set of optimal dual variables. The length of the interval we consider is proportional to the step size and given by
[TABLE]
where is the ceiling function.
We proceed to present a lemma that measures the difference of the dual variables according to the Bregman distance over a span of iterations. We present first the following preliminary lemma.
Lemma 5 (Bounded subsequences).
Let and , . The following bounds hold
[TABLE]
Lemma 6.
Consider the setup of Theorem 1. For any and , the following bound holds
[TABLE]
where is a constant that does not depend on .
There are two important observations from the last lemma.
- Observation 1: When is larger than , then the LHS of Eq. (14) is negative. Hence, is closer to the origin than with respect to the Bregman divergence “metric”. Importantly, does not depend on .
- Observation 2: For any , the maximum increment of the dual variables in iterations is . That is, for any since the second term in the RHS of Eq. (14) is nonnegative.
Using these two observations, we can establish an upper bound on the dual variables for all . We have the following lemma.
Lemma 7 (Bounded dual set).
Consider the setup of Theorem 1 and as defined in Lemma 6. For all we have
[TABLE]
Appendix C Proofs
C.1 Proof of Lemma 1
The update in Eq. (9) is equivalent to where is the indicator function. That is, if and if . From the optimality condition, we have
[TABLE]
where , and is a vector in the normal cone of at , i.e., . Hence, if we multiply the last equation across by we obtain
[TABLE]
Next, rearranging terms and using the fact that (since is convex) yields
[TABLE]
Finally, since , we can use the identity in Eq. (8) with , , to obtain , which concludes the proof.
C.2 Proof of Lemma 2
From Lemma 1, we have
[TABLE]
Add to both sides of Eq. (15) and use the fact that (since is strongly convex) to obtain
[TABLE]
where . Next, observe that
[TABLE]
(since the convex conjugate of for is ; see [RW98, pp. 475]). Finally, since , we obtain the stated result.
C.3 Proof of Lemma 3
Let in Lemma 2. Note that we use instead of as the perturbation does not affect the primal variables update (see the second paragraph in Sec. 3). We have
[TABLE]
where we have used the fact that by assumption (see Sec. 3.2). Next, let , , , in Eq. (17). Summing from and rearranging terms yields
[TABLE]
Now, select and observe that
[TABLE]
where the last equation follows since for any (by construction); and by the choice of (see Theorem 1). Hence,
[TABLE]
Finally, we can upper bound the first term in the RHS of the last equation as follows
[TABLE]
where (a) follows from the smoothness of , and (b) from the assumption that (see Sec. 3.2).
C.4 Proof of Lemma 4
Let in Lemma 2 and fix and . Summing from we have
[TABLE]
where we have used the fact that for all , by assumption; see Sec. 3.2. Next, let and rearrange terms to obtain
[TABLE]
Dropping the second term in the RHS of the last equation and using the fact that (by the smoothness of ) and for all (by assumption), we obtain
[TABLE]
which concludes the proof.
C.5 Proof of Proposition 1
From the optimality condition of the update in Eq. (12) we have
[TABLE]
where is a vector in the normal cone of at , i.e., . Rearranging terms and using the fact that (i.e., ) yields
[TABLE]
Next, let , , , and sum the last equation from to obtain
[TABLE]
where (a) follows by rearranging terms and (b) by dropping the second and third terms in the RHS of (a) since they are nonpositive (note that since is a strictly increasing function and for all ). Furthermore, observe that we can write
[TABLE]
Adding to both sides
[TABLE]
and therefore
[TABLE]
Finally, if we use the fact that since is -smooth (by assumption; see Sec. 3.2) and for all , we obtain the stated result.
C.6 Proof of Lemma 5
We start with claim (i). From the integral test, we have where
[TABLE]
Hence, we need to upper and lower bound . For the upper bound, observe that
[TABLE]
The expression on the RHS is decreasing in for a fixed , and also decreasing in for a fixed . Thus, the maximum is attained when and . Hence, .
We continue with the lower bound. Observe that
[TABLE]
For any , the minimum is attained when and equal to
[TABLE]
Hence, as claimed.
We proceed to show claim (ii). Using again the fact that , we can write
[TABLE]
As in the first case, the maximum is attained when and and therefore as claimed.
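The integral test used in this proof can be illustrated numerically. The sketch below assumes a step-size sequence of the form t^(-beta) with 0 < beta < 1 and an arbitrary index window [a, b]; both are illustrative assumptions and do not reproduce the exact sequence or interval of Lemma 5. For a decreasing function f, the sum of f(t) over t = a, ..., b lies between the integral of f over [a, b + 1] and f(a) plus the integral of f over [a, b].

```python
import math

def sum_and_integral_bounds(a, b, beta):
    """For the decreasing function f(t) = t**(-beta) with 0 < beta < 1, return
    (lower, s, upper) where s = sum_{t=a}^{b} f(t) and the integral test gives
    int_a^{b+1} f <= s <= f(a) + int_a^b f."""
    F = lambda t: t ** (1.0 - beta) / (1.0 - beta)   # antiderivative of f
    s = sum(t ** (-beta) for t in range(a, b + 1))
    lower = F(b + 1) - F(a)
    upper = a ** (-beta) + F(b) - F(a)
    return lower, s, upper

lower, s, upper = sum_and_integral_bounds(a=10, b=500, beta=0.5)
```

The two integral bounds differ by at most f(a), which is why sums of step sizes over a window can be controlled tightly by closed-form expressions.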
C.7 Proof of Lemma 6
From Lemma 2 with and we have
[TABLE]
Now observe that
[TABLE]
where (a) follows from Lemma 1 for any ; (b) since by Hölder’s inequality and by dropping ; and (c) by letting (a Slater point that satisfies all the constraints; see Sec. 3.2) and the fact that
[TABLE]
for some . Hence, by multiplying Eq. (21) across by we have
[TABLE]
Next, let , , and sum from with
[TABLE]
where in the last equation we have used the fact that for all , and that for all . Finally, by using the bounds in Lemma 5, we obtain that
[TABLE]
for any .
C.8 Proof of Lemma 7
Define set
[TABLE]
and consider the following two observations from Lemma 6.
- Observation (i). For every and , we have
[TABLE]
- Observation (ii). For every , if for all , then
[TABLE]
where (a) follows since if , and (b) because by Lemma 5. The last bound also holds if for all with , . That is,
[TABLE]
Now, consider the case where but (note that ). Fix for some and . Combining the two observations, we can write
[TABLE]
Next, observe that if we use the fact that (by the strong convexity and the smoothness of ), we can write
[TABLE]
where (a) follows since . The last equation concludes the proof since .
C.9 Proof of Proposition 2
Let in Lemma 1 with and . We have that
[TABLE]
Rearranging terms and dropping yields
[TABLE]
Next, observe that
[TABLE]
where and Eq. (25) follows by complementary slackness [BV04, Sec. 5.5.2], i.e.,
[TABLE]
Eq. (26) follows since by Hölder’s inequality and the fact that set is bounded (see Sec. 3.2). Hence, we can upper bound the first two terms in Eq. (24) and obtain
[TABLE]
Finally, let be a vector that satisfies the Slater condition (see Sec. 3.2), and note that
[TABLE]
Using the last bound; dividing across by ; and letting , we obtain the stated result.
Appendix D Lagrange Duality
This part is not directly related to the online problem in the main part of the paper; however, we think it may be useful as supporting material for readers who are not familiar with Lagrange duality. To streamline exposition, we use standard convex optimization notation (e.g., [Roc70, BV04]).
Let be a convex function, a collection of convex inequality constraints (an equality constraint can be written as two inequality constraints), and a convex set. Define
[TABLE]
where is the indicator function, i.e., if for all , and if for some . Clearly, finding the that minimizes is equivalent to finding the that minimizes such that every constraint is satisfied (i.e., the value of every constraint is less than or equal to zero).
Now, consider the perturbed function
[TABLE]
where is a vector from . Note that , and that is convex in for a fixed . Hence, we can write the convex conjugate [Roc70, Sec. 12] of with respect to the perturbation for a fixed as follows
[TABLE]
where the last equation follows since if for some , then — which will never be the case since we can always select as . Next, observe that can be written as the linear program
[TABLE]
It is easy to see that if for some , then the problem’s solution is unbounded above (i.e., we can always select a as negative as we want), and otherwise equal to . Hence,
[TABLE]
Finally, observe that if we let , we obtain
[TABLE]
which is the classic definition of the Lagrangian for . It is well-known that when the Slater condition holds [BV04, Ch. 5], then
[TABLE]
where , i.e., the solution to the “unperturbed” problem. Note as well that is indeed like ; if then is equal to , and otherwise, when , is equal to .
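A small worked instance may help fix ideas. The sketch below assumes a hypothetical one-dimensional problem, minimize x^2 subject to 1 - x <= 0, which is not from the paper. The Slater condition holds (for example at x = 2), the Lagrangian is x^2 + lam * (1 - x), and the dual function has the closed form lam - lam^2 / 4, so the dual optimum can be compared against the primal optimum directly.

```python
import numpy as np

def lagrangian(x, lam):
    """L(x, lam) = f(x) + lam * g(x) with f(x) = x**2 and g(x) = 1 - x."""
    return x ** 2 + lam * (1.0 - x)

def dual_function(lam):
    """q(lam) = min_x L(x, lam); the minimizer is x = lam / 2 (set dL/dx = 0)."""
    return lagrangian(lam / 2.0, lam)

lams = np.linspace(0.0, 5.0, 501)            # grid over nonnegative multipliers
q_vals = np.array([dual_function(l) for l in lams])
lam_star = lams[np.argmax(q_vals)]           # maximizer of the dual function on the grid
dual_opt = q_vals.max()
primal_opt = 1.0                             # x**2 over x >= 1 is minimized at x = 1
```

In this instance the dual maximum is attained at lam = 2 and equals the primal optimal value 1, so there is no duality gap, as expected under the Slater condition.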
Intuition behind how Lagrange duality fits into our problem. The perturbed constraints in the main body of the paper can be regarded as if we had the static constraint where is the average of the “perturbations” for any horizon . The issue is that the average is not known a priori and is only revealed as we keep playing actions in each round. Hence, we cannot use an approach with “hard” constraints such as minimizing . Instead, we relax the constraints and formulate the Lagrange dual problem. Specifically, we let
[TABLE]
and aim to maximize by carrying out the (sub)gradient ascent update
[TABLE]
where is a step size and a subgradient of at . Note that the dual variables are projected onto the nonnegative orthant because if for some . The crucial part is that is given by where
[TABLE]
That is, does not depend on the average .
Now observe that we can write the “noisy” version of update as follows
[TABLE]
where is a “noise” vector. In words, the update can be regarded as a “stochastic” dual subgradient ascent since for all for any horizon . However, note that as the average changes, we are changing the set of feasible solutions (or equivalently, the optimization problem itself). This corresponds to the time-varying feasible set in the main body of the paper. Recall also that since we do not add any statistical properties to the sequence of perturbations, we are restricted to comparing our solutions with the more restrictive set .
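The noisy dual update described above can be sketched on a toy instance. Everything in the example is an assumption made for illustration: the objective 0.5 * (x - 2)^2, the set [0, 2], the constraint x + delta <= 0 with perturbations delta_t alternating between -0.5 and -1.5 (average -1), the constant step size, and the function name `noisy_dual_ascent`. At each round the primal point minimizes the Lagrangian for the current multiplier, which does not require knowing the average perturbation, and the multiplier is then moved along the observed constraint value and projected onto the nonnegative reals.

```python
import numpy as np

def noisy_dual_ascent(deltas, alpha):
    """Dual subgradient ascent for: minimize 0.5 * (x - 2)**2 over x in [0, 2]
    subject to x + mean(deltas) <= 0, where only delta_t is seen at round t."""
    lam, xs, lams = 0.0, [], []
    for delta_t in deltas:
        x = float(np.clip(2.0 - lam, 0.0, 2.0))       # argmin_x f(x) + lam * x over [0, 2]
        lam = max(0.0, lam + alpha * (x + delta_t))   # ascent step, projected onto lam >= 0
        xs.append(x)
        lams.append(lam)
    return np.array(xs), np.array(lams)

T, alpha = 2000, 0.05
deltas = np.where(np.arange(T) % 2 == 0, -0.5, -1.5)  # perturbations with average -1
xs, lams = noisy_dual_ascent(deltas, alpha)
avg_violation = float(np.mean(xs + deltas))           # (1/T) * sum_t (x_t + delta_t)
```

Because the projection can only increase the multiplier relative to the unprojected step, the accumulated constraint value is at most the final multiplier divided by the step size. The average violation is therefore small whenever the multipliers stay bounded, which is the role played by the bounded dual variables in Appendix B.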
The difficulties of applying the approach presented in this section to the online setting are that (i) the cost function varies over time and (ii) it is not known in advance (i.e., it is learnt after the action has been played). To deal with these issues, we replace with its gradient (as explained in Sec. 3) and use a primal-dual proximal gradient approach [Van16, Lecture 12] as explained in Appendix A.
References
- [AD15] Shipra Agrawal and Nikhil R. Devanur. Fast algorithms for online stochastic convex programming. In Proceedings of the Twenty-sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1405–1424, 2015.
- [BNO03] Dimitri P. Bertsekas, Angelia Nedić, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
- [BT03] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- [BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- [CBL06] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- [CGP16] Andrew Cotter, Maya Gupta, and Jan Pfeifer. A light touch for heavily constrained SGD. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 729–771, 2016.
- [CT93] Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993.
- [DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
