Gradient Methods with Regularization for Constrained Optimization Problems and Their Complexity Estimates

Igor Konnov

arXiv:1705.01396·math.OC·May 4, 2017

Open Access

TL;DR

This paper introduces modified gradient and conditional gradient methods for smooth convex optimization in Hilbert spaces, achieving strong convergence and comparable complexity estimates to traditional weakly convergent methods.

Contribution

It proposes simple, implementable modifications that ensure strong convergence and provide complexity estimates for these optimization methods.

Findings

01 Achieve strong convergence in convex optimization

02 Maintain similar complexity estimates to weakly convergent methods

03 Provide practical modifications for gradient-based algorithms

Abstract

We suggest simple implementable modifications of conditional gradient and gradient projection methods for smooth convex optimization problems in Hilbert spaces. Usually, the custom methods attain only weak convergence. We prove strong convergence of the new versions and establish their complexity estimates, which appear similar to the convergence rate of the weakly convergent versions.

Equations

$\min_{x\in D}\to f(x),$

$f^{*}=\inf_{x\in D}f(x).$

$f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y);$

$f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y)-0.5\varkappa\alpha(1-\alpha)\|x-y\|^{2};$

$\limsup_{k\to\infty}f(x^{k})\leq f(z)\quad\left(\liminf_{k\to\infty}f(x^{k})\geq f(z)\right).$

$\min_{x\in D}\to\{f(x)+\varepsilon\varphi(x)\},$

$\min_{x\in D^{*}(f)}\to\varphi(x).$

$x^{k+1}=\pi_{D}[x^{k}-\lambda_{k}(f^{\prime}(x^{k})+\varepsilon_{k}\varphi^{\prime}(x^{k}))],\quad\varepsilon_{k}>0,\ \lambda_{k}>0,\ k=0,1,\dots;$

$\lim_{k\to\infty}\varepsilon_{k}=0,\quad\lim_{k\to\infty}(\lambda_{k}/\varepsilon_{k})=0,\quad\lim_{k\to\infty}\frac{\varepsilon_{k}-\varepsilon_{k+1}}{\lambda_{k}\varepsilon_{k}^{2}}=0,\quad\sum_{k=0}^{\infty}(\varepsilon_{k}\lambda_{k})=\infty;$

$\|f^{\prime}(x)\|\leq M(1+\|x\|)\ \mbox{and}\ \|\varphi^{\prime}(x)\|\leq M(1+\|x\|)\quad\forall x\in D.$

$\lambda_{k}=(k+1)^{-0.5},\quad\varepsilon_{k}=(k+1)^{-\tau},\quad\tau\in(0,0.5).$

$x^{k+1}=\pi_{D}[x^{k}-\lambda_{k}f^{\prime}(x^{k})],\quad\lambda_{k}>0,\ k=0,1,\dots,$

$\lambda_{k}\in[\lambda^{\prime},\lambda^{\prime\prime}],\quad\lambda^{\prime}>0,\ \lambda^{\prime\prime}<2/L.$

$\Delta(x^{k})\leq C/k\quad\mbox{for}\ k=0,1,\dots$

$\min_{x\in D}\to\varphi_{\varepsilon}(x)=\{f(x)+0.5\varepsilon\|x\|^{2}\},$

$\varphi_{\varepsilon}^{*}=\inf_{x\in D}\varphi_{\varepsilon}(x).$

$\|x^{k}-y^{k}\|\leq\delta_{l},$

$\varphi_{\varepsilon_{l}}(x^{k}+\theta^{m}d^{k})\leq\varphi_{\varepsilon_{l}}(x^{k})-\beta\theta^{m}\|d^{k}\|^{2},$

$\langle\varphi_{\varepsilon_{l}}^{\prime}(x^{k})+y^{k}-x^{k},x-y^{k}\rangle\geq 0\quad\forall x\in D;$

$\varphi_{\varepsilon_{l}}(y)\leq\varphi_{\varepsilon_{l}}(x)+\langle\varphi_{\varepsilon_{l}}^{\prime}(x),y-x\rangle+0.5L^{\prime}\|y-x\|^{2};$

$\varphi_{\varepsilon_{l}}(x^{k}+\lambda d^{k})-\varphi_{\varepsilon_{l}}(x^{k})\leq\lambda\langle\varphi_{\varepsilon_{l}}^{\prime}(x^{k}),d^{k}\rangle+0.5L^{\prime}\lambda^{2}\|d^{k}\|^{2}\leq-\lambda(1-0.5L^{\prime}\lambda)\|d^{k}\|^{2}\leq-\beta\lambda\|d^{k}\|^{2},$

$0.5\varepsilon_{l}\|y^{k}-z(\varepsilon_{l})\|^{2}\leq\varphi_{\varepsilon_{l}}(y^{k})-\varphi_{\varepsilon_{l}}^{*}\leq(L^{\prime}+1)\|y^{k}-x^{k}\|\|y^{k}-z(\varepsilon_{l})\|$

$0.5\varepsilon_{l}\|y^{k}-z(\varepsilon_{l})\|^{2}\leq\varphi_{\varepsilon_{l}}(y^{k})-\varphi_{\varepsilon_{l}}^{*}\leq\langle\varphi_{\varepsilon_{l}}^{\prime}(y^{k}),y^{k}-z(\varepsilon_{l})\rangle;$

$\langle\varphi_{\varepsilon_{l}}^{\prime}(y^{k}),y^{k}-z(\varepsilon_{l})\rangle\leq\langle\varphi_{\varepsilon_{l}}^{\prime}(y^{k})-\varphi_{\varepsilon_{l}}^{\prime}(x^{k})-(y^{k}-x^{k}),y^{k}-z(\varepsilon_{l})\rangle+\langle\varphi_{\varepsilon_{l}}^{\prime}(x^{k})+(y^{k}-x^{k}),y^{k}-z(\varepsilon_{l})\rangle$

$\leq\langle\varphi_{\varepsilon_{l}}^{\prime}(y^{k})-\varphi_{\varepsilon_{l}}^{\prime}(x^{k})-(y^{k}-x^{k}),y^{k}-z(\varepsilon_{l})\rangle\leq(\|\varphi_{\varepsilon_{l}}^{\prime}(y^{k})-\varphi_{\varepsilon_{l}}^{\prime}(x^{k})\|+\|y^{k}-x^{k}\|)\|y^{k}-z(\varepsilon_{l})\|\leq(L^{\prime}+1)\|y^{k}-x^{k}\|\|y^{k}-z(\varepsilon_{l})\|.$

$\lim_{l\to\infty}(\delta_{l}/\varepsilon_{l})=0.$

$\|y^{k(l)}-z(\varepsilon_{l})\|\leq 2(L^{\prime}+1)\delta_{l}/\varepsilon_{l},$

$\|w^{l}-y^{k(l)}+y^{k(l)}-z(\varepsilon_{l})\|\leq\|w^{l}-y^{k(l)}\|+\|y^{k(l)}-z(\varepsilon_{l})\|\leq\delta_{l}+\|y^{k(l)}-z(\varepsilon_{l})\|,$

$\|w^{l}-z(\varepsilon_{l})\|\leq(2(L^{\prime}+1)/\varepsilon_{l}+1)\delta_{l}.$

Taxonomy

Topics: Numerical methods in inverse problems · Optimization and Variational Analysis · Advanced Optimization Algorithms Research

Full text

I.V. Konnov (E-mail: [email protected])

*Department of System Analysis and Information Technologies, Kazan Federal University, ul. Kremlevskaya, 18, Kazan 420008, Russia.*

Key words: Convex optimization; Hilbert space; gradient projection method; conditional gradient method; strong convergence; complexity estimates.

MSC codes: 90C25, 65K05, 65J20

1 Introduction

Let $D$ be a convex set in a real Hilbert space $H$ and $f:D\rightarrow\mathbb{R}$ a convex function. Then one can define the optimization problem of finding the minimal value of the function $f$ over the feasible set $D$ . For brevity, we write this problem as

$$\min_{x\in D}\to f(x), \tag{1}$$

its solution set is denoted by $D^{*}(f)$ and the optimal value of the function by $f^{*}$ , i.e.

$$f^{*}=\inf_{x\in D}f(x).$$

For many significant applications this problem is ill-posed, i.e. its solution does not depend continuously on the input data. At the same time, the standard convex optimization methods in general provide only weak convergence to a solution. Hence they do not guarantee approximation of the solution set $D^{*}(f)$ in norm, and even a small perturbation of the input data may cause large deviations from the solution. In order to overcome these drawbacks, various regularization techniques that yield strong convergence can be applied; see e.g. [1]–[4]. The most popular and efficient regularization method was suggested by A.N. Tikhonov; see [5].

That is, a family of perturbed problems with better properties is solved instead of the initial one. However, the solution of such a perturbed problem within a prescribed accuracy may be too difficult even for the convex optimization problem (1). At the same time, various simple and implementable versions of the regularization methods yield slow convergence due to the special restrictive rules for the choice of step-size and regularization parameters; see e.g. [2, 3].

In this paper, we suggest an intermediate variant of the implementable regularization method. We take the conditional gradient and gradient projection methods as basic ones. Each iteration of the selected method is applied to some perturbed convex optimization problem. Unlike the known iterative regularization methods (see [2]), we change the perturbed problem only after some simple estimate inequality is satisfied, which allows us to use rather mild rules for the choice of the parameters. Within these rules we prove strong convergence and establish some complexity estimates for these two-level methods. In particular, they show that this way of incorporating the regularization techniques gives almost the same convergence rate as the standard single-level methods, which provide only weak convergence.

2 Properties of regularization methods

We first recall some definitions. Given a set $X$ , a function $f:X\rightarrow\mathbb{R}$ is said to be

(a) convex, if for each pair of points $x,y\in X$ and for all $\alpha\in[0,1]$ , it holds that

$$f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y);$$

(b) strongly convex with constant $\varkappa>0$ , if for each pair of points $x,y\in X$ and for all $\alpha\in[0,1]$ , it holds that

$$f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y)-0.5\varkappa\alpha(1-\alpha)\|x-y\|^{2};$$

(c) upper (lower) semicontinuous at a point $z\in X$ , if for each sequence $\{x^{k}\}\to z$ , $x^{k}\in X$ , it holds that

$$\limsup_{k\to\infty}f(x^{k})\leq f(z)\quad\left(\liminf_{k\to\infty}f(x^{k})\geq f(z)\right).$$

We will consider problem (1) under the following basic assumptions.

(A1) *$D$ is a nonempty, convex and closed subset of a real Hilbert space $H$ , $f:D\rightarrow\mathbb{R}$ is a lower semicontinuous and convex function.*

The classical Tikhonov regularization method (see [5]) consists in replacing problem (1) with a sequence of perturbed problems of the form

$$\min_{x\in D}\to\{f(x)+\varepsilon\varphi(x)\}, \tag{2}$$

where $\varphi:H\rightarrow\mathbb{R}$ is a lower semicontinuous and strongly convex function, $\varepsilon>0$ is a regularization parameter. We recall the basic approximation property; see e.g. [1, Chapter II, Section 5, Theorem 1].

Proposition 1

Suppose that all the assumptions in (A1) are fulfilled, $D^{*}(f)\neq\varnothing$ , and that $\varphi:H\rightarrow\mathbb{R}$ is a lower semicontinuous and strongly convex function. Then:

(i) problem (2) has a unique solution $z(\varepsilon)$ for each $\varepsilon>0$ ;

(ii) if $\{\varepsilon_{k}\}\searrow 0$ as $k\to+\infty$ , the corresponding sequence $\{z(\varepsilon_{k})\}$ converges strongly to the point $x^{*}_{n}$ that is the unique solution of the problem

$$\min_{x\in D^{*}(f)}\to\varphi(x).$$

The main issue of the above regularization method is its implementation, since we cannot find the point $z(\varepsilon)$ exactly in the general nonlinear case. Clearly, instead of $z(\varepsilon)$ we can in principle take any point $\tilde{z}(\varepsilon)\in D$ such that $\|\tilde{z}(\varepsilon)-z(\varepsilon)\|\leq\xi(\varepsilon)$ with $\xi(\varepsilon)\searrow 0$ as $\varepsilon\searrow 0$ . Then $\{\tilde{z}(\varepsilon_{k})\}$ also converges strongly to the point $x^{*}_{n}$ in case (ii) of Proposition 1. However, it is not easy to guarantee even a prescribed distance approximation to the point $z(\varepsilon)$ in the general case.
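As a small illustration of Proposition 1 (ii), the sketch below uses a hypothetical two-dimensional instance that is not taken from the paper: $f(x)=0.5(x_{1}+x_{2}-2)^{2}$ on $D=\mathbb{R}^{2}$ with $\varphi(x)=0.5\|x\|^{2}$ . Here $D^{*}(f)$ is the line $x_{1}+x_{2}=2$ , its minimal-norm point is $(1,1)$ , and $z(\varepsilon)$ has a closed form, so one can watch the distance to $(1,1)$ shrink as $\varepsilon\searrow 0$ . All function names are assumptions made for this example.

```python
import math


def z(eps):
    """Closed-form minimizer of 0.5*(x1 + x2 - 2)**2 + 0.5*eps*(x1**2 + x2**2).

    The optimality condition reads (x1 + x2 - 2) + eps*x_i = 0 for i = 1, 2,
    so x1 = x2 = t with (2 + eps)*t = 2.
    """
    t = 2.0 / (2.0 + eps)
    return (t, t)


def dist_to_normal_solution(eps):
    # Distance from z(eps) to (1, 1), the minimal-norm solution of the
    # unperturbed toy problem.
    x1, x2 = z(eps)
    return math.hypot(x1 - 1.0, x2 - 1.0)


# Distances for eps = 1, 0.1, ..., 1e-5.
errors = [dist_to_normal_solution(10.0 ** (-j)) for j in range(6)]
```

In this toy case the distance equals $\sqrt{2}\,\varepsilon/(2+\varepsilon)$ , so it decreases monotonically with $\varepsilon$ . In the general nonlinear case no such closed form is available, which is exactly the implementation issue discussed above.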

In [6], the so-called iterative regularization method was proposed; see [2] for more details. The idea of this method consists in simultaneous changes of the regularization parameters and step-sizes of a chosen basic approximation method. In particular, if the functions $f:D\rightarrow\mathbb{R}$ and $\varphi:H\rightarrow\mathbb{R}$ are smooth, we can take the basic gradient projection method for problem (2). Then the corresponding iterative procedure can be determined as follows:

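$$x^{k+1}=\pi_{D}[x^{k}-\lambda_{k}(f^{\prime}(x^{k})+\varepsilon_{k}\varphi^{\prime}(x^{k}))],\quad\varepsilon_{k}>0,\ \lambda_{k}>0,\ k=0,1,\dots; \tag{3}$$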

where

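$$\lim_{k\to\infty}\varepsilon_{k}=0,\quad\lim_{k\to\infty}(\lambda_{k}/\varepsilon_{k})=0,\quad\lim_{k\to\infty}\frac{\varepsilon_{k}-\varepsilon_{k+1}}{\lambda_{k}\varepsilon_{k}^{2}}=0,\quad\sum_{k=0}^{\infty}(\varepsilon_{k}\lambda_{k})=\infty; \tag{4}$$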

and $x^{0}\in D$ . Here and below, $\pi_{X}(x)$ denotes the projection of $x$ onto $X$ .

Proposition 2

[2, Theorem 3.1] Suppose that all the assumptions in (A1) are fulfilled, $D^{*}(f)\neq\varnothing$ , the function $f:D\rightarrow\mathbb{R}$ is smooth, the function $\varphi:H\rightarrow\mathbb{R}$ is smooth and strongly convex, and there exists a constant $M$ such that

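$$\|f^{\prime}(x)\|\leq M(1+\|x\|)\ \mbox{and}\ \|\varphi^{\prime}(x)\|\leq M(1+\|x\|)\quad\forall x\in D.$$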

Then any sequence $\{x^{k}\}$ generated in conformity with rules (3) – (4) converges strongly to the point $x^{*}_{n}$ .

Of course, the implementation of method (3) – (4) is relatively simple. Observe that the conditions in (4) are fulfilled if we set

$$\lambda_{k}=(k+1)^{-0.5},\quad\varepsilon_{k}=(k+1)^{-\tau},\quad\tau\in(0,0.5).$$
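The four conditions in (4) can be checked directly for power-law sequences. The worked computation below assumes the choice $\lambda_{k}=(k+1)^{-1/2}$ , $\varepsilon_{k}=(k+1)^{-\tau}$ with $\tau\in(0,1/2)$ and uses the mean value bound $(k+1)^{-\tau}-(k+2)^{-\tau}\leq\tau(k+1)^{-\tau-1}$ ; it is a verification sketch, not part of the original text.

```latex
\begin{aligned}
\varepsilon_{k} &= (k+1)^{-\tau}\to 0, \\
\frac{\lambda_{k}}{\varepsilon_{k}} &= (k+1)^{\tau-1/2}\to 0
  \quad\text{since } \tau<1/2, \\
\frac{\varepsilon_{k}-\varepsilon_{k+1}}{\lambda_{k}\varepsilon_{k}^{2}}
  &\leq \tau(k+1)^{-\tau-1}\,(k+1)^{1/2+2\tau}
  = \tau(k+1)^{\tau-1/2}\to 0, \\
\sum_{k=0}^{\infty}\varepsilon_{k}\lambda_{k}
  &= \sum_{k=0}^{\infty}(k+1)^{-(1/2+\tau)}=\infty
  \quad\text{since } 1/2+\tau<1 .
\end{aligned}
```

Both the step-sizes and the regularization parameters therefore have to decay at polynomial rates that are tied to each other.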

This means that the convergence of the iterative regularization method may be rather slow in comparison with that of the basic method. In fact, let us consider the standard gradient projection method:

$$x^{k+1}=\pi_{D}[x^{k}-\lambda_{k}f^{\prime}(x^{k})],\quad\lambda_{k}>0,\ k=0,1,\dots, \tag{5}$$

and $x^{0}\in D$ . For brevity, set $\Delta(x)=f(x)-f^{*}$ .

(A2) The function $f:D\rightarrow\mathbb{R}$ is smooth and its gradient satisfies the Lipschitz condition with constant $L$ .

Proposition 3

([7, Theorem 5.1] and [8, Chapter III, Theorem 2.6]) Suppose that (A1) and (A2) are fulfilled, a sequence $\{x^{k}\}$ is generated in conformity with rule (5) where

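$$\lambda_{k}\in[\lambda^{\prime},\lambda^{\prime\prime}],\quad\lambda^{\prime}>0,\ \lambda^{\prime\prime}<2/L. \tag{6}$$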

Then there exists some constant $C<+\infty$ such that

$$\Delta(x^{k})\leq C/k\quad\mbox{for}\ k=0,1,\dots \tag{7}$$

It is well known that method (5) – (6), unlike (3) – (4), provides only weak convergence. At the same time, comparing the step-size rules (4) and (6), we can conclude that it seems rather difficult to obtain an estimate similar to (7) for the iterative regularization method (3) – (4). The same convergence properties were established for the gradient projection method with some other known step-size rules, such as exact one-dimensional minimization and the Armijo rule.
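The gradient projection iteration with a constant step-size below $2/L$ can be sketched in a few lines. The example is an illustration under stated assumptions, not the paper's experiment: the objective $f(x)=0.5(x_{1}+x_{2}-2)^{2}$ (so $L=2$ and $f^{*}=0$ ), the box $D=[0,1.5]^{2}$ , the step $\lambda_{k}=0.5$ , and all function names are hypothetical choices.

```python
def project_box(x, lo, hi):
    # Projection onto the box [lo, hi]^n is a coordinate-wise clip.
    return [min(max(xi, lo), hi) for xi in x]


def f(x):
    return 0.5 * (x[0] + x[1] - 2.0) ** 2


def grad_f(x):
    s = x[0] + x[1] - 2.0
    return [s, s]


def gradient_projection(x0, step, iters, lo=0.0, hi=1.5):
    """Iterate x^{k+1} = proj_D[x^k - step * f'(x^k)] with a constant step."""
    x = list(x0)
    history = [f(x)]  # f* = 0 here, so these values are the gaps Delta(x^k)
    for _ in range(iters):
        g = grad_f(x)
        x = project_box([xi - step * gi for xi, gi in zip(x, g)], lo, hi)
        history.append(f(x))
    return x, history


# Gradient Lipschitz constant is L = 2, so step = 0.5 lies below 2/L = 1.
x_final, gaps = gradient_projection([1.5, 0.0], step=0.5, iters=50)
```

On this toy instance the gap decreases monotonically and much faster than the worst-case $C/k$ bound; the bound describes the general convex case, not this particular problem.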

3 Two-level gradient projection method with regularization

We now describe another way to create an implementable regularization method, which is based on the gradient projection method. The method is applied to problem (1) under the assumptions (A1) and (A2). At each iteration, the gradient projection method is applied to some perturbed problem of form (2); however, unlike the above regularization methods, the perturbed problem is changed only after some simple estimate inequality is satisfied. For simplicity of exposition, we take the standard perturbation function $\varphi(x)=0.5\|x\|^{2}$ . Then we rewrite the perturbed problem as

$$\min_{x\in D}\to\varphi_{\varepsilon}(x)=\{f(x)+0.5\varepsilon\|x\|^{2}\}, \tag{8}$$

and set

$$\varphi_{\varepsilon}^{*}=\inf_{x\in D}\varphi_{\varepsilon}(x).$$

Observe that problem (8) has the unique solution $z(\varepsilon)$ for each $\varepsilon>0$ under the assumptions (A1) and (A2) due to Proposition 1 (i), hence $\varphi^{*}_{\varepsilon}=\varphi_{\varepsilon}(z(\varepsilon))$ . Denote by $\mathbb{Z}_{+}$ the set of non-negative integers.

Method (GPRM).

Step 0: Choose a point $w^{0}\in D$ , numbers $\beta\in(0,1)$ , $\theta\in(0,1)$ , sequences $\{\delta_{l}\}\searrow 0$ and $\{\varepsilon_{l}\}\searrow 0$ . Set $l=1$ .

Step 1: Set $x^{0}=w^{l-1}$ , $k=0$ .

Step 2: Take $y^{k}=\pi_{D}[x^{k}-\varphi_{\varepsilon_{l}}^{\prime}(x^{k})]$ . If

$$\|x^{k}-y^{k}\|\leq\delta_{l}, \tag{9}$$

set $w^{l}={\rm argmin}\{\varphi_{\varepsilon_{l}}(x^{k}),\varphi_{\varepsilon_{l}}(y^{k})\}$ , $l=l+1$ and go to Step 1. (Change the perturbation)

Step 3: Set $d^{k}=y^{k}-x^{k}$ , determine $m$ as the smallest number in $\mathbb{Z}_{+}$ such that

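$$\varphi_{\varepsilon_{l}}(x^{k}+\theta^{m}d^{k})\leq\varphi_{\varepsilon_{l}}(x^{k})-\beta\theta^{m}\|d^{k}\|^{2}, \tag{10}$$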

set $\lambda_{k}=\theta^{m}$ , $x^{k+1}=x^{k}+\lambda_{k}d^{k}$ , $k=k+1$ , and go to Step 2.

We see that the upper level changes the current perturbed problem, which is associated with the index $l$ , whereas the lower level with iterations in $k$ is nothing but the standard gradient projection method with the Armijo step-size rule applied to the fixed perturbed problem (8) with $\varepsilon=\varepsilon_{l}$ . Clearly, condition (9) is very simple and easy to verify.
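The two-level structure of (GPRM) can be sketched as follows. This is a minimal illustration under assumptions that are not in the paper: the toy objective $f(x)=0.5(x_{1}+x_{2}-2)^{2}$ , the box $D=[-2,2]^{2}$ , the sequences $\varepsilon_{l}=2^{-l}$ and $\delta_{l}=\varepsilon_{l}^{1+\sigma}$ with $\sigma=0.5$ (chosen so that $\delta_{l}/\varepsilon_{l}\to 0$ ), a finite number of levels, and the safeguard `max_inner`. All names are hypothetical.

```python
import math


def project_box(x, lo=-2.0, hi=2.0):
    # Projection onto D = [lo, hi]^n (assumed feasible set for this example).
    return [min(max(xi, lo), hi) for xi in x]


def f(x):
    return 0.5 * (x[0] + x[1] - 2.0) ** 2


def grad_f(x):
    s = x[0] + x[1] - 2.0
    return [s, s]


def phi(x, eps):
    # Perturbed objective f(x) + 0.5*eps*||x||^2.
    return f(x) + 0.5 * eps * sum(xi * xi for xi in x)


def grad_phi(x, eps):
    return [gi + eps * xi for gi, xi in zip(grad_f(x), x)]


def gprm(w0, levels, beta=0.5, theta=0.5, sigma=0.5, max_inner=100000):
    w = list(w0)
    inner_counts = []
    for l in range(1, levels + 1):
        eps = 0.5 ** l                # assumed eps_l, decreasing to 0
        delta = eps ** (1.0 + sigma)  # assumed delta_l with delta_l/eps_l -> 0
        x, k = list(w), 0             # Step 1: restart from w^{l-1}
        while True:
            g = grad_phi(x, eps)
            y = project_box([xi - gi for xi, gi in zip(x, g)])  # Step 2
            d = [yi - xi for yi, xi in zip(y, x)]
            d_norm2 = sum(di * di for di in d)
            # Stopping test ||x^k - y^k|| <= delta_l; max_inner is a safeguard
            # that is not part of the method.
            if math.sqrt(d_norm2) <= delta or k >= max_inner:
                w = x if phi(x, eps) <= phi(y, eps) else y
                break
            lam, phi_x = 1.0, phi(x, eps)  # Step 3: Armijo, lam = theta**m
            while (phi([xi + lam * di for xi, di in zip(x, d)], eps)
                   > phi_x - beta * lam * d_norm2):
                lam *= theta
            x = [xi + lam * di for xi, di in zip(x, d)]
            k += 1
        inner_counts.append(k)
    return w, inner_counts


w_final, inner_counts = gprm([2.0, -1.0], levels=10)
# Distance to (1, 1), the minimal-norm solution of this toy problem.
error = math.hypot(w_final[0] - 1.0, w_final[1] - 1.0)
```

The list `inner_counts` records how many lower-level iterations each perturbed problem needed, which is the quantity summed in a complexity estimate. Since $x^{k}$ and $y^{k}$ both lie in $D$ and $\lambda_{k}\in(0,1]$ , every iterate stays feasible without an extra projection.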

We now give some useful properties of the gradient projection method.

Lemma 1

Suppose that (A1) and (A2) are fulfilled. Fix any $l$ . Then we have

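$$\langle\varphi_{\varepsilon_{l}}^{\prime}(x^{k})+y^{k}-x^{k},x-y^{k}\rangle\geq 0\quad\forall x\in D; \tag{11}$$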

for any $k=0,1,\ldots$ ; besides, $\lambda_{k}\geq\gamma>0$ for any $k=0,1,\ldots$

Proof. Relation (11) follows directly from the projection properties. Next, under the assumptions made, the gradient of the function $\varphi_{\varepsilon_{l}}$ satisfies the Lipschitz condition with constant $L^{\prime}=L+\varepsilon_{0}$ . Hence, for any pair of points $x,y$ we now have

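$$\varphi_{\varepsilon_{l}}(y)\leq\varphi_{\varepsilon_{l}}(x)+\langle\varphi_{\varepsilon_{l}}^{\prime}(x),y-x\rangle+0.5L^{\prime}\|y-x\|^{2};$$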

see [8, Chapter III, Lemma 1.2]. Then (11) gives

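$$\begin{aligned}\varphi_{\varepsilon_{l}}(x^{k}+\lambda d^{k})-\varphi_{\varepsilon_{l}}(x^{k})&\leq\lambda\langle\varphi_{\varepsilon_{l}}^{\prime}(x^{k}),d^{k}\rangle+0.5L^{\prime}\lambda^{2}\|d^{k}\|^{2}\\&\leq-\lambda(1-0.5L^{\prime}\lambda)\|d^{k}\|^{2}\leq-\beta\lambda\|d^{k}\|^{2},\end{aligned}$$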

if $\lambda\leq\bar{\lambda}=2(1-\beta)/L^{\prime}$ . It follows from (10) that $\lambda_{k}\geq\gamma=\min\{1,\theta\bar{\lambda}\}>0$ . $\Box$

We show that the sequence of perturbed problems is infinite.

Lemma 2

Suppose that (A1) and (A2) are fulfilled. Then the number of iterations in $k$ for each number $l$ is finite.

Proof. It follows from (10) and Lemma 1 that $\varphi_{\varepsilon_{l}}(x^{k+1})\leq\varphi_{\varepsilon_{l}}(x^{k})-\beta\gamma\|d^{k}\|^{2}$ , but $\varphi^{*}_{\varepsilon}>-\infty$ , hence $\lim\limits_{k\rightarrow\infty}d^{k}=\mathbf{0}$ , and the result follows. $\Box$

The next property enables us to evaluate the approximation error.

Lemma 3

Suppose that (A1) and (A2) are fulfilled. Fix any $l$ . Then

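$$0.5\varepsilon_{l}\|y^{k}-z(\varepsilon_{l})\|^{2}\leq\varphi_{\varepsilon_{l}}(y^{k})-\varphi_{\varepsilon_{l}}^{*}\leq(L^{\prime}+1)\|y^{k}-x^{k}\|\|y^{k}-z(\varepsilon_{l})\| \tag{12}$$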

for any $k=0,1,\ldots$

Proof. Since $\varphi_{\varepsilon_{l}}$ is strongly convex with modulus $\varepsilon_{l}$ , we have

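$$0.5\varepsilon_{l}\|y^{k}-z(\varepsilon_{l})\|^{2}\leq\varphi_{\varepsilon_{l}}(y^{k})-\varphi_{\varepsilon_{l}}^{*}\leq\langle\varphi_{\varepsilon_{l}}^{\prime}(y^{k}),y^{k}-z(\varepsilon_{l})\rangle;$$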

see e.g. [1, Chapter I, Section 2]. Next, (11) gives

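$$\begin{aligned}\langle\varphi_{\varepsilon_{l}}^{\prime}(y^{k}),y^{k}-z(\varepsilon_{l})\rangle&\leq\langle\varphi_{\varepsilon_{l}}^{\prime}(y^{k})-\varphi_{\varepsilon_{l}}^{\prime}(x^{k})-(y^{k}-x^{k}),y^{k}-z(\varepsilon_{l})\rangle\\&\quad+\langle\varphi_{\varepsilon_{l}}^{\prime}(x^{k})+(y^{k}-x^{k}),y^{k}-z(\varepsilon_{l})\rangle\\&\leq\langle\varphi_{\varepsilon_{l}}^{\prime}(y^{k})-\varphi_{\varepsilon_{l}}^{\prime}(x^{k})-(y^{k}-x^{k}),y^{k}-z(\varepsilon_{l})\rangle\\&\leq(\|\varphi_{\varepsilon_{l}}^{\prime}(y^{k})-\varphi_{\varepsilon_{l}}^{\prime}(x^{k})\|+\|y^{k}-x^{k}\|)\|y^{k}-z(\varepsilon_{l})\|\\&\leq(L^{\prime}+1)\|y^{k}-x^{k}\|\|y^{k}-z(\varepsilon_{l})\|.\end{aligned}$$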

It follows that (12) holds true. $\Box$

We are ready to establish the basic convergence property for (GPRM).

Theorem 1

Suppose that (A1) and (A2) are fulfilled, $D^{*}(f)\neq\varnothing$ , and that we apply (GPRM) with

$$\lim_{l\to\infty}(\delta_{l}/\varepsilon_{l})=0. \tag{13}$$

Then:

(i) the number of iterations in $k$ for each number $l$ is finite;

(ii) the sequence $\{w^{l}\}$ converges strongly to the point $x^{*}_{n}$ .

Proof. Assertion (i) has been obtained in Lemma 2. Fix any $l$ and denote by $k(l)$ the maximal value of the index $k$ for this $l$ , i.e. $\|y^{k(l)}-x^{k(l)}\|\leq\delta_{l}$ . Then (12) gives

$$\|y^{k(l)}-z(\varepsilon_{l})\|\leq 2(L^{\prime}+1)\delta_{l}/\varepsilon_{l},$$

but

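$$\|w^{l}-y^{k(l)}+y^{k(l)}-z(\varepsilon_{l})\|\leq\|w^{l}-y^{k(l)}\|+\|y^{k(l)}-z(\varepsilon_{l})\|\leq\delta_{l}+\|y^{k(l)}-z(\varepsilon_{l})\|,$$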

hence

$$\|w^{l}-z(\varepsilon_{l})\|\leq(2(L^{\prime}+1)/\varepsilon_{l}+1)\delta_{l}.$$

Therefore, by (13),

[TABLE]

Due to Proposition 1 (ii), $\{z(\varepsilon_{l})\}$ converges strongly to $x^{*}_{n}$ . Therefore, assertion (ii) is also true. $\Box$

We observe that inserting the control sequence $\{\delta_{l}\}$ does not require additional computational expenses per iteration, but implies the strong convergence, whereas the usual gradient projection method provides only weak convergence as indicated above. Besides, rule (13) is clearly less restrictive than (4) and maintains significant freedom for the choice of parameters.

4 Complexity estimate

It was observed in Section 2 that the usual gradient projection method has the convergence rate $\Delta(x^{k})\leq C/k$ under the assumptions (A1) and (A2); see Proposition 3 and the remarks below. This means that the total number of iterations $N(\alpha)$ that is necessary for attaining some prescribed accuracy $\alpha>0$ is estimated as follows:

[TABLE]

We intend to obtain a similar estimate for (GPRM). Namely, we define the complexity of (GPRM), denoted by $N(\alpha)$ , as the total number of iterations in $k$ that is necessary for attaining any accuracy $\alpha>0$ . In order to establish an upper bound for $N(\alpha)$ we need certain auxiliary properties. We recall that $z(\varepsilon)$ denotes the solution of the perturbed problem (8) for $\varepsilon>0$ , which is defined uniquely under (A1). Hence $z(0)$ denotes any solution of problem (1).
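The passage from a rate of the form $\Delta(x^{k})\leq C/k$ to an iteration count is a one-line computation. The worked step below spells it out, assuming that $C$ is the constant from Proposition 3 and that accuracy $\alpha$ means $\Delta(x^{k})\leq\alpha$ ; it is an explanatory sketch rather than a statement from the paper.

```latex
\Delta(x^{k})\leq\frac{C}{k}\leq\alpha
\quad\text{whenever}\quad k\geq\frac{C}{\alpha},
\qquad\text{so}\qquad
N(\alpha)\leq\left\lceil\frac{C}{\alpha}\right\rceil .
```

For instance, with the hypothetical values $C=10$ and $\alpha=10^{-3}$ , at most $10^{4}$ iterations are needed.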

Lemma 4

Suppose that (A1) holds. Then for any numbers $\mu$ and $\eta$ such that $0\leq\mu<\eta$ we have

[TABLE]

Proof. By definition,

[TABLE]

These relations give (15) and (16), besides, we also have

[TABLE]

which gives (17). $\Box$

Denote by $N_{(l)}$ the total number of iterations in $k$ for any fixed $l$ in (GPRM) and by $l(\alpha)$ the maximal number $l$ of the upper iteration such that $\alpha\leq\Delta(w^{l})$ for any given $\alpha>0$ . Then we can evaluate the complexity of (GPRM) as follows:

[TABLE]

Using this inequality, we now obtain the basic estimate.

Theorem 2

Suppose that (A1) and (A2) are fulfilled, $D^{*}(f)\neq\varnothing$ , and that we apply (GPRM) with

[TABLE]

Then (GPRM) has the complexity estimate

[TABLE]

where $C_{1}=2(L^{\prime}+1)^{2}\varepsilon_{0}^{1+2\sigma}+0.5\varepsilon_{0}\|x^{*}_{n}\|^{2}$ and $C_{2}=C_{1}/(\beta\gamma\varepsilon^{2(1+\sigma)}_{0})$ .

Proof. First we note that (19) implies (13), hence all the assertions of Theorem 1 remain true. Fix any $l$ . Then, due to (10) and Lemma 1, we have

[TABLE]

therefore,

[TABLE]

However,

[TABLE]

From (12) we have

[TABLE]

hence

[TABLE]

It follows that

[TABLE]

whereas (16) and (17) give

[TABLE]

Therefore,

[TABLE]

where

[TABLE]

Using these relations in (20) we have

[TABLE]

where

[TABLE]

In view of (18) and (22) we obtain

[TABLE]

We now proceed to evaluate $\nu^{-l(\alpha)}$ . By definition,

[TABLE]

hence,

[TABLE]

From (21) we have

[TABLE]

whereas applying (15) with $\mu=0$ and $\eta=\varepsilon_{l}$ gives

[TABLE]

Therefore,

[TABLE]

In view of (19) we have

[TABLE]

It follows that $\nu^{-l(\alpha)}\leq C_{1}/\alpha$ . Applying this inequality in (23) we obtain

[TABLE]

and the result follows. $\Box$

From Theorem 2 we conclude that the complexity estimate of (GPRM) tends to (14) when $\sigma\to 0$ . However, we can choose $\sigma$ arbitrarily in $(0,1]$ . Therefore, taking $\sigma$ small enough, we can obtain any approximation of the convergence rate of the usual gradient projection method under the same assumptions. At the same time, (GPRM), unlike the gradient projection method, attains the strong convergence.

5 Two-level conditional gradient method with regularization

We now describe a similar modification of the conditional gradient method under the following basic assumptions for problem (1).

(A3) *$D$ is a nonempty, convex, closed, and bounded subset of a real Hilbert space $H$ , $f:D\rightarrow\mathbb{R}$ is a smooth convex function and its gradient satisfies the Lipschitz condition with constant $L$ .*

The boundedness of $D$ guarantees the method is well-defined. Besides, now problem (1) has a solution, i.e. $D^{*}(f)\neq\varnothing$ . We recall that the conditional gradient method was first suggested in [9] for the case when the goal function is quadratic and the feasible set is polyhedral and further was developed by many authors; see e.g. [7, 8, 10, 11, 12]. The main idea of this method consists in linearization of the goal function, so that solution of the linearized problem over the initial feasible set serves for finding the descent direction.

Following [7, 8], we describe one of the various versions of the standard conditional gradient method.

Method (CGM).

Step 0: Choose a point $x^{0}\in D$ , set $k=0$ .

Step 1: Find a point $y^{k}\in D$ as a solution of the problem

[TABLE]

set $d^{k}=y^{k}-x^{k}$ .

Step 2: If $d^{k}=\mathbf{0}$ , stop. Otherwise choose a number $\theta_{k}>0$ , set $\beta_{k}=-\langle f^{\prime}(x^{k}),d^{k}\rangle/\|d^{k}\|^{2}$ , $\lambda_{k}=\min\{1,\theta_{k}\beta_{k}\}$ , $x^{k+1}=x^{k}+\lambda_{k}d^{k}$ , $k=k+1$ , and go to Step 1.

Clearly, termination of the method yields a solution. For this reason, we will consider only the non-trivial case where the sequence $\{x^{k}\}$ is infinite.
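A conditional gradient iteration of this kind can be sketched on a box, where the linearized subproblem is solved coordinate by coordinate. Several ingredients are assumptions made for the example and are not confirmed by the text above: the subproblem is taken to be the minimization of $\langle f^{\prime}(x^{k}),y\rangle$ over $y\in D$ , the parameter $\theta_{k}$ is held constant, the exact test $d^{k}=\mathbf{0}$ is replaced by a tolerance on $-\langle f^{\prime}(x^{k}),d^{k}\rangle$ for floating-point use, and the objective $f(x)=0.5((x_{1}-1)^{2}+(x_{2}-2)^{2})$ on $D=[0,1.5]^{2}$ is a hypothetical instance.

```python
def lmo_box(g, lo, hi):
    """Minimize <g, y> over the box [lo, hi]^n, coordinate by coordinate."""
    return [lo if gi > 0.0 else hi for gi in g]


def cgm(x0, grad, theta, lo, hi, max_iter=1000, tol=1e-10):
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        y = lmo_box(g, lo, hi)                       # Step 1 (assumed form)
        d = [yi - xi for yi, xi in zip(y, x)]
        gap = -sum(gi * di for gi, di in zip(g, d))  # equals -<f'(x^k), d^k> >= 0
        if gap <= tol:                               # practical stand-in for d^k = 0
            return x, k
        beta_k = gap / sum(di * di for di in d)      # gap > 0 implies d != 0
        lam = min(1.0, theta * beta_k)               # Step 2 step-size
        x = [xi + lam * di for xi, di in zip(x, d)]
    return x, max_iter


# Hypothetical instance: the constrained minimizer on [0, 1.5]^2 is (1, 1.5).
x_cg, iters = cgm(
    [0.0, 0.0],
    grad=lambda x: [x[0] - 1.0, x[1] - 2.0],
    theta=1.0,
    lo=0.0,
    hi=1.5,
)
```

The method needs only a linear minimization over $D$ per iteration instead of a projection, which is its main practical appeal when projections are expensive.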

Proposition 4

([7, Theorem 6.1] and [8, Chapter III, Theorem 1.7]) Suppose that (A3) is fulfilled, a sequence $\{x^{k}\}$ is generated by (CGM) where

[TABLE]

Then there exists some constant $C<+\infty$ such that

[TABLE]

That is, estimate (24) is the same as (7), but it cannot be improved even if the function $f$ is strongly convex. Besides, (CGM) also provides only weak convergence. The same convergence properties were established for the conditional gradient method with other known step-size rules, such as exact one-dimensional minimization and the Armijo rule; see [8, 10, 11].

Some versions of the iterative regularization method based on conditional gradient iterations were described in [1, Chapter II, Section 11] and [2, Chapter IV, Section 1]. They provide strong convergence but use restrictive control rules for the regularization parameters and step-sizes, which are similar to (4). In particular, the version from [2] uses exact one-dimensional minimization for the choice of the step-size and takes the rule

[TABLE]

for the regularization parameter. This means that the convergence of the iterative regularization version may be rather slow in comparison with that of the basic conditional gradient method.

We now describe another implementable conditional gradient method with regularization, which follows the approach given in Section 3. That is, the standard conditional gradient method is applied to some perturbed problem of form (2); however, the perturbed problem is changed only after some simple estimate inequality is satisfied. We again take the standard perturbation function $\varphi(x)=0.5\|x\|^{2}$ and hence the perturbed problem (8), which has the unique solution $z(\varepsilon)$ for each $\varepsilon>0$ under the assumptions in (A3).

Method (CGRM).

Step 0: Choose a point $w^{0}\in D$ , numbers $\beta\in(0,1)$ , $\theta\in(0,1)$ , sequences $\{\delta_{l}\}\searrow 0$ and $\{\varepsilon_{l}\}\searrow 0$ . Set $l=1$ .

Step 1: Set $x^{0}=w^{l-1}$ , $k=0$ .

Step 2: Find a point $y^{k}\in D$ as a solution of the problem

[TABLE]

set $d^{k}=y^{k}-x^{k}$ , $\mu_{k,l}=-\langle\varphi_{\varepsilon_{l}}^{\prime}(x^{k}),d^{k}\rangle$ . If

[TABLE]

set $w^{l}=x^{k}$ , $l=l+1$ and go to Step 1. (Change the perturbation)

Step 3: Determine $m$ as the smallest number in $\mathbb{Z}_{+}$ such that

[TABLE]

set $\lambda_{k}=\theta^{m}$ , $x^{k+1}=x^{k}+\lambda_{k}\mu_{k,l}d^{k}$ , $k=k+1$ , and go to Step 2.

We see again that the upper level changes the current perturbed problem associated with the index $l$ , whereas the lower level with iterations in $k$ is nothing but the conditional gradient method with the Armijo step-size rule applied to the fixed perturbed problem. Clearly, condition (25) is very simple and easy to verify.

We now give a lower bound for the step-size.

Lemma 5

Suppose that (A3) is fulfilled. Fix any $l$ . Then

[TABLE]

for any $k=0,1,\ldots$

Proof. As noted above, under the assumptions made, the gradient of the function $\varphi_{\varepsilon_{l}}$ satisfies the Lipschitz condition with constant $L^{\prime}=L+\varepsilon_{0}$ . Hence, for any pair of points $x,y$ we now have

[TABLE]

Therefore,

[TABLE]

if

[TABLE]

or $\lambda\leq\lambda^{\prime}=2(1-\beta)/(L^{\prime}B^{2})$ , where $B$ denotes the diameter of the set $D$ . Fix any point $\bar{x}\in D$ . Then

[TABLE]

hence setting $\lambda^{\prime\prime}=1/(L^{\prime\prime}B)$ gives $\mu_{k,l}\lambda^{\prime\prime}\leq 1$ . Set $\gamma=\min\{\theta,\lambda^{\prime},\lambda^{\prime\prime}\}>0$ . It follows now from (26) that $\lambda_{k}\geq\gamma$ . $\Box$

We now show that the sequence of perturbed problems is infinite.

Lemma 6

Suppose that (A3) is fulfilled. Then the number of iterations in $k$ for each number $l$ is finite.

Proof. It follows from (26) that $\varphi_{\varepsilon_{l}}(x^{k+1})\leq\varphi_{\varepsilon_{l}}(x^{k})-\beta\gamma\mu_{k,l}^{2}$ , but $\varphi^{*}_{\varepsilon}>-\infty$ , hence $\lim\limits_{k\rightarrow\infty}\mu_{k,l}=0$ , and the result follows. $\Box$

The next property enables us to evaluate the approximation error.

Lemma 7

Suppose that (A3) is fulfilled. Fix any $l$ . Then

[TABLE]

for any $k=0,1,\ldots$

Proof. Since $\varphi_{\varepsilon_{l}}$ is strongly convex with modulus $\varepsilon_{l}$ , we have

[TABLE]

see e.g. [1, Chapter I, Section 2]. By definition, we have

[TABLE]

which gives (27). $\Box$

We are ready to establish the basic convergence property for (CGRM).

Theorem 3

Suppose that (A3) is fulfilled and that we apply (CGRM) with (13). Then:

(i) the number of iterations in $k$ for each number $l$ is finite;

(ii) the sequence $\{w^{l}\}$ converges strongly to the point $x^{*}_{n}$ .

Proof. Assertion (i) has been obtained in Lemma 6. Fix any $l$ and denote by $k(l)$ the maximal value of the index $k$ for this $l$ . Then $\mu_{k(l),l}\leq\delta_{l}$ and (27) gives

[TABLE]

hence, by (13),

[TABLE]

Due to Proposition 1 (ii), $\{z(\varepsilon_{l})\}$ converges strongly to $x^{*}_{n}$ . Therefore, assertion (ii) is also true. $\Box$

We also notice that rule (13) is clearly less restrictive than (4) and maintains significant freedom for the choice of parameters.

Due to Proposition 4, the total number of iterations $N(\alpha)$ of the conditional gradient method that is necessary for attaining some prescribed accuracy $\alpha>0$ is estimated as follows:

[TABLE]

We intend to obtain a similar estimate for (CGRM). As above in Section 4, we define the complexity of (CGRM), denoted by $N(\alpha)$ , as the total number of iterations in $k$ that is necessary for attaining any given accuracy $\alpha>0$ .

Denote by $N_{(l)}$ the total number of iterations in $k$ for any fixed $l$ in (CGRM) and by $l(\alpha)$ the maximal number $l$ of the upper iteration such that $\alpha\leq\Delta(w^{l})$ for any given $\alpha>0$ . Then we can evaluate the complexity of (CGRM) as follows:

[TABLE]

cf. (18). Using this inequality, we now obtain the basic estimate. Its substantiation is somewhat different from the proof of Theorem 2.

Theorem 4

Suppose that (A3) is fulfilled and that we apply (CGRM) with (19). Then (CGRM) has the complexity estimate

[TABLE]

where $C_{1}=\varepsilon_{0}^{1+2\sigma}+0.5\varepsilon_{0}\|x^{*}_{n}\|^{2}$ and $C_{2}=C_{1}/(\beta\gamma\varepsilon^{2(1+\sigma)}_{0})$ .

Proof. First we note that (19) implies (13), hence all the assertions of Theorem 3 remain true. Fix any $l$ . Then, due to (26) and Lemma 5, we have

[TABLE]

therefore,

[TABLE]

However,

[TABLE]

From (27) we have

[TABLE]

whereas (16) and (17) give

[TABLE]

Therefore,

[TABLE]

where

[TABLE]

Using these relations in (30) we have

[TABLE]

where

[TABLE]

In view of (29) and (32) we obtain

[TABLE]

We now proceed to evaluate $\nu^{-l(\alpha)}$ . By definition,

[TABLE]

Applying (15) with $\mu=0$ and $\eta=\varepsilon_{l}$ gives

[TABLE]

From (31) it now follows that

[TABLE]

In view of (19) we have

[TABLE]

It follows that $\nu^{-l(\alpha)}\leq C_{1}/\alpha$ . Applying this inequality in (33) we obtain

[TABLE]

and the result follows. $\Box$

From Theorem 4 we conclude that the complexity estimate of (CGRM) tends to (28) when $\sigma\to 0$ . Due to (19), we can choose $\sigma$ arbitrarily in $(0,1]$ . Therefore, taking $\sigma$ small enough, we can obtain any approximation of the best convergence rate of the usual conditional gradient method under the same assumptions. At the same time, (CGRM) attains the strong convergence.

6 Conclusions

We suggested simple implementable versions of the combined regularization and gradient methods for smooth convex optimization problems in Hilbert spaces. We took the basic conditional gradient and gradient projection methods and proved strong convergence of their modified versions under rather mild rules for the choice of the parameters. Within these rules we also established complexity estimates for the methods. They show that this way of incorporating the regularization techniques gives a convergence rate similar to that of the standard methods, which provide only weak convergence under the same assumptions.

Acknowledgement

This work was supported by the RFBR grant, project No. 13-01-00368-a.

Bibliography

[1] F.P. Vasil’yev, Methods for Solving Extremal Problems, Nauka, Moscow, 1981.
[2] A.B. Bakushinskii and A.V. Goncharskii, Iterative Solution Methods for Ill-Posed Problems, Nauka, Moscow, 1989.
[3] V.V. Vasin and A.L. Ageev, Incorrect Problems with A Priori Information, Nauka, Ekaterinburg, 1993.
[4] H.W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic Publishers, Dordrecht, 1996.
[5] A.N. Tikhonov, On the solution of ill-posed problems and regularization method, Dokl. Akad. Nauk SSSR, vol. 151 (1963), pp. 501–504.
[6] A.B. Bakushinskii and B.T. Polyak, On the solution of variational inequalities, Sov. Math. Dokl., vol. 15 (1974), pp. 1705–1710.
[7] E.S. Levitin and B.T. Polyak, Constrained minimization methods, USSR Comp. Maths. Math. Phys., vol. 6 (1966), pp. 1–50.
[8] V.F. Dem’yanov and A.M. Rubinov, Approximate Methods for Solving Extremum Problems, Leningrad Univ. Press, Leningrad, 1968. (Engl. transl. in Elsevier, Amsterdam, 1970)