The condition number of a function relative to a set

David H. Gutman; Javier F. Peña

arXiv:1901.08359·math.OC·April 21, 2020·Math. Program.

TL;DR

This paper introduces a new concept of a relative condition number for convex functions with respect to a set, extending classical notions to constrained optimization and providing bounds and characterizations for specific function-set pairs.

Contribution

The paper defines a relative condition number for convex functions relative to a set, generalizing classical condition numbers and analyzing its properties and bounds in specific cases.

Findings

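1. The relative condition number extends the geometric interpretation of the classical condition number and its role in characterizing linear convergence.

2. Characterizations and bounds are provided for functions of the form $f=g\circ A$ relative to a convex cone or a polyhedron.

3. The relative condition number and related constants yield linear convergence rates for mirror descent, Frank-Wolfe, and Frank-Wolfe with away steps.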

Table 1: Index of symbols introduced in the paper

Symbol	Section	Equation
$Z_{f,X}(y)$	2	(3)
$L_{f,X,D}$ and $\mu_{f,X,D}$	2.1	(8)
$L_{f}$ and $\mu_{f}$	2.1	(9)
$\mu_{f,X,D}^{\star}$ and $\mu_{f,X,D}^{\sharp}$	2.2	(12)
$Z_{A,X}(y)$	3	(13)
$A|C$ and $(A|C)^{-1}$	3	(14) and (15)
$\|A|C\|$ and $\|(A|C)^{-1}\|$	3	(16)
$\mathcal{T}(A|X)$	3.2	(22)
$\Phi(A)$ and $\mathsf{diam}(A)$	3.2	(25) and (26)
$\mathcal{T}(A|X,S)$	4.1	(32)

Equations

$$\frac{L_{f}}{\mu_{f}}=\|A^{\text{\sf T}}A\|\cdot\|(A^{\text{\sf T}}A)^{-1}\|=\left(\|A\|\cdot\|A^{-1}\|\right)^{2}.$$

$$f^{\star}:=\min_{x\in{\mathbb{R}}^{m}}f(x).\tag{1}$$

$$\|X^{\star}-x_{k}\|_{2}^{2}\leq\left(1-\frac{\mu_{f}}{L_{f}}\right)^{k}\|X^{\star}-x_{0}\|_{2}^{2}$$

$$f(x_{k})-f^{\star}\leq\frac{L_{f}}{2}\left(1-\frac{\mu_{f}}{L_{f}}\right)^{k}\|X^{\star}-x_{0}\|_{2}^{2},$$

$$f^{\star}:=\min_{x\in X}f(x).\tag{2}$$

$$f(x_{k})-f^{\star}\leq L_{f,X,D_{h}}\left(1-\frac{\mu_{f,X,D_{h}}^{\star}}{L_{f,X,D_{h}}}\right)^{k}D_{h}(x^{\star},x_{0})$$

$$f(x_{k})-f^{\star}\leq\left(1-\frac{\mu_{f,X,\mathfrak{R}}^{\star}}{L_{f,X,\mathfrak{R}}}\right)^{k}(f(x_{0})-f^{\star}).$$

$$f(x_{k})-f^{\star}\leq\left(1-\min\left\{\frac{1}{2},\frac{\mu_{f,X,\mathfrak{G}}^{\star}}{4L_{f,X,\mathfrak{D}}}\right\}\right)^{k/2}(f(x_{0})-f^{\star}).$$

$$D_{h}(y,x):=h(y)-h(x)-\langle\nabla h(x),y-x\rangle.$$

$$D(y,x):=\frac{1}{2}\|y-x\|^{2}.$$

$$\mathfrak{r}(y,x):=\inf\{\rho>0:y-x=\rho\cdot(u-x)\text{ for some }u\in X\}.$$

$$\mathfrak{d}(y,x):=\inf\{\delta>0:y-x=\delta\cdot(u-v)\text{ for some }u,v\in X\}.$$

$$Z_{f,X}(y):=\{x\in X:f(x)=f(y)\text{ and }\langle\nabla f(x)-\nabla f(y),x-y\rangle=0\}.\tag{3}$$

$$Z_{f,X}(y)=\{x\in X:f(x+\lambda(y-x))=f(y)\text{ for all }\lambda\in[0,1]\}.$$

$$f(x):=\min_{y\in{\mathbb{B}}^{n}}\|x-y\|_{2}^{2},$$

$$Z_{f,X}(y)=\left\{\begin{array}{ll}\{y\}&\text{if }y\not\in{\mathbb{B}}^{n}\\ {\mathbb{B}}^{n}&\text{if }y\in{\mathbb{B}}^{n}.\end{array}\right.$$

$$D_{f}(y,x)=f(y)-f(x)-\langle\nabla f(x),y-x\rangle.$$

$$D_{f}(y,x)\leq LD(y,x)\text{ for all }x,y\in{\mathrm{dom}}(f).\tag{4}$$

$$D_{f}(y,x)\geq\mu D(y,x)\text{ for all }x,y\in{\mathrm{dom}}(f).\tag{5}$$

$$D_{f}(y,x)\leq LD(y,x)\text{ for all }x,y\in X.\tag{6}$$

$$D_{f}(Z_{f,X}(y),x)\geq\mu D(Z_{f,X}(y),x)\text{ for all }x,y\in X.\tag{7}$$

$$L_{f,X,D}:=\inf\{L>0:\text{(6) holds}\},\qquad\mu_{f,X,D}:=\sup\{\mu\geq 0:\text{(7) holds}\}.\tag{8}$$

$$L_{f}:=\inf\{L>0:\text{(4) holds}\},\qquad\mu_{f}:=\sup\{\mu\geq 0:\text{(5) holds}\}.\tag{9}$$

$$\mu_{f}=\epsilon^{2}\ll 1=\mu_{f,X,D}=L_{f,X,D}\ll M^{2}=L_{f}.$$

$$\mu_{f,X,D}=\left(\max\{r:r{\mathbb{B}}^{m}\subseteq A({\mathbb{B}}^{n}\cap{\mathbb{R}}^{n}_{+})\}\right)^{2},$$

$$\mu_{f,X,D}=2\epsilon^{2}\ll 1+2\epsilon^{2}=\sigma_{\min}(A)^{2}=\mu_{f}.$$

$$D_{f}(\bar{x},x)\geq\mu D(\bar{x},x)\text{ for all }x\in X.\tag{10}$$

$$f(x)-f^{\star}\geq\mu D(\bar{x},x)\text{ for all }x\in X.\tag{11}$$

$$\mu_{f,X,D}^{\star}:=\sup\{\mu\geq 0:\text{(10) holds}\},\qquad\mu_{f,X,D}^{\sharp}:=\sup\{\mu\geq 0:\text{(11) holds}\}.\tag{12}$$

$$D_{f}(\bar{x},x)\geq D_{f}(Z_{f,X}(\bar{x}),x)\geq\mu D(Z_{f,X}(\bar{x}),x)=\mu D(\bar{x},x)\text{ for all }x\in X.$$

Full text

David H. Gutman, Department of Industrial, Manufacturing, and Systems Engineering, Texas Tech University, USA, [email protected]

Javier F. Peña, Tepper School of Business, Carnegie Mellon University, USA, [email protected]

Abstract

The condition number of a differentiable convex function, namely the ratio of its smoothness to strong convexity constants, is closely tied to fundamental properties of the function. In particular, the condition number of a quadratic convex function is the square of the aspect ratio of a canonical ellipsoid associated to the function. Furthermore, the condition number of a function bounds the linear rate of convergence of the gradient descent algorithm for unconstrained convex minimization.

We propose a condition number of a differentiable convex function relative to a reference convex set and distance function pair. This relative condition number is defined as the ratio of relative smoothness to relative strong convexity constants. We show that the relative condition number extends the main properties of the traditional condition number both in terms of its geometric insight and in terms of its role in characterizing the linear convergence of first-order methods for constrained convex minimization.

When the reference set $X$ is a convex cone or a polyhedron and the function $f$ is of the form $f=g\circ A$ , we provide characterizations of and bounds on the condition number of $f$ relative to $X$ in terms of the usual condition number of $g$ and a suitable condition number of the pair $(A,X)$ .

1 Introduction

Let $f:{\mathbb{R}}^{m}\to{\mathbb{R}}\cup\{\infty\}$ be a convex differentiable function. The condition number of $f$ is the ratio $L_{f}/\mu_{f}$ where $L_{f}$ and $\mu_{f}$ are respectively the smoothness and strong convexity constants of the function $f$ . See Definition 1 and equation (9) below. The condition number $L_{f}/\mu_{f}$ is closely tied to a number of fundamental properties of the function $f$ . In the special case when $f$ is a quadratic convex function the condition number has the following geometric interpretation. Suppose $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ where $A\in{\mathbb{R}}^{n\times n}$ is non-singular. Then the condition number of $f$ is

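$$\frac{L_{f}}{\mu_{f}}=\|A^{\text{\sf T}}A\|\cdot\|(A^{\text{\sf T}}A)^{-1}\|=\left(\|A\|\cdot\|A^{-1}\|\right)^{2}.$$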

The latter quantity is the square of the aspect ratio of the ellipsoid $A({\mathbb{B}}^{n}):=\{Ax:x\in{\mathbb{R}}^{n},\|x\|_{2}\leq 1\}$ since $\|A\|$ and $1/\|A^{-1}\|$ are respectively the radius of the smallest ball that contains $A({\mathbb{B}}^{n})$ and the radius of the largest ball contained in $A({\mathbb{B}}^{n})$ .
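The identity between the condition number and the squared aspect ratio can be checked numerically on a small case. The following is a minimal sketch, assuming the spectral norm and a hypothetical $2\times 2$ matrix chosen only for illustration; the helper `singular_values_2x2` is not from the paper.

```python
import math

def singular_values_2x2(a, b, c, d):
    """Singular values of [[a, b], [c, d]], largest first (closed form)."""
    # The eigenvalues of A^T A are the roots of t^2 - trace * t + det^2 = 0.
    trace = a * a + b * b + c * c + d * d      # trace(A^T A), the squared Frobenius norm
    det = a * d - b * c                        # det(A), so det(A^T A) = det^2
    disc = math.sqrt(max(trace * trace - 4.0 * det * det, 0.0))
    return math.sqrt((trace + disc) / 2.0), math.sqrt((trace - disc) / 2.0)

# Hypothetical nonsingular matrix, used only as an illustration.
s_max, s_min = singular_values_2x2(3.0, 1.0, 0.0, 2.0)

L_f = s_max ** 2                # smoothness constant of f(x) = 0.5 * ||Ax - b||_2^2
mu_f = s_min ** 2               # strong convexity constant of the same f
aspect_ratio = s_max / s_min    # ||A|| * ||A^{-1}|| in the spectral norm

print(L_f / mu_f, aspect_ratio ** 2)
```

Here `s_max` and `1 / s_min` play the roles of $\|A\|$ and $\|A^{-1}\|$, so `s_max` and `s_min` are the radii of the smallest enclosing ball and the largest enclosed ball of the ellipsoid $A({\mathbb{B}}^{n})$.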

The condition number $L_{f}/\mu_{f}$ also bounds the linear convergence rate of the gradient descent algorithm for the unconstrained minimization problem

$$f^{\star}:=\min_{x\in{\mathbb{R}}^{m}}f(x).\tag{1}$$

More precisely, for a suitable choice of step sizes the iterates $x_{k},\;k=0,1,\dots$ generated by the gradient descent algorithm satisfy

$$\|X^{\star}-x_{k}\|_{2}^{2}\leq\left(1-\frac{\mu_{f}}{L_{f}}\right)^{k}\|X^{\star}-x_{0}\|_{2}^{2}$$

and

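$$f(x_{k})-f^{\star}\leq\frac{L_{f}}{2}\left(1-\frac{\mu_{f}}{L_{f}}\right)^{k}\|X^{\star}-x_{0}\|_{2}^{2},$$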

where $X^{\star}:=\{x\in{\mathbb{R}}^{n}:f(x)=f^{\star}\}$ and $\|X^{\star}-x\|_{2}=\inf_{y\in X^{\star}}\|y-x\|_{2}$ . The articles [4, 8, 17, 21, 22, 23, 24], among others, discuss the above type of linear convergence and a number of interesting related developments. In particular, Necoara, Nesterov and Glineur [22] establish linear convergence properties for a wide class of first-order methods under assumptions that are relaxations of strong convexity.
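The two gradient descent bounds above can be observed on a small quadratic. The sketch below assumes a constant step size $1/L_{f}$ (one common choice; the text only says "a suitable choice of step sizes") and a hypothetical diagonal instance, for which $X^{\star}$ is a single point and $f^{\star}=0$.

```python
# Hypothetical diagonal instance: f(x) = 0.5 * sum_i (a_i * x_i - b_i)^2.
a = [1.0, 2.0, 5.0]
b = [1.0, -4.0, 10.0]

L_f = max(ai * ai for ai in a)      # largest eigenvalue of A^T A
mu_f = min(ai * ai for ai in a)     # smallest eigenvalue of A^T A
x_star = [bi / ai for ai, bi in zip(a, b)]   # unique minimizer, so f_star = 0

def f(x):
    return 0.5 * sum((ai * xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))

def grad(x):
    return [ai * (ai * xi - bi) for ai, xi, bi in zip(a, x, b)]

def sq_dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

x = [0.0, 0.0, 0.0]
d0 = sq_dist(x_star, x)
rate = 1.0 - mu_f / L_f
history = []   # (k, squared distance, objective gap, distance bound, gap bound)
for k in range(60):
    history.append((k, sq_dist(x_star, x), f(x),
                    rate ** k * d0, 0.5 * L_f * rate ** k * d0))
    g = grad(x)
    x = [xi - gi / L_f for xi, gi in zip(x, g)]   # assumed constant step size 1/L_f

for k, dist, gap, dist_bound, gap_bound in history[::10]:
    print(k, dist, dist_bound, gap, gap_bound)
```

Each printed row shows the measured squared distance and objective gap next to the corresponding right-hand side, so the geometric decay at rate $1-\mu_{f}/L_{f}$ is visible directly.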

Let $f:{\mathbb{R}}^{m}\to{\mathbb{R}}\cup\{\infty\}$ be a convex differentiable function, $X\subseteq{\mathrm{dom}}(f)$ be a convex set, and $D:X\times X\rightarrow{\mathbb{R}}_{+}$ be a distance-like function, that is, $D(y,x)\geq 0$ and $D(x,x)=0$ for all $x,y\in X$ . We propose a relative smoothness constant $L_{f,X,D}$ and a relative strong convexity constant $\mu_{f,X,D}$ of the function $f$ relative to the pair $(X,D)$ . See Definition 2 and equation (8) below for details. We show that the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ extends the above properties of the traditional condition number $L_{f}/\mu_{f}$ both in terms of its geometric insight and in terms of its role in characterizing the linear convergence of first-order methods for the constrained convex minimization problem

$$f^{\star}:=\min_{x\in X}f(x).\tag{2}$$

As Example 1 illustrates, the relative condition number depends on the combination of the constraint set $X$ and the function $f$. In particular, Example 1 shows that the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ can differ vastly from the usual condition number $L_{f}/\mu_{f}$, in either direction, depending on how the shape of $X$ fits $f$. Example 1 also shows that $\mu_{f,X,D}$ can be strictly positive in cases when $\mu_{f}=0$. Our main results highlight deeper connections between the relative constants and geometric features of the set $X$. In particular, when $f=g\circ A$ for some matrix $A\in{\mathbb{R}}^{m\times n}$ and $g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$, and $X$ is conic or polyhedral, we provide characterizations of and bounds on $L_{f,X,D}$ and $\mu_{f,X,D}$ in terms of $L_{g}$ and $\mu_{g}$ and some condition properties of the pair $(A,X)$.

We show that the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ and some related quantities readily yield linear convergence rates for the mirror descent, Frank-Wolfe, and Frank-Wolfe with away steps algorithms for the constrained minimization problem (2). These linear convergence properties have been previously established in [3, 2, 18, 20, 13, 22, 24, 28, 32] under various kinds of assumptions. Our approach shows that all of these linear convergence results hinge on a similar type of relative conditioning, and it also reveals that several of them can be sharpened. We show that the linear convergence of the mirror descent algorithm (Proposition 6 and Proposition 7) holds for a sharper rate and under more general assumptions than those in [20, 32]. More precisely, Proposition 6 and Proposition 7 show that linear convergence holds under new conditions of relative quasi-strong convexity and relative functional growth that are typically weaker than the type of relative strong convexity assumed in [20, 32]. In contrast to the previous results in [3, 13], our linear convergence result for the Frank-Wolfe algorithm (Proposition 8) is stated in terms of an affine invariant relative condition number defined via a natural radial distance function. Our approach based on the relative condition number yields a proof of linear convergence for the Frank-Wolfe with away steps algorithm that is significantly shorter and simpler than, and at least as sharp as, the ones previously presented in [2, 18, 28]. Unlike previous approaches, our proof of linear convergence of the Frank-Wolfe with away steps algorithm (Proposition 9) highlights some similarities with the proof of linear convergence of the regular Frank-Wolfe algorithm (Proposition 8). Like the results presented in [18, Appendix C and D], the linear convergence of the Frank-Wolfe with away steps algorithm (Proposition 9) is stated in terms of an affine invariant relative condition number.

The relative constants $L_{f,X,D}$ and $\mu_{f,X,D}$ are defined globally. In particular, they do not depend on any specific point in $X$ . We consider several variants of relative strong convexity following the constructions of Necoara, Nesterov and Glineur [22]. In particular, we define a relative quasi-strong convexity constant $\mu_{f,X,D}^{\star}$ and a relative functional growth constant $\mu_{f,X,D}^{\sharp}$ . See Definition 3 and equation (12). Unlike $\mu_{f,X,D}$ , the constants $\mu_{f,X,D}^{\star}$ and $\mu_{f,X,D}^{\sharp}$ depend on the set of minimizers $X^{\star}$ of $f$ on $X$ . We show that relative quasi-strong convexity is a relaxation of relative strong convexity. We also show that under suitable assumptions relative functional growth is a relaxation of relative quasi-strong convexity. Not surprisingly, there are classes of non-strongly convex functions for which the constant $\mu_{f,X,D}^{\sharp}$ is positive while $\mu_{f,X,D}$ and $\mu_{f,X,D}^{\star}$ may not be. (See Theorem 4.)

Our work draws on and connects several seemingly unrelated threads of research on first-order methods [1, 2, 18, 20, 22, 28, 32] and on condition measures for convex optimization [10, 9, 12, 11, 19, 25, 27, 30, 31]. Our construction of $L_{f,X,D}$ and $\mu_{f,X,D}$ is inspired by and closely related to the work of Lu, Freund, and Nesterov [20] and of Bauschke, Bolte, and Teboulle [1, 32]. Lu et al. [20] extend the concepts of smoothness and strong convexity constants by considering them relative to a reference function $h$, see [20, Definition 1.1 and 1.2]. Our construction is identical to theirs in the special case when the distance function is the Bregman distance function $D_{h}$ associated to a reference function $h$ and the function $f$ is strictly convex. Bauschke, Bolte, and Teboulle [1] define a concept of Lipschitz-like condition that is equivalent to smoothness relative to a reference function. As we detail in Section 5, our relative constants $L_{f,X,D}$ and $\mu_{f,X,D}$ are also identical to the curvature constant, away curvature constant and geometric strong convexity constant proposed by Jaggi [16] and by Lacoste-Julien and Jaggi in [18, Appendix C] for properly chosen distance-like functions $D$. Our constructions of relative functional growth and relative quasi-strong convexity are natural extensions of analogous concepts proposed by Necoara, Nesterov, and Glineur [22] to unveil relaxations of strong convexity that ensure the linear convergence of first-order methods. Our relative functional growth concept is in the same spirit as the quadratic functional growth approach used by Beck and Shtern [2] to establish the linear convergence of a conditional gradient algorithm with away steps for non-strongly convex functions.

In contrast to the approaches in [2, 18, 20, 22, 28], our construction of the relative condition constants applies to any pair $(X,D)$ of reference set and distance function. Our main results (Section 3 and Section 4) reveal some interesting insights when $D$ is bounded by a squared norm. We establish a close connection between our relative conditioning approach and the conditioning of linear conic systems pioneered by Renegar [30, 31] and further developed by a number of authors [6, 10, 9, 12, 11, 19, 25, 27, 26]. We especially draw on ideas developed in the recent paper [26]. We note that consistent with our construction of the relative constants $L_{f,X,D},\;\mu_{f,X,D},\;\mu_{f,X,D}^{\star},\;\mu^{\sharp}_{f,X,D}$ , all of our results concerning them scale appropriately, that is, they scale by $\lambda$ whenever the objective function $f$ is replaced by $\tilde{f}=\lambda f$ for some constant $\lambda>0$ . In particular, the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ and all of our bounds on it are invariant under positive scaling of $f$ .

The main sections of the paper are organized as follows. Section 2 presents our central construction, namely relative smoothness and relative strong convexity. This section also introduces relative quasi strong convexity and relative functional growth, both of which are variants of relative strong convexity.

Section 3 and Section 4 present the main technical results of the paper. Section 3 develops several properties of the constants $L_{f,X,D}$ and $\mu_{f,X,D}$. More precisely, Proposition 2 gives an upper bound on $L_{f,X,D}$ when $f$ is of the form $g\circ A$ for some $A\in{\mathbb{R}}^{m\times n},\;g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$. Proposition 2(a) shows that the bound is tight. The more involved Theorem 1 and Theorem 2 give lower bounds on $\mu_{f,X,D}$ when $f$ is of the form $g\circ A$ and $X$ is a convex cone or a polyhedron. These bounds readily imply that for $f=g\circ A$ the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ can be bounded in terms of the product of the classical condition number $L_{g}/\mu_{g}$ and a condition number of the pair $(A,X)$. See equation (21) and equation (24). Corollary 1 and Corollary 2 show that the bounds in Theorem 1 and Theorem 2 are tight. Section 4 develops properties analogous to those in Section 3 but for the constants $\mu^{\star}_{f,X,D}$ and $\mu^{\sharp}_{f,X,D}$.

Section 5 details linear convergence results for the mirror descent algorithm, Frank-Wolfe algorithm, and Frank-Wolfe with away steps algorithm for problem (2). In all cases the linear convergence properties are stated in terms of the relative constants $L_{f,X,D}$ and $\mu^{\star}_{f,X,D},\mu^{\sharp}_{f,X,D}$ for suitable choices of distance-like function $D$. The main results in Section 5 can be summarized as follows. Consider the mirror descent algorithm for problem (2) with a Bregman distance $D_{h}$ associated to a reference function $h:X\rightarrow{\mathbb{R}}$. Proposition 6 shows the following linear convergence result: if $L_{f,X,D_{h}}<\infty$ and $\mu_{f,X,D_{h}}^{\star}>0$ then the mirror descent iterates satisfy

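$$f(x_{k})-f^{\star}\leq L_{f,X,D_{h}}\left(1-\frac{\mu_{f,X,D_{h}}^{\star}}{L_{f,X,D_{h}}}\right)^{k}D_{h}(x^{\star},x_{0})$$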

for $x^{\star}\in\operatorname*{argmin}_{x\in X}f(x)$ . Proposition 7 gives a linear convergence result of similar flavor when $\mu_{f,X,D_{h}}^{\sharp}>0$ . The rates of convergence in both Proposition 6 and Proposition 7 are at least as sharp, and possibly much sharper, than those in [20, 32] and apply to a broader class of functions. In particular, as Example 7 in Section 4 shows, there are instances where $\mu_{f,X,D}^{\sharp}>\mu_{f,X,D}=0$ occurs. In such instances Proposition 7 yields the linear convergence of mirror descent whereas the linear convergence results in [20, 32] do not apply.
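A bound of this shape can be observed on a toy instance. The sketch below makes several assumptions that are not part of Proposition 6 as summarized here: the reference function is $h=\frac{1}{2}\|\cdot\|_{2}^{2}$, in which case a mirror descent step reduces to a projected gradient step; the instance is a hypothetical separable quadratic over a box; and the classical constants $L_{f}$ and $\mu_{f}$ are used as conservative stand-ins for $L_{f,X,D_{h}}$ and $\mu_{f,X,D_{h}}^{\star}$, both in the step size and in the bound. The comparison starts at $k=1$.

```python
# Hypothetical instance: f(x) = 0.5 * sum_i a_i^2 * (x_i - c_i)^2 on the box X = [0, 1]^2.
a = [1.0, 2.0]
c = [2.0, -1.0]

def f(x):
    return 0.5 * sum(ai * ai * (xi - ci) ** 2 for ai, xi, ci in zip(a, x, c))

def clip01(t):
    return min(max(t, 0.0), 1.0)

# Classical constants, used as conservative substitutes for the relative ones.
L = max(ai * ai for ai in a)
mu = min(ai * ai for ai in a)

x_star = [clip01(ci) for ci in c]    # separable problem: clip each unconstrained minimizer
f_star = f(x_star)

x0 = [0.0, 1.0]
D0 = 0.5 * sum((xs - xi) ** 2 for xs, xi in zip(x_star, x0))   # D_h(x_star, x0)

x = list(x0)
gaps, bounds = [], []
for k in range(1, 21):
    # With h = 0.5 * ||.||_2^2 the mirror step is a gradient step followed by
    # Euclidean projection onto X (coordinatewise clipping for a box).
    x = [clip01(xi - ai * ai * (xi - ci) / L) for ai, xi, ci in zip(a, x, c)]
    gaps.append(f(x) - f_star)
    bounds.append(L * (1.0 - mu / L) ** k * D0)

print(gaps[:4])
print(bounds[:4])
```

Because the substitutes satisfy $L_{f}\geq L_{f,X,D_{h}}$ and $\mu_{f}\leq\mu_{f,X,D_{h}}^{\star}$, the printed bound is looser than the one stated with the relative constants; the sketch only illustrates the form of the estimate.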

Proposition 8 gives a strikingly similar linear convergence result for the Frank-Wolfe algorithm: suppose $X$ is a compact convex set endowed with a linear oracle and $L_{f,X,\mathfrak{R}}<\infty$ and $\mu_{f,X,\mathfrak{R}}^{\star}>0$ for the radial distance function $\mathfrak{R}:X\times X\rightarrow{\mathbb{R}}_{+}$ defined via (46). Proposition 8 shows that the Frank-Wolfe iterates satisfy

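$$f(x_{k})-f^{\star}\leq\left(1-\frac{\mu_{f,X,\mathfrak{R}}^{\star}}{L_{f,X,\mathfrak{R}}}\right)^{k}(f(x_{0})-f^{\star}).$$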

This rate of convergence subsumes and is sharper than the previously known linear convergence results for the Frank-Wolfe algorithm in [13, 3].
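For readers unfamiliar with the setting, the following minimal sketch runs a Frank-Wolfe iteration on a compact convex set with a linear oracle. Everything specific in it is an assumption made for illustration: the box $X=[0,1]^{2}$, the objective $f(x)=\frac{1}{2}\|x-c\|_{2}^{2}$ with a hypothetical point $c$ inside $X$, and the exact line search (this section does not state the step-size rule used in Proposition 8). It does not compute the constants relative to $\mathfrak{R}$; it only records the objective gap.

```python
# Hypothetical instance: f(x) = 0.5 * ||x - c||_2^2 on the box X = [0, 1]^2, c inside X.
c = [0.3, 0.6]

def f(x):
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def grad(x):
    return [xi - ci for xi, ci in zip(x, c)]

def linear_oracle(g):
    """Vertex of the box minimizing <g, u>: take 0 where g_i > 0 and 1 otherwise."""
    return [0.0 if gi > 0.0 else 1.0 for gi in g]

x = [0.0, 0.0]        # start at a vertex of X
f_star = 0.0          # attained at x = c because c lies in X
gaps = [f(x) - f_star]
for _ in range(500):
    g = grad(x)
    u = linear_oracle(g)
    d = [ui - xi for ui, xi in zip(u, x)]
    dd = sum(di * di for di in d)
    if dd == 0.0:
        break
    # Exact line search for a quadratic with identity Hessian, clipped to [0, 1].
    step = min(max(-sum(gi * di for gi, di in zip(g, d)) / dd, 0.0), 1.0)
    x = [xi + step * di for xi, di in zip(x, d)]
    gaps.append(f(x) - f_star)

print(len(gaps), gaps[0], gaps[-1])
```

Each iteration only calls the linear oracle and moves toward the returned vertex, which is the structure the convergence statement refers to.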

Proposition 9 gives a result of similar flavor for the Frank-Wolfe with away steps algorithm: suppose $X$ is a polytope endowed with a vertex linear oracle, and $L_{f,X,\mathfrak{D}}<\infty$ and $\mu_{f,X,\mathfrak{G}}^{\star}>0$ for the distance functions $\mathfrak{D}:X\times X\rightarrow{\mathbb{R}}_{+}$ and $\mathfrak{G}:X\times X\rightarrow{\mathbb{R}}_{+}$ defined via (49) and (51). Proposition 9 shows that if the Frank-Wolfe with away steps algorithm starts from a vertex in $X$ then the subsequent iterates satisfy

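$$f(x_{k})-f^{\star}\leq\left(1-\min\left\{\frac{1}{2},\frac{\mu_{f,X,\mathfrak{G}}^{\star}}{4L_{f,X,\mathfrak{D}}}\right\}\right)^{k/2}(f(x_{0})-f^{\star}).$$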

This rate of convergence is at least as sharp, and possibly much sharper, than the rates previously shown in [2, 18, 28].

Throughout the paper we define a number of new objects that are necessary for our main developments. To help the reader recall the definition and notation associated to these new objects, Table 1 displays the section and equation where each object is defined.

2 Conditioning relative to a reference set and distance function pair

This section presents the central ideas of this paper. We introduce the concepts of relative smoothness and relative strong convexity of a function relative to a reference set and distance function pair. We also introduce some variants of relative strong convexity that are natural extensions of the approach developed by Necoara, Nesterov and Glineur [22].

Throughout the entire paper we will make the following blanket assumption about the triple $(f,X,D)$ .

Assumption 1.

The function $f:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is convex and differentiable. The set $X\subseteq{\mathrm{dom}}(f)$ is convex. The function $D:X\times X\rightarrow{\mathbb{R}}_{+}$ is a reference distance-like function, that is, $D(y,x)\geq 0$ for all $x,y\in X$ and $D(x,x)=0$ for all $x\in X$.

Throughout our developments we will consider the following classes of reference distance-like functions:

• The Bregman distance $D_{h}:X\times X\rightarrow{\mathbb{R}}_{+}$ associated to a reference convex differentiable function $h:X\rightarrow{\mathbb{R}}$, that is,

$$D_{h}(y,x):=h(y)-h(x)-\langle\nabla h(x),y-x\rangle.$$

• The square of a (not necessarily Euclidean) norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$, that is,

$$D(y,x):=\frac{1}{2}\|y-x\|^{2}.$$

• The square $\mathfrak{R}:=\frac{\mathfrak{r}^{2}}{2}$ of the radial distance function $\mathfrak{r}:X\times X\rightarrow{\mathbb{R}}_{+}$ defined as follows

$$\mathfrak{r}(y,x):=\inf\{\rho>0:y-x=\rho\cdot(u-x)\text{ for some }u\in X\}.$$

Notice that the function $v\mapsto\mathfrak{r}(x+v,x)$ coincides with the gauge function of the set $X-x$ on $X-x$ . Figure 1 illustrates the level sets defined by $\mathfrak{r}(\cdot,x)$ for $X=\{x\in{\mathbb{R}}^{2}:\|x\|_{2}\leq 1\}$ .

• The square $\mathfrak{D}:=\frac{\mathfrak{d}^{2}}{2}$ of the diametral distance function $\mathfrak{d}:X\times X\rightarrow{\mathbb{R}}_{+}$ defined as follows

$$\mathfrak{d}(y,x):=\inf\{\delta>0:y-x=\delta\cdot(u-v)\text{ for some }u,v\in X\}.$$

Figure 2 illustrates the level sets defined by the diametral distance $\mathfrak{d}(\cdot,x)$ for $X=\{x\in{\mathbb{R}}^{2}:\|x\|_{2}\leq 1\}$ .
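For the unit Euclidean ball used in the two figures, both the radial and the diametral distances admit closed forms, which the following sketch evaluates. The formulas are derived here for this particular $X$ only and are an illustration rather than part of the paper; the function names are hypothetical. The radial distance comes from the largest $t\geq 0$ with $\|x+t(y-x)\|_{2}\leq 1$, and the diametral distance from the fact that the differences $u-v$ with $u,v\in X$ fill the ball of radius 2.

```python
import math

def radial_distance(y, x):
    """r(y, x) for X = closed unit Euclidean ball; assumes x and y both lie in X."""
    v = [yi - xi for yi, xi in zip(y, x)]
    vv = sum(vi * vi for vi in v)
    if vv == 0.0:
        return 0.0                      # y = x: every rho > 0 works, so the infimum is 0
    xv = sum(xi * vi for xi, vi in zip(x, v))
    xx = sum(xi * xi for xi in x)
    # Largest t >= 0 with ||x + t v||_2 <= 1; then u = x + t v and rho = 1 / t.
    t = (-xv + math.sqrt(max(xv * xv + vv * (1.0 - xx), 0.0))) / vv
    return 1.0 / t if t > 0.0 else math.inf

def diametral_distance(y, x):
    """d(y, x) for the same X: the smallest delta with y - x in delta * (X - X)."""
    return 0.5 * math.sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x)))

# Same pair of points measured from two different base points x.
for y, x in [((0.5, 0.0), (0.0, 0.0)), ((1.0, 0.0), (0.5, 0.0)), ((0.0, 0.0), (0.5, 0.0))]:
    r, d = radial_distance(y, x), diametral_distance(y, x)
    print(y, x, r, d, 0.5 * r * r, 0.5 * d * d)   # last two columns: the squares r^2/2 and d^2/2
```

The radial distance depends on where $x$ sits inside $X$, whereas for the ball the diametral distance depends only on $y-x$.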

Our main construction is based on bounding the behavior of the Bregman distance associated to $f$ in terms of the reference distance function $D$ . The following set-valued mapping $Z_{f,X}:X\rightrightarrows X$ provides a key building block for our construction. For $y\in X$ let $Z_{f,X}(y)\subseteq X$ denote the set

$$Z_{f,X}(y):=\{x\in X:f(x)=f(y)\text{ and }\langle\nabla f(x)-\nabla f(y),x-y\rangle=0\}.\tag{3}$$

It is easy to see that $Z_{f,X}(y)$ can also be written as

$$Z_{f,X}(y)=\{x\in X:f(x+\lambda(y-x))=f(y)\text{ for all }\lambda\in[0,1]\}.$$

Observe that if $f$ is strictly convex then $Z_{f,X}(y)=\{y\}$ for all $y\in X$ . The set $Z_{f,X}(y)$ captures the largest convex subset of $\{x\in X:f(x)=f(y)\}$ that includes $y$ and where $f$ fails to be strictly convex. In particular, when $f$ is of the form $f=g\circ A$ for $A\in{\mathbb{R}}^{m\times n}$ and $g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$ strictly convex, it is easy to see that $Z_{f,X}(y)=\{x\in X:Ax=Ay\}$ . We will further discuss functions of this form in Section 3 and Section 4. To illustrate the set-valued mapping $Z_{f,X}$ in a different example, consider the function $f:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}$ defined as

$$f(x):=\min_{y\in{\mathbb{B}}^{n}}\|x-y\|_{2}^{2},$$

where ${\mathbb{B}}^{n}=\{y\in{\mathbb{R}}^{n}:\|y\|_{2}\leq 1\}.$ In this case

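$$Z_{f,X}(y)=\left\{\begin{array}{ll}\{y\}&\text{if }y\not\in{\mathbb{B}}^{n}\\ {\mathbb{B}}^{n}&\text{if }y\in{\mathbb{B}}^{n}.\end{array}\right.$$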
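The segment characterization of $Z_{f,X}(y)$ suggests a simple numerical membership test for this example. The sketch below assumes $X={\mathbb{R}}^{n}$ (the passage does not name $X$ for this example) and checks the condition $f(x+\lambda(y-x))=f(y)$ only at finitely many sampled values of $\lambda$, so it is an approximate check; the name `in_Z` and the tolerance are hypothetical.

```python
import math

def f(x):
    """Squared Euclidean distance from x to the unit ball B^n."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return max(norm - 1.0, 0.0) ** 2

def in_Z(x, y, samples=101, tol=1e-12):
    """Sampled test of whether f equals f(y) along the whole segment from x to y."""
    fy = f(y)
    for i in range(samples):
        lam = i / (samples - 1)
        point = [xi + lam * (yi - xi) for xi, yi in zip(x, y)]
        if abs(f(point) - fy) > tol:
            return False
    return True

print(in_Z([0.3, -0.2], [0.0, 0.9]))   # x and y both inside the ball
print(in_Z([0.0, 2.0], [2.0, 0.0]))    # f(x) = f(y), but f is smaller inside the segment
print(in_Z([2.0, 0.0], [2.0, 0.0]))    # x = y outside the ball
```

The second case shows why equality of function values alone does not place $x$ in $Z_{f,X}(y)$: the function has to stay constant along the segment.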

2.1 Relative smoothness and relative strong convexity

To motivate our main construction we first recall the classical notion of smoothness and strong convexity constants. We recall these classical concepts in a format that we subsequently use for our main construction. Recall that for a convex differentiable function $f:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}\cup\{\infty\}$ and $x,y\in{\mathrm{dom}}(f)$ the Bregman distance $D_{f}(y,x)$ is

$$D_{f}(y,x)=f(y)-f(x)-\langle\nabla f(x),y-x\rangle.$$
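A Bregman distance can be evaluated directly from a function and its gradient. The sketch below does this for two reference functions chosen for illustration: half the squared Euclidean norm, which gives $\frac{1}{2}\|y-x\|_{2}^{2}$, and the negative entropy on the positive orthant, whose Bregman distance is the generalized Kullback-Leibler divergence. The helper names and sample points are hypothetical.

```python
import math

def bregman(h, grad_h, y, x):
    """D_h(y, x) = h(y) - h(x) - <grad h(x), y - x>."""
    inner = sum(g * (yi - xi) for g, yi, xi in zip(grad_h(x), y, x))
    return h(y) - h(x) - inner

# Reference function 1: h(x) = 0.5 * ||x||_2^2.
def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def grad_half_sq_norm(x):
    return list(x)

# Reference function 2: negative entropy, defined for strictly positive entries.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def grad_neg_entropy(x):
    return [math.log(xi) + 1.0 for xi in x]

y = [0.2, 0.3, 0.5]
x = [0.6, 0.3, 0.1]
d_euclid = bregman(half_sq_norm, grad_half_sq_norm, y, x)
d_kl = bregman(neg_entropy, grad_neg_entropy, y, x)
d_kl_reverse = bregman(neg_entropy, grad_neg_entropy, x, y)
print(d_euclid, d_kl, d_kl_reverse)
```

The last two printed values differ, which illustrates that a Bregman distance need not be symmetric in its two arguments.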

Definition 1.

Suppose $f:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is convex and differentiable and $D(y,x)=\frac{1}{2}\|y-x\|^{2}$ for some norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ .

(a) The function $f$ is smooth for the norm $\|\cdot\|$ if there exists a constant $L>0$ such that

$$D_{f}(y,x)\leq LD(y,x)\text{ for all }x,y\in{\mathrm{dom}}(f).\tag{4}$$

(b) The function $f$ is strongly convex for the norm $\|\cdot\|$ if there exists a constant $\mu>0$ such that

$$D_{f}(y,x)\geq\mu D(y,x)\text{ for all }x,y\in{\mathrm{dom}}(f).\tag{5}$$

Next, we present our main construction. In Definition 2 and throughout the paper we will use the following notational convention. For a nonempty $S\subseteq X$ and $x\in X$ let $D_{f}(S,x)$ and $D(S,x)$ denote $\inf_{y\in S}D_{f}(y,x)$ and $\inf_{y\in S}D(y,x)$ respectively.

Definition 2.

Let $(f,X,D)$ satisfy Assumption 1.

(a) We say that $f$ is smooth relative to $(X,D)$ if there exists a constant $L>0$ such that

$$D_{f}(y,x)\leq LD(y,x)\text{ for all }x,y\in X.\tag{6}$$

(b) We say that $f$ is strongly convex relative to $(X,D)$ if there exists a constant $\mu>0$ such that

$$D_{f}(Z_{f,X}(y),x)\geq\mu D(Z_{f,X}(y),x)\text{ for all }x,y\in X.\tag{7}$$

When $D=D_{h}$ for some convex differentiable function $h:X\rightarrow{\mathbb{R}}$ , the above relative smoothness concept is identical to the smoothness of $f$ relative to $h$ on $X$ as defined in [20]. The latter in turn is equivalent to the Lipschitz-like condition defined in [1]. Furthermore, when $D=D_{h}$ and $f$ is strictly convex, the above relative strong convexity concept is identical to the strong convexity of $f$ relative to $h$ on $X$ as defined in [20]. We note that as in [20], the above definitions (6) and (7) are not symmetric in $x$ and $y$ since they depend on $D_{h}$ and $D$ which are not necessarily symmetric. Observe that the term $Z_{f,X}(y)$ instead of $y$ in (7) makes this definition of relative strong convexity less stringent than the classical one (5) or the one in [20]. This is a key feature of our construction.
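The effect of using $Z_{f,X}(y)$ in place of $y$ can be seen on a function that is convex but not strongly convex. The sketch below takes the hypothetical instance $f(x)=\frac{1}{2}x_{1}^{2}$ on $X={\mathbb{R}}^{2}$ with $D(y,x)=\frac{1}{2}\|y-x\|_{2}^{2}$. Since $f=g\circ A$ with $A=[1\;\,0]$ and $g$ strictly convex, $Z_{f,X}(y)=\{z:z_{1}=y_{1}\}$. The sampled ratios only estimate the constants from finitely many pairs.

```python
import random

def bregman_f(y, x):
    """D_f(y, x) for f(x) = 0.5 * x_1^2 on R^2; only the first coordinate matters."""
    return 0.5 * (y[0] - x[0]) ** 2

def dist(y, x):
    """Reference distance D(y, x) = 0.5 * ||y - x||_2^2."""
    return 0.5 * ((y[0] - x[0]) ** 2 + (y[1] - x[1]) ** 2)

def closest_point_in_Z(y, x):
    """Point of Z_{f,X}(y) = {z : z_1 = y_1} closest to x; D_f(., x) is constant on this set."""
    return (y[0], x[1])

random.seed(0)
pairs = [((0.0, 1.0), (0.0, 0.0))]       # y - x is parallel to the flat direction e_2
pairs += [((random.uniform(-1, 1), random.uniform(-1, 1)),
           (random.uniform(-1, 1), random.uniform(-1, 1))) for _ in range(1000)]

# Classical inequality (y itself): the best possible mu is 0.
classical = min(bregman_f(y, x) / dist(y, x) for y, x in pairs)

# Relative inequality (Z_{f,X}(y) in place of y): the ratio is 1 whenever it is defined.
relative = []
for y, x in pairs:
    z = closest_point_in_Z(y, x)
    if dist(z, x) > 1e-9:                # skip pairs where both sides are numerically zero
        relative.append(bregman_f(z, x) / dist(z, x))

print(classical, min(relative), max(relative))
```

On this instance the classical ratio reaches 0 along the flat direction, while the relative ratio stays at 1, consistent with the remark that the relative notion is less stringent.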

We will use the following notation throughout the rest of the paper. Suppose $(f,X,D)$ satisfies Assumption 1. Let $L_{f,X,D}$ and $\mu_{f,X,D}$ be the following relative smoothness and strong convexity constants

$$L_{f,X,D}:=\inf\{L>0:\text{(6) holds}\},\qquad\mu_{f,X,D}:=\sup\{\mu\geq 0:\text{(7) holds}\}.\tag{8}$$

In addition, suppose $f:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is convex and differentiable and $D(y,x)=\frac{1}{2}\|y-x\|^{2}$ for some norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ . Let $L_{f}$ and $\mu_{f}$ be the following classical smoothness and strong convexity constants

$$L_{f}:=\inf\{L>0:\text{(4) holds}\},\qquad\mu_{f}:=\sup\{\mu\geq 0:\text{(5) holds}\}.\tag{9}$$

The following example illustrates the values of the relative smoothness and strong convexity constants $L_{f,X,D}$ and $\mu_{f,X,D}$ of a convex quadratic function relative to $(X,D)$ for some canonical choices of $f,X,$ and $D$. Example 1 highlights that the relative constants $L_{f,X,D}$ and $\mu_{f,X,D}$ depend on the combination of the constraint set $X$ and the function $f$. In particular, Example 1 shows that the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ can differ vastly from the usual condition number $L_{f}/\mu_{f}$, in either direction, depending on how the shape of $X$ fits $f$. Example 1 also lays the groundwork for the main properties that we develop in Section 3.

Example 1.

Let $A\in{\mathbb{R}}^{m\times n},b\in{\mathbb{R}}^{m}$ with $A\neq 0$ and let ${\mathbb{R}}^{m}$ and ${\mathbb{R}}^{n}$ be endowed with the Euclidean norm. Let $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ and $D(y,x)=\frac{1}{2}\|y-x\|_{2}^{2}.$ Then $f$ has the following smoothness and strong convexity constants $L_{f,X,D}$ and $\mu_{f,X,D}$ relative to $(X,D)$ for some particular choices of $X$.

(a) For $X={\mathbb{R}}^{n}$ we have $L_{f,X,D}=\sigma_{\max}(A^{\text{\sf T}}A)=\sigma_{\max}(A)^{2}$ and $\mu_{f,X,D}=\sigma_{\min}^{+}(A^{\text{\sf T}}A)=\sigma_{\min}^{+}(A)^{2}>0$, where $\sigma_{\min}^{+}(\cdot)$ denotes the smallest positive singular value. Observe that in this case $L_{f}=L_{f,X,D}$ but $\mu_{f}=\mu_{f,X,D}$ only when $A$ is full column rank.

(b) Suppose $X\subseteq{\mathbb{R}}^{n}$ is a linear subspace such that the mapping $A|X:X\rightarrow{\mathbb{R}}^{m}$ defined via $x\in X\mapsto Ax\in{\mathbb{R}}^{m}$ is nonzero. Then $L_{f,X,D}=\sigma_{\max}(A|X)^{2}$ and $\mu_{f,X,D}=\sigma_{\min}^{+}(A|X)^{2}$. Observe that in this case $L_{f,X,D}\leq L_{f}$ and $L_{f,X,D}$ can be quite a bit smaller. Likewise, $\mu_{f,X,D}\geq\mu_{f}$ and $\mu_{f,X,D}$ can be quite a bit larger.

For instance, suppose $A=\text{diag}(I_{n-2},M,\epsilon)\in{\mathbb{R}}^{n\times n}$ for some positive $M,\epsilon$ with $0<\epsilon\ll 1\ll M$ . If $X={\mathbb{R}}^{n-2}\times\{0_{2}\}\subseteq{\mathbb{R}}^{n}$ then

$$\mu_{f}=\epsilon^{2}\ll 1=\mu_{f,X,D}=L_{f,X,D}\ll M^{2}=L_{f}.$$

In this case we have $L_{f,X,D}/\mu_{f,X,D}\ll L_{f}/\mu_{f}$ .

(c) Suppose $X={\mathbb{R}}^{n}_{+}$. In this case $L_{f,X,D}=\|A\|^{2}=\sigma_{\max}(A^{\text{\sf T}}A)=L_{f}$. On the other hand, if $A({\mathbb{R}}^{n}_{+})={\mathbb{R}}^{m}$ then $\mu_{f,X,D}$ is the following kind of squared signed smallest singular value of $A$

$$\mu_{f,X,D}=\left(\max\{r:r{\mathbb{B}}^{m}\subseteq A({\mathbb{B}}^{n}\cap{\mathbb{R}}^{n}_{+})\}\right)^{2},$$

where ${\mathbb{B}}^{m}$ and ${\mathbb{B}}^{n}$ denote the unit balls in ${\mathbb{R}}^{m}$ and ${\mathbb{R}}^{n}$ respectively. In other words, $\mu_{f,X,D}$ is the square of the radius of the largest ball centered at zero and contained in $A({\mathbb{B}}^{n}\cap{\mathbb{R}}^{n}_{+})$ . Observe that if $X={\mathbb{R}}^{n}_{+}$ and $A({\mathbb{R}}^{n}_{+})={\mathbb{R}}^{m}$ then $0<\mu_{f,X,D}\leq\sigma_{\min}(A)^{2}$ and $\mu_{f,X,D}$ can be quite a bit smaller. For instance, if $A=\begin{bmatrix}1&-1&0\\ -\epsilon&-\epsilon&1\end{bmatrix}$ for $0<\epsilon\ll 1$ then

$$\mu_{f,X,D}=2\epsilon^{2}\ll 1+2\epsilon^{2}=\sigma_{\min}(A)^{2}=\mu_{f}.$$

In this case we have $L_{f,X,D}/\mu_{f,X,D}\gg L_{f}/\mu_{f}$ .

The statements (a), (b), and (c) in Example 1 can be verified directly but they also follow from the more general Proposition 2, Corollary 1, and Corollary 2 in Section 3 below.
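As a numerical illustration of item (b) in Example 1, the following Python sketch evaluates the classical and relative constants for $A=\text{diag}(I_{n-2},M,\epsilon)$ and $X={\mathbb{R}}^{n-2}\times\{0_{2}\}$ . It is only a sketch: it uses the fact that the singular values of a diagonal matrix are the absolute values of its diagonal entries, and the helper name `squared_extreme_singular_values` as well as the sample values of $n,M,\epsilon$ are illustrative assumptions, not part of the paper.

```python
def squared_extreme_singular_values(diag_entries):
    """Return (sigma_max^2, sigma_min^2) of a diagonal matrix with the
    given nonzero diagonal entries."""
    squares = [d * d for d in diag_entries]
    return max(squares), min(squares)

# Illustrative values with eps << 1 << M (assumed, not from the paper).
n, M, eps = 5, 1e3, 1e-3
diag_A = [1.0] * (n - 2) + [M, eps]

# Classical constants (X = R^n): every diagonal entry matters.
L_f, mu_f = squared_extreme_singular_values(diag_A)

# Relative constants for X = R^{n-2} x {0_2}: A restricted to X only
# sees the first n-2 diagonal entries, which are all equal to one.
L_rel, mu_rel = squared_extreme_singular_values(diag_A[: n - 2])

classical_condition = L_f / mu_f      # equals (M / eps)^2
relative_condition = L_rel / mu_rel   # equals 1
```

For these sample values the classical condition number is of order $10^{12}$ while the relative one equals one, in line with $L_{f,X,D}/\mu_{f,X,D}\ll L_{f}/\mu_{f}$ .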

2.2 Relative quasi strong convexity and relative functional growth

Following [22], we next consider two variants of relative strong convexity that are natural extensions of the quasi-strong convexity and quadratic functional growth concepts defined in [22]. For that purpose, we will rely on the following strengthening of Assumption 1.

Assumption 2.

Suppose $(f,X,D)$ satisfies Assumption 1, $f^{\star}:=\min_{x\in X}f(x)$ is finite, $X^{\star}:=\{x\in X:f(x)=f^{\star}\}\neq\emptyset$ , and the map $x\mapsto\bar{x}:=\operatorname*{argmin}_{y\in X^{\star}}D(y,x)$ is well defined for all $x\in X$ .

Definition 3.

Suppose $(f,X,D)$ satisfies Assumption 2.

(a)

We say that $f$ is quasi-strongly-convex relative to $(X,D)$ if there exists a constant $\mu>0$ such that

[TABLE]

(b)

We say that $f$ has $D$ -relative functional growth on $X$ if there exists a constant $\mu>0$ such that

[TABLE]

Throughout the sequel we will use the following notation analogous to (8). Suppose $(f,X,D)$ satisfies Assumption 2. Let $\mu_{f,X,D}^{\star}$ and $\mu_{f,X,D}^{\sharp}$ be as follows

[TABLE]

The next proposition shows that, as one may intuitively expect, relative quasi-strong convexity is a relaxation of relative strong convexity. In other words, $\mu_{f,X,D}\leq\mu_{f,X,D}^{\star}$ whenever $(f,X,D)$ satisfies Assumption 2.

Proposition 1.

Suppose $(f,X,D)$ satisfies Assumption 2. If $\mu>0$ is such that $(f,X,D,\mu)$ satisfies (7) then $(f,X,D,\mu)$ satisfies (10).

Proof.

The construction of $Z_{f,X}(y)$ implies that $Z_{f,X}(y)=X^{\star}$ for all $y\in X^{\star}$ . Therefore, if $(f,X,D,\mu)$ satisfies (7) then by taking $y=\bar{x}$ it follows that

[TABLE]

∎

The following simple example shows that, perhaps contrary to what one might intuitively expect, relative functional growth is not necessarily a relaxation of relative strong convexity unless some additional assumptions are made about $f,X,$ or $D$ .

Example 2.

Let $a>0$ and $f:{\mathbb{R}}\rightarrow{\mathbb{R}}$ be the function $f(x)=e^{ax}$ . For $X:={\mathbb{R}}_{+}$ we have $X^{\star}=\{0\}$ . Thus for $D:=D_{f}$ and $\mu=1$ the tuple $(f,X,D,\mu)$ satisfies (7). However, observe that for all $\hat{\mu}>0$ and $x\geq 1/(\hat{\mu}a)$

[TABLE]

In particular, $(f,X,D,\hat{\mu})$ does not satisfy (11) for any $\hat{\mu}>0$ .
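The failure of functional growth in Example 2 can also be observed numerically. The sketch below assumes that the growth condition compares the optimality gap $f(x)-f^{\star}$ with a positive multiple of $D(\bar{x},x)$ ; under that assumption, a ratio $(f(x)-f^{\star})/D_{f}(\bar{x},x)$ that tends to zero rules out every $\hat{\mu}>0$ . The function name `growth_ratio` and the sample points are illustrative choices.

```python
import math

def growth_ratio(x, a=1.0):
    """(f(x) - f*) / D_f(0, x) for f(x) = exp(a x) on X = R_+,
    where X* = {0} and f* = f(0) = 1."""
    gap = math.exp(a * x) - 1.0
    # Bregman distance D_f(0, x) = f(0) - f(x) - f'(x) * (0 - x).
    bregman = 1.0 - math.exp(a * x) + a * x * math.exp(a * x)
    return gap / bregman

# The ratio decays roughly like 1 / (a x - 1) as x grows.
ratios = [growth_ratio(x) for x in (2.0, 5.0, 10.0, 20.0, 40.0)]
```

Since the ratio keeps decreasing towards zero, no positive constant can bound it from below on all of $X={\mathbb{R}}_{+}$ .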

It can be shown that under additional assumptions on $f,X,$ or $D$ the relative functional growth condition is a relaxation of the relative strong convexity condition. In particular, relative functional growth is a relaxation of relative strong convexity when $D$ is a squared norm as we discuss in Section 4 below.

3 Properties of $L_{f,X,D}$ and $\mu_{f,X,D}$ when $f$ is of the form $g\circ A$

This section develops some properties of the relative constants $L_{f,X,D}$ and $\mu_{f,X,D}$ when $f$ is of the form $f:=g\circ A$ for $A\in{\mathbb{R}}^{m\times n}$ and $g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\},$ and $D$ is bounded in terms of some norm in ${\mathbb{R}}^{n}$ . The main results of this section are Theorem 1 and Theorem 2. These results provide lower bounds on $\mu_{f,X,D}$ in terms of $\mu_{g}$ and the norms of some canonical set-valued mappings that depend on $A$ and $X$ . In a similar vein, Proposition 2 gives an upper bound on $L_{f,X,D}$ in terms of $L_{g}$ and the norm of a canonical mapping associated to $A$ and $X$ .

We will rely on the objects $Z_{A,X}(\cdot)$ and $A|C,(A|C)^{-1}$ defined next. For $A\in{\mathbb{R}}^{m\times n},\;X\subseteq{\mathbb{R}}^{n}$ nonempty and $y\in X$ let

[TABLE]

The set-valued mapping $Z_{A,X}:X\rightrightarrows X$ can be seen as an extension of the set-valued mapping $Z_{f,X}:X\rightrightarrows X$ introduced in Section 2.1.

For $A\in{\mathbb{R}}^{m\times n}$ and a convex cone $C\subseteq{\mathbb{R}}^{n}$ let $A|C:{\mathbb{R}}^{n}\rightrightarrows{\mathbb{R}}^{m}$ be the set-valued mapping defined via

[TABLE]

and let $(A|C)^{-1}:{\mathbb{R}}^{m}\rightrightarrows{\mathbb{R}}^{n}$ be its inverse, that is,

[TABLE]

Suppose ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are endowed with norms. Define the norms of $A|C$ and of $(A|C)^{-1}$ as follows

[TABLE]

Observe that if $A\in{\mathbb{R}}^{m\times n}$ and $X\subseteq{\mathbb{R}}^{n}$ is a convex set that contains more than one point then

[TABLE]

where $\operatorname{span}(X-X)$ denotes the linear subspace spanned by $X-X$ , that is,

[TABLE]

In particular, the following property of the relative smoothness constant readily follows.

Proposition 2.

Let $A\in{\mathbb{R}}^{m\times n}$ and $X\subseteq{\mathbb{R}}^{n}$ be a convex set that contains more than one point.

(a)

If ${\mathbb{R}}^{m}$ is endowed with the Euclidean norm, $D(y,x)=\frac{1}{2}\|y-x\|^{2}$ for some norm in ${\mathbb{R}}^{n}$ , and $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ for some $b\in{\mathbb{R}}^{m}$ then

[TABLE]

(b)

Suppose ${\mathbb{R}}^{m},{\mathbb{R}}^{n}$ are endowed with norms and $D(y,x)\geq\frac{1}{2}\|y-x\|^{2}$ for the norm in ${\mathbb{R}}^{n}$ . If $f=g\circ A$ where $g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is $L_{g}$ smooth for the norm in ${\mathbb{R}}^{m}$ then

[TABLE]

Proof.

(a)

This follows from (17) and $D_{f}(y,x)=\frac{1}{2}\|Ay-Ax\|^{2}_{2}$ .

(b)

This follows from (17) and $D_{f}(y,x)=D_{g}(Ay,Ax)\leq\frac{L_{g}}{2}\|Ay-Ax\|^{2}$ . The latter inequality follows from the $L_{g}$ smoothness of $g$ .

∎
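The identity $D_{f}(y,x)=\frac{1}{2}\|Ay-Ax\|_{2}^{2}$ used in part (a) of the proof is easy to confirm numerically. The following sketch evaluates both sides for the least-squares function $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ ; the particular $A,b,x,y$ and the helper names are arbitrary illustrative choices.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def f(A, b, x):
    """Least-squares objective (1/2) * ||A x - b||_2^2."""
    r = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    return 0.5 * sum(v * v for v in r)

def grad_f(A, b, x):
    """Gradient A^T (A x - b)."""
    r = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]

def bregman_f(A, b, y, x):
    """D_f(y, x) = f(y) - f(x) - <grad f(x), y - x>."""
    g = grad_f(A, b, x)
    inner = sum(gj * (yj - xj) for gj, yj, xj in zip(g, y, x))
    return f(A, b, y) - f(A, b, x) - inner

A = [[1.0, -1.0, 0.0], [-0.5, -0.5, 1.0]]
b = [0.3, -0.7]
x = [0.2, 1.0, 0.5]
y = [1.5, 0.0, 2.0]

lhs = bregman_f(A, b, y, x)
Ad = matvec(A, [yj - xj for yj, xj in zip(y, x)])
rhs = 0.5 * sum(v * v for v in Ad)   # (1/2) * ||A y - A x||_2^2
```

The two quantities agree up to rounding, and neither depends on $b$ .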

We next discuss far more interesting results that either characterize or lower bound the relative strong convexity constant $\mu_{f,X,D}$ .

3.1 Lower bound on $\mu_{f,X,D}$ when $X$ is a convex cone and $A(X)$ is a linear subspace

In this subsection we will consider the special case when $X\subseteq{\mathbb{R}}^{n}$ is a convex cone and $A\in{\mathbb{R}}^{m\times n}$ is such that $A(X)$ is a linear subspace of ${\mathbb{R}}^{m}$ . The latter condition is equivalent to the following Slater condition: there exists $x\in\mathsf{ri}(X)$ such that $Ax=0$ , where $\mathsf{ri}(X)$ denotes the relative interior of $X$ . When this is the case, the norms $\|A|X\|$ and $\|(A|X)^{-1}\|$ have the following geometric interpretation. Let ${\mathbb{B}}^{m}$ and ${\mathbb{B}}^{n}$ denote the unit balls in ${\mathbb{R}}^{m}$ and ${\mathbb{R}}^{n}$ respectively. It is easy to see that if $X$ is a convex cone and $A(X)$ is a linear subspace then

[TABLE]

and

[TABLE]

In other words, $\|A|X\|$ is the radius of the smallest ball in $A(X)$ centered at the origin that contains $A(X\cap{\mathbb{B}}^{n})$ . Similarly, $1/\|(A|X)^{-1}\|$ is the radius of the largest ball in $A(X)$ centered at the origin and that is contained in $A(X\cap{\mathbb{B}}^{n})$ . Example 3 illustrates this geometric interpretation of $\|A|X\|$ and $1/\|(A|X)^{-1}\|$ in a simple instance.

Example 3.

Let $A:=\begin{bmatrix}1&-1&0\\ -\epsilon&-\epsilon&1\end{bmatrix}$ for $0<\epsilon<1$ and $X={\mathbb{R}}^{3}_{+}$ . Let ${\mathbb{R}}^{2}$ be endowed with the Euclidean $\ell_{2}$ norm and let ${\mathbb{R}}^{3}$ be endowed with the $\ell_{1}$ norm. In this case $A(X)={\mathbb{R}}^{2}$ and

[TABLE]

Therefore $\|A|X\|=\sqrt{1+\epsilon^{2}}$ and $1/\|(A|X)^{-1}\|=\epsilon$ as Figure 3 illustrates.
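The two radii in Example 3 can be reproduced with a few lines of Python. The sketch below uses that, for the $\ell_{1}$ norm, $X\cap{\mathbb{B}}^{3}$ is the convex hull of the origin and the three unit vectors, so $A(X\cap{\mathbb{B}}^{3})$ is the triangle spanned by the columns of $A$ (which contains the origin in its interior). The function name `example3_radii` and the sample value $\epsilon=0.1$ are illustrative assumptions.

```python
import math

def example3_radii(eps):
    """Outer and inner radii (centered at the origin) of A(X ∩ B^3) for
    the matrix of Example 3, with the l1 norm in R^3 and the l2 norm in R^2."""
    # Columns of A, i.e. the vertices of the triangle A(X ∩ B^3).
    cols = [(1.0, -eps), (-1.0, -eps), (0.0, 1.0)]
    # Smallest centered ball containing the triangle: farthest vertex.
    outer = max(math.hypot(a[0], a[1]) for a in cols)
    # Largest centered ball inside the triangle: closest edge line,
    # using |cross(p, q)| / ||q - p|| as the distance from the origin.
    inner = min(
        abs(p[0] * q[1] - p[1] * q[0]) / math.hypot(q[0] - p[0], q[1] - p[1])
        for p, q in zip(cols, cols[1:] + cols[:1])
    )
    return outer, inner

outer, inner = example3_radii(0.1)
```

For $\epsilon=0.1$ the closest edge is the horizontal one at height $-\epsilon$ . The sketch takes the minimum over all three edges, so it does not presuppose which edge is binding; for values of $\epsilon$ close to one the slanted edges become the closer ones.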

The above norms, especially $\|(A|X)^{-1}\|$ and other related quantities, have been extensively studied in the literature on condition measures for convex optimization [6, 9, 11, 27, 31, 30]. They have been further extended to the broader variational analysis context [19, 7]. In particular, when $A(X)={\mathbb{R}}^{m}$ the family of conic systems $Ax=b,x\in X$ is well-posed. That is, for all $b\in{\mathbb{R}}^{m}$ the conic system $Ax=b,x\in X$ is feasible and remains so for sufficiently small perturbations of $(A,b)$ . In this case it follows from [31] that the quantity $1/\|(A|X)^{-1}\|$ is precisely the distance to ill-posedness introduced by Renegar [30, 31], that is, the size of the smallest perturbation $\Delta A$ on $A$ so that the conic system $(A+\Delta A)x=b,x\in X$ is infeasible for some $b\in{\mathbb{R}}^{m}$ . A similar identity holds for the distance to non-surjectivity of closed sublinear set-valued mappings [19]. The latter in turn extends to a far more general identity for the radius of metric regularity [7].

Observe that if $A\in{\mathbb{R}}^{m\times n}$ and $X\subseteq{\mathbb{R}}^{n}$ is a linear subspace then $A(X)$ is automatically a linear subspace. If in addition ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are each endowed with Euclidean norms, then (18) and (19) yield

[TABLE]

Corollary 1 and Theorem 1 below show that there is a tight connection between the relative strong convexity constant $\mu_{f,X,D}$ and the norm $\|(A|X)^{-1}\|$ when $f$ is of the form $g\circ A$ . Both of these results rely on the following proposition that characterizes a certain type of Hoffman constant [15]. Proposition 3 is closely related to developments in [26, 29]. Proposition 3 extends [29, Theorem 2] that only applies to the case $X={\mathbb{R}}^{n}_{+}$ .

Proposition 3.

Suppose ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are endowed with norms. Let $A\in{\mathbb{R}}^{m\times n}$ and $X\subseteq{\mathbb{R}}^{n}$ be a convex cone such that $A(X)$ contains more than one point. If $A(X)$ is a linear subspace then

[TABLE]

Proof.

Fix $y\in X$ and $x\in X\setminus Z_{A,X}(y)$ . Since $A(X)$ is a linear subspace, it follows that $Ay-Ax\in A(X)$ and thus $Ay-Ax=Au$ for some $u\in X$ with $\|u\|\leq\|(A|X)^{-1}\|\cdot\|Ay-Ax\|.$ Hence $x+u\in Z_{A,X}(y)$ and $\|Z_{A,X}(y)-x\|\leq\|u\|\leq\|(A|X)^{-1}\|\cdot\|Ay-Ax\|.$ Since this holds for arbitrary $y\in X$ and $x\in X\setminus Z_{A,X}(y)$ we conclude that

[TABLE]

To prove the reverse inequality, let $v\in A(X)$ and $0<\epsilon<\|(A|X)^{-1}\|$ be such that $\|v\|=1$ and $\|y\|\geq\|(A|X)^{-1}\|-\epsilon$ for all $y\in X$ with $Ay=v$ . Pick $\hat{y}\in X$ with $A\hat{y}=v$ . Then $\|z\|\geq\|(A|X)^{-1}\|-\epsilon>0$ for all $z\in Z_{A,X}(\hat{y})$ . Thus $\hat{x}:=0\in X\setminus Z_{A,X}(\hat{y})$ and

[TABLE]

To finish let $\epsilon\rightarrow 0$ . ∎

Proposition 3 readily yields the following result that generalizes Example 1.

Corollary 1.

Suppose ${\mathbb{R}}^{m}$ is endowed with the Euclidean norm $\|\cdot\|_{2}$ , ${\mathbb{R}}^{n}$ is endowed with a norm $\|\cdot\|$ , and $D(x,y)=\frac{1}{2}\|x-y\|^{2}$ . If $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ for some $A\in{\mathbb{R}}^{m\times n}$ and $b\in{\mathbb{R}}^{m}$ , $X\subseteq{\mathbb{R}}^{n}$ is a convex cone, and $A(X)$ is a linear subspace that contains more than one point then

[TABLE]

Proof.

This follows from Proposition 3 and the observation that for this choice of $f$ and $X$ we have $Z_{f,X}(y)=Z_{A,X}(y)$ and $f(y)-f(x)-\left\langle\nabla f(x),y-x\right\rangle=\frac{1}{2}\|Ay-Ax\|_{2}^{2}.$ ∎

The following result extends Corollary 1 to a broader class of functions.

Theorem 1.

Suppose ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are endowed with norms and $D(x,y)\leq\frac{1}{2}\|x-y\|^{2}$ for the norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ . Let $A\in{\mathbb{R}}^{m\times n},\;g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$ be a convex differentiable function, and $X\subseteq{\mathbb{R}}^{n}$ be a convex cone such that $A(X)$ is a linear subspace that contains more than one point. If $g$ is $\mu_{g}$ strongly convex for the norm $\|\cdot\|$ in ${\mathbb{R}}^{m}$ then the function $f=g\circ A$ satisfies

[TABLE]

Proof.

Observe that $D_{f}(y,x)=g(Ay)-g(Ax)-\left\langle\nabla g(Ax),A(y-x)\right\rangle$ for all $y,x\in X$ . Since $g$ is $\mu_{g}$ strongly convex, it follows that $D_{f}(y,x)\geq\mu_{g}\|Ay-Ax\|^{2}/2$ for all $y,x\in X$ and $Z_{f,X}(y)=\{x\in X:Ax=Ay\}=Z_{A,X}(y)$ for all $y\in X$ . Therefore Proposition 3 implies that

[TABLE]

∎

If $f,X,D$ are as in Corollary 1 then by Proposition 2 the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ is

[TABLE]

which has a striking resemblance to the classical condition number (1) of $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}.$ More generally, if $f,X,D$ are as in Theorem 1, $D(y,x)=\|y-x\|^{2}/2$ , and $g$ is also $L_{g}$ smooth then by Proposition 2 we obtain the following bound on the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ in terms of the condition number of $g$ and a condition number of the pair $(A,X)$ :

[TABLE]

3.2 Lower bound on $\mu_{f,X,D}$ when $X$ is a polyhedron

The results in Section 3.1 require $X$ to be a convex cone and $A(X)$ to be a linear subspace. We next provide some results of similar flavor that relax these assumptions in exchange for the assumption that $X$ is a polyhedron. The crux of the main results in this section is Proposition 4. This technical result is drawn from the recent paper of Peña, Vera, and Zuluaga [26]. The latter paper develops a number of properties of a new class of relative Hoffman bounds. In particular, it introduces the sets of tangent cones ${\mathcal{T}}(X)$ and ${\mathcal{T}}(A|X)$ described below. These two sets of tangent cones are at the heart of the main developments in [26].

For a nonempty polyhedron $X\subseteq{\mathbb{R}}^{n}$ let ${\mathcal{T}}(X):=\{T_{X}(x):x\in X\}$ , where $T_{X}(x)$ is the tangent cone of $X$ at $x$ , that is,

[TABLE]

We will rely on the following subset of ${\mathcal{T}}(X)$ that depends on how $A$ and $X$ fit together. Let

[TABLE]

In this definition, minimal is to be interpreted as minimal with respect to inclusion. This restriction guarantees that the set ${\mathcal{T}}(A|X)$ is of minimal size as it does not include redundant cones from ${\mathcal{T}}(X)$ .

Observe that ${\mathcal{T}}(X)$ is finite since $X$ is polyhedral and thus ${\mathcal{T}}(A|X)$ is finite as well. The following example illustrates the interesting relationship between $A$ and the tangent cones of $X$ captured by ${\mathcal{T}}(A|X)$ .

Example 4.

Suppose $A\in{\mathbb{R}}^{m\times n}$ and $X={\mathbb{R}}^{n}_{+}$ . In this case each element of ${\mathcal{T}}(X)$ is of the form $C_{I}=\{x\in{\mathbb{R}}^{n}:x_{I}\geq 0\}$ for some $I\subseteq\{1,\dots,n\}$ . Observe that $A(C_{I})$ is a linear subspace if and only if $Ax=0,\,x_{I}>0$ is feasible. Thus the set ${\mathcal{T}}(A|X)$ is in one-to-one correspondence with the maximal sets $I\subseteq\{1,\dots,n\}$ such that $Ax=0,\,x_{I}>0$ is feasible.

Observe that ${\mathcal{T}}(A|X)=\{X\}$ when $X$ is a polyhedral cone and $A(X)$ is a linear subspace. Thus the following proposition subsumes Proposition 3 when $X$ is polyhedral.

Proposition 4.

Suppose ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are endowed with norms. Let $A\in{\mathbb{R}}^{m\times n}$ and $X\subseteq{\mathbb{R}}^{n}$ be a polyhedron such that $A(X)$ contains more than one point. Then

[TABLE]

Proof.

This follows as a special case of [26, Proposition 5 and Corollary 3]. ∎

Corollary 2.

Suppose ${\mathbb{R}}^{m}$ is endowed with the Euclidean norm $\|\cdot\|_{2}$ , ${\mathbb{R}}^{n}$ is endowed with a norm $\|\cdot\|$ , and $D(x,y)=\frac{1}{2}\|x-y\|^{2}$ . If $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ for some $A\in{\mathbb{R}}^{m\times n}$ and $b\in{\mathbb{R}}^{m}$ , and $X\subseteq{\mathbb{R}}^{n}$ is a polyhedron such that $A(X)$ contains more than one point then

[TABLE]

Proof.

Proceed exactly as in the proof of Corollary 1 but apply Proposition 4 instead of Proposition 3. ∎

Theorem 2.

Suppose ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are endowed with norms and $D(x,y)\leq\frac{1}{2}\|x-y\|^{2}$ for the norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ . Let $A\in{\mathbb{R}}^{m\times n},\;g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$ be a convex differentiable function, and $X\subseteq{\mathbb{R}}^{n}$ be a polyhedron such that $A(X)$ contains more than one point. If $g$ is $\mu_{g}$ strongly convex for the norm in ${\mathbb{R}}^{m}$ then the function $f=g\circ A$ satisfies

[TABLE]

Proof.

Proceeding exactly as in the proof of Theorem 1 but applying Proposition 4 instead of Proposition 3 we get

[TABLE]

∎

Observe that if $X$ is polyhedral then $\operatorname{span}(X-X)\in{\mathcal{T}}(X)$ and

[TABLE]

Thus Proposition 2 implies that for $f,X,D$ as in Corollary 2, the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ has the following expression, which is again strikingly similar to the classical condition number (1) of $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ :

[TABLE]

Proposition 2 also implies that if $f,X,D$ are as in Theorem 2, $D(y,x)=\|y-x\|^{2}/2$ , and $g$ is $L_{g}$ smooth then the relative condition number $L_{f,X,D}/\mu_{f,X,D}$ can be bounded in terms of the condition number of $g$ and a condition number of the pair $(A,X)$ as follows:

[TABLE]

We next place some of the developments by Peña and Rodríguez [28] in the context of this paper. To that end, consider the special case when $X$ is the standard simplex $\Delta_{n-1}:=\{x\in{\mathbb{R}}^{n}_{+}:\|x\|_{1}=1\}$ in ${\mathbb{R}}^{n}$ . For $A=\begin{bmatrix}a_{1}&\cdots&a_{n}\end{bmatrix}\in{\mathbb{R}}^{m\times n}$ let ${\mathsf{conv}}(A):={\mathsf{conv}}(\{a_{1},\dots,a_{n}\})=\{Ax:x\in\Delta_{n-1}\}$ and let $\mathsf{faces}({\mathsf{conv}}(A))$ denote the set of faces of ${\mathsf{conv}}(A)$ . Furthermore, for $F\in\mathsf{faces}({\mathsf{conv}}(A))$ let $A\setminus F$ denote the set of columns of $A$ that do not belong to $F$ . Suppose ${\mathbb{R}}^{m}$ is endowed with a norm and for $F,G\subseteq{\mathbb{R}}^{m}$ let $\mathsf{dist}(F,G):=\inf_{u\in F,v\in G}\|u-v\|$ . Following [28] define the facial distance $\Phi(A)$ of $A$ as follows

[TABLE]

Let $\mathsf{diam}(A)$ denote the diameter of the set of columns of $A$ defined as follows

[TABLE]

In the special case when $X=\Delta_{n-1}$ it follows from [28, Theorem 1] that (23) in Proposition 4 has the following geometric characterization

[TABLE]

Furthermore, in this same special case when $X=\Delta_{n-1}$ it is easy to see that (17) has the following geometric characterization

[TABLE]
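To make the geometric reading of (17) on the simplex concrete, the following sketch computes $\mathsf{diam}(A)$ as the largest Euclidean distance between two columns of $A$ and compares it with the largest value of $\|Au\|_{2}$ over the points $u=(e_{i}-e_{j})/2$ . These points are the extreme points of $\{u:\left\langle\mathbf{1},u\right\rangle=0,\;\|u\|_{1}\leq 1\}$ , so the maximum of the convex function $u\mapsto\|Au\|_{2}$ over that set is attained at one of them. The helper names and the choice $A=I_{3}$ are illustrative assumptions.

```python
import math
from itertools import combinations

def diam(cols):
    """Largest Euclidean distance between two columns of A."""
    return max(math.dist(p, q) for p, q in combinations(cols, 2))

def max_norm_on_simplex_directions(cols):
    """max ||A u||_2 over u = (e_i - e_j) / 2, the extreme points of the
    l1 unit ball intersected with span(X - X) for X the standard simplex."""
    return max(
        math.sqrt(sum(((pi - qi) / 2.0) ** 2 for pi, qi in zip(p, q)))
        for p, q in combinations(cols, 2)
    )

# Columns of A = I_3 (illustrative choice).
cols = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
d = diam(cols)
half = max_norm_on_simplex_directions(cols)
```

For $A=I_{3}$ this gives $\mathsf{diam}(A)=\sqrt{2}$ and a restricted norm of $\sqrt{2}/2$ , that is, half the diameter.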

Figure 4 gives a visualization of ${\mathsf{conv}}(A)$ and of the facial distance $\Phi(A)$ for $A=I_{3}$ and $A=I_{4}$ . It depicts ${\mathsf{conv}}(A)$ and $\Phi(A)$ in the hyperplane $\{x:\left\langle\mathbf{1},x\right\rangle=1\}$ .

Example 5 below, a special case of Corollary 2, shows that for $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ , $X=\Delta_{n-1}$ , and $D(y,x)=\frac{1}{2}\|y-x\|_{1}^{2}$ the relative condition number $L_{f,\Delta_{n-1},D}/\mu_{f,\Delta_{n-1},D}$ is the square of $\mathsf{diam}(A)/\Phi(A)$ , which has a flavor of an aspect ratio of ${\mathsf{conv}}(A)$ . This gives an interesting analogy to (1).

Example 5.

Suppose ${\mathbb{R}}^{n}$ is endowed with the $\ell_{1}$ norm, ${\mathbb{R}}^{m}$ is endowed with the Euclidean $\ell_{2}$ norm, and $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ for some $A\in{\mathbb{R}}^{m\times n}$ with at least two different columns and $b\in{\mathbb{R}}^{m}$ . Then for $D(y,x):=\frac{1}{2}\|y-x\|_{1}^{2}$ Corollary 2 and identities (28) and (27) yield

[TABLE]

In particular,

[TABLE]

More generally, if $f(x)=g(Ax)$ for some $L_{g}$ smooth and $\mu_{g}$ strongly convex function $g$ then

[TABLE]

In particular,

[TABLE]

4 Properties of $\mu_{f,X,D}^{\star}$ and $\mu_{f,X,D}^{\sharp}$

We next provide bounds on $\mu_{f,X,D}^{\star}$ and $\mu_{f,X,D}^{\sharp}$ analogous to those developed in Section 3 for $\mu_{f,X,D}$ . Proposition 1 already established $\mu_{f,X,D}^{\star}\geq\mu_{f,X,D}\geq 0$ . It is intuitively clear that $\mu_{f,X,D}^{\star}$ could be a lot larger. When $D$ is a squared norm, the exact same technique used in [22, Theorem 1] shows that $\mu_{f,X,D}^{\sharp}\geq\mu_{f,X,D}^{\star}$ . Indeed, when $D$ is a squared norm, the relationships among other variants of strong convexity introduced in [22] extend to our context in a straightforward fashion, as we next explain.

Definition 4.

Suppose $(f,X,D)$ satisfies Assumption 2.

(a)

We say that $f$ has $D$ -under approximation on $X$ if there exists a constant $\mu>0$ such that

[TABLE]

(b)

We say that $f$ has $D$ -gradient growth on $X$ if there exists a constant $\mu>0$ such that

[TABLE]

Suppose $(f,X,D)$ satisfies Assumption 2 and $D$ is a squared norm. Then for $\mu>0$ [22, Theorem 4] yields the following chain of implications for $(f,X,D,\mu)$ :

[TABLE]

We note that [22, Theorem 4] is stated and proven for the Euclidean norm but the same statement and proof hold for any norm.

From the above chain of implications it follows that if $(f,X,D)$ satisfies Assumption 2 and $D$ is a squared norm then $\mu_{f,X,D}\leq\mu^{\star}_{f,X,D}\leq\mu^{\sharp}_{f,X,D}$ . In particular, any lower bound on $\mu_{f,X,D}$ , such as those in Theorem 1 or Theorem 2, is also a lower bound on $\mu^{\star}_{f,X,D}$ and on $\mu^{\sharp}_{f,X,D}$ when $D$ is a squared norm. We next show that the ideas in Section 3 can be extended to obtain sharper bounds on these two constants.

4.1 A sharper lower bound on $\mu^{\star}_{f,X,D}$

Suppose $A\in{\mathbb{R}}^{m\times n}$ and $X\subseteq{\mathbb{R}}^{n}$ is a polyhedron such that $A(X)$ contains more than one point, and $S\subseteq X$ is nonempty. Proposition 4 readily implies

[TABLE]

Proposition 5 below, which extends Proposition 4, gives a sharper version of (31). Suppose $A\in{\mathbb{R}}^{m\times n},\;X\subseteq{\mathbb{R}}^{n}$ is a polyhedron, and $S\subseteq X$ is nonempty. Let

[TABLE]

where

[TABLE]

Proposition 5 can be proven via a straightforward modification of techniques in [26]. We provide the details of this modification in Appendix A.

Proposition 5.

Suppose ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are endowed with norms. Let $A\in{\mathbb{R}}^{m\times n}$ and $X\subseteq{\mathbb{R}}^{n}$ be a polyhedron such that $A(X)$ contains more than one point. Then for all nonempty $S\subseteq X$

[TABLE]

Furthermore, if $A(S)$ is convex then

[TABLE]

Corollary 3.

Suppose ${\mathbb{R}}^{m}$ is endowed with the Euclidean norm $\|\cdot\|_{2}$ , ${\mathbb{R}}^{n}$ is endowed with a norm $\|\cdot\|$ , and $D(x,y)=\frac{1}{2}\|x-y\|^{2}$ . If $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ for some $A\in{\mathbb{R}}^{m\times n}$ and $b\in{\mathbb{R}}^{m}$ , and $X\subseteq{\mathbb{R}}^{n}$ is a polyhedron such that $A(X)$ contains more than one point and $X^{\star}:=\operatorname*{argmin}_{x\in X}f(x)\neq\emptyset$ , then

[TABLE]

Proof.

Proceed exactly as in the proof of Corollary 1 but apply Proposition 5 instead of Proposition 3. ∎

The following theorem gives a lower bound on $\mu^{\star}_{f,X,D}$ analogous to the one on $\mu_{f,X,D}$ in Theorem 2. In light of Proposition 5, the lower bound on $\mu^{\star}_{f,X,D}$ in Theorem 3 is at least as large, and possibly much larger, than the one on $\mu_{f,X,D}$ in Theorem 2.

Theorem 3.

Suppose ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are endowed with norms and $D(y,x)\leq\frac{1}{2}\|y-x\|^{2}$ for the norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ . Let $A\in{\mathbb{R}}^{m\times n},\;g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$ and $X\subseteq{\mathbb{R}}^{n}$ be a polyhedron such that $A(X)$ has more than one point. If $g$ is $\mu_{g}$ -strongly convex for the norm in ${\mathbb{R}}^{m}$ then the function $f=g\circ A$ satisfies

[TABLE]

Proof.

Observe that for all $y\in X^{\star}$ and $x\in X$

[TABLE]

Since $g$ is $\mu_{g}$ strongly convex on $A(X)$ , it follows that $D_{f}(y,x)\geq\mu_{g}\|Ay-Ax\|^{2}/2$ for all $y\in X^{\star}$ and $x\in X$ , and it also follows that $Z_{A,X}(y)=\{x\in X:Ax=Ay\}=X^{\star}$ for all $y\in X^{\star}$ . Therefore

[TABLE]

To finish, apply Proposition 5. ∎

Once again there is an interesting connection with the developments in [28] when $X=\Delta_{n-1}$ . Consider the special case when $X=\Delta_{n-1},\;A\in{\mathbb{R}}^{m\times n}$ has at least two different columns, $S\subseteq\Delta_{n-1}$ is nonempty, and $G\in\mathsf{faces}({\mathsf{conv}}(A))$ is the smallest face of ${\mathsf{conv}}(A)$ that contains $A(S)$ . From [28, Theorem 3] it follows that if ${\mathbb{R}}^{n}$ is endowed with the one-norm then

[TABLE]

The following example illustrates the difference between $\mu_{f,X,D}$ and $\mu^{\star}_{f,X,D}$ .

Example 6.

Suppose ${\mathbb{R}}^{n}$ is endowed with the one-norm and $D(y,x):=\frac{1}{2}\|y-x\|_{1}^{2}$ . Suppose ${\mathbb{R}}^{m}$ is endowed with the Euclidean norm, and $f(x)=\frac{1}{2}\|Ax-b\|_{2}^{2}$ for some $A\in{\mathbb{R}}^{m\times n}$ with at least two different columns and $b\in{\mathbb{R}}^{m}$ . As noted in Example 5, in this case

[TABLE]

This relative strong convexity constant depends only on $A$ but not on $b$ . On the other hand, the smallest face of ${\mathsf{conv}}(A)$ containing $A(X^{\star})$ is

[TABLE]

which evidently depends on both $A$ and $b$ . Theorem 3 and (35) yield

[TABLE]

It is evident that

[TABLE]

Furthermore, as illustrated in [28], the difference between these two quantities can be arbitrarily large. Consequently, the bound in Theorem 3 can be far sharper than that in Theorem 2.

4.2 A sharper lower bound on $\mu^{\sharp}_{f,X,D}$

Suppose $f:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is defined as $f(x)=g(Ax)+\left\langle c,x\right\rangle$ where $g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is a strongly convex function, $A\in{\mathbb{R}}^{m\times n}$ and $c\in{\mathbb{R}}^{n}$ . Theorem 3 does not apply to this kind of function due to the extra linear term $\left\langle c,x\right\rangle$ . Indeed for a function of this form the constant $\mu_{f,X,D}^{\star}$ may be zero, see Example 7 below. On the other hand, the next result shows that for a function of this form and for a polyhedral set $X$ it is always the case that $\mu^{\sharp}_{f,X,D}>0$ provided a suitable linear cut is added to $X$ .

Theorem 4.

Suppose ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$ are endowed with norms and $D(x,y)\leq\frac{1}{2}\|x-y\|^{2}$ for the norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ . Let $A\in{\mathbb{R}}^{m\times n},\;c\in{\mathbb{R}}^{n},$ and $X\subseteq{\mathbb{R}}^{n}$ be a polyhedron such that $A(X)$ contains more than one point. Suppose $g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is $\mu_{g}$ -strongly convex for the norm in ${\mathbb{R}}^{m}$ and $f:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is defined via $f(x)=g(Ax)+\left\langle c,x\right\rangle$ . Then the vector $v:=2\nabla f(y)$ is the same for all $y\in X^{\star}$ and satisfies $\left\langle v,x-y\right\rangle\geq 0$ for all $x\in X,\;y\in X^{\star}$ . Furthermore, one of the following two possible cases applies depending on the range of values of $\left\langle v,x-y\right\rangle$ for $x\in X,y\in X^{\star}$ .

Case 1:

For all $x\in X,y\in X^{\star}$ we have $\left\langle v,x-y\right\rangle=0$ . In this case

[TABLE]

Case 2:

For some $x\in X,y\in X^{\star}$ we have $\left\langle v,x-y\right\rangle>0$ . In this case for all $\delta>0$

[TABLE]

for the polyhedron $X_{\delta}:=\{x\in X:\left\langle v,x-y\right\rangle\leq\delta\text{ for all }y\in X^{\star}\}\supseteq X^{\star}$ , the matrix $M\in{\mathbb{R}}^{(m+1)\times n},$ and the norm $\|\cdot\|$ in ${\mathbb{R}}^{m+1}$ defined as follows

[TABLE]

Proof.

The optimality conditions for $\min_{x\in X}f(x)$ imply that

[TABLE]

Thus for all $y,y^{\prime}\in X^{\star}$ the strong convexity of $g$ and (36) imply

[TABLE]

Hence $Ay=Ay^{\prime}$ whenever $y,y^{\prime}\in X^{\star}.$ In particular, $v=2\nabla f(y)=2(A^{\text{\sf T}}\nabla g(Ay)+c)$ is the same for all $y\in X^{\star}$ . Furthermore, the optimality conditions for $\min_{x\in X}f(x)$ imply that $\left\langle v,x-y\right\rangle\geq 0$ for all $x\in X,y\in X^{\star}$ . In particular, $\left\langle v,y\right\rangle=\min_{x\in X}\left\langle v,x\right\rangle$ for all $y\in X^{\star}$ .

Next, the strong convexity of $g$ on $A(X)$ implies that for all $x\in X,y\in X^{\star}$

[TABLE]

If $\left\langle v,x-y\right\rangle=0$ for all $x\in X,y\in X^{\star}$ then Case 1 applies. In this case $Z_{A,X}(y)=\{x\in X:Ax=Ay\}=X^{\star}$ for all $y\in X^{\star}$ and thus

[TABLE]

If $\left\langle v,x-y\right\rangle>0$ for some $x\in X,y\in X^{\star}$ then Case 2 applies. In this case $Z_{M,X}(y)=\{x\in X:Ax=Ay,\left\langle v,x\right\rangle=\left\langle v,y\right\rangle\}=X^{\star}$ for all $y\in X^{\star}$ and thus

[TABLE]

Next, observe that for $y\in X^{\star}$ and $x\in X_{\delta}$

[TABLE]

To finish, apply Proposition 5 in either case. ∎

Observe that if $X$ in Theorem 4 is bounded then Case 2 gives a lower bound on $\mu^{\sharp}_{f,X,D}$ by taking $\delta:=\max_{x\in X,y\in X^{\star}}\left\langle v,x-y\right\rangle$ because $X=X_{\delta}$ for this choice of $\delta$ .

We conclude this section with a simple example showing that $\mu^{\sharp}_{f,X,D}>\mu^{\star}_{f,X,D}=0$ can occur. The example also shows that the additional bound on $X_{\delta}$ in Theorem 4, Case 2 cannot simply be dropped without making some additional assumptions.

Example 7.

Let ${\mathbb{R}}^{3}$ be endowed with the one-norm and let $D(y,x):=\frac{1}{2}\|y-x\|_{1}^{2}.$ Suppose $f:{\mathbb{R}}^{3}\rightarrow{\mathbb{R}}$ is as follows

$$f(x)=\tfrac{1}{2}(x_{1}-x_{2})^{2}+x_{3}.$$

If $X:=\Delta_{2}\subseteq{\mathbb{R}}^{3}$ then $X^{\star}=\{\begin{bmatrix}1/2&1/2&0\end{bmatrix}^{\text{\sf T}}\}$ . For $x=\begin{bmatrix}0&0&1\end{bmatrix}^{\text{\sf T}}$ we have $f(\bar{x})-f(x)-\left\langle\nabla f(x),\bar{x}-x\right\rangle=0$ and $\|\bar{x}-x\|_{1}=2$ . Hence $\mu^{\star}_{f,X,D}=0$ . On the other hand, Theorem 4 implies that $\mu^{\sharp}_{f,X,D}>0.$ A more careful calculation shows that in this case $\mu^{\sharp}_{f,X,D}=1/2$ .

On the other hand, if $X={\mathbb{R}}^{3}_{+}$ then $X^{\star}=\{\begin{bmatrix}t&t&0\end{bmatrix}^{\text{\sf T}}:t\geq 0\}.$ For $t>0$ and $x=\begin{bmatrix}0&0&t\end{bmatrix}^{\text{\sf T}}$ we have $f(x)-f^{\star}=t$ and $\|X^{\star}-x\|_{1}=t$ . Therefore $\mu^{\sharp}_{f,X,D}=0$ . Furthermore, in the context of Theorem 4 we have $v=\begin{bmatrix}0&0&2\end{bmatrix}^{\text{\sf T}}$ . A simple calculation shows that for all $\delta>0$ we have $X_{\delta}=\{x\in X:x_{3}\leq\delta/2\}$ and $\mu^{\sharp}_{f,X_{\delta},D}=2/(2+\delta/2).$
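The value $\mu^{\sharp}_{f,X_{\delta},D}=2/(2+\delta/2)$ can be checked by a crude grid search. The sketch below rests on two assumptions that should be read as such: that $f(x)=\frac{1}{2}(x_{1}-x_{2})^{2}+x_{3}$ , which is consistent with every value quoted in Example 7, and that the functional growth constant is the infimum of $(f(x)-f^{\star})/D(\bar{x},x)$ over $x\in X_{\delta}\setminus X^{\star}$ . Under these assumptions both the optimality gap and the $\ell_{1}$ distance to $X^{\star}$ depend on $x$ only through $s=|x_{1}-x_{2}|$ and $r=x_{3}$ . The function names and the grid are illustrative.

```python
def growth_ratio(s, r):
    """(f(x) - f*) / D(xbar, x) for x >= 0 with |x1 - x2| = s and x3 = r,
    assuming f(x) = (x1 - x2)^2 / 2 + x3 and D(y, x) = ||y - x||_1^2 / 2."""
    gap = 0.5 * s * s + r            # f(x) - f*, with f* = 0
    dist = s + r                     # l1 distance from x to X* = {(t, t, 0)}
    return gap / (0.5 * dist * dist)

def mu_sharp_estimate(delta, steps=100):
    """Grid estimate of the growth constant on X_delta = {x in X : x3 <= delta / 2}."""
    best = float("inf")
    for i in range(601):             # s = 0.00, 0.01, ..., 6.00
        for j in range(steps + 1):   # r = 0, ..., delta / 2
            s = i / 100.0
            r = (delta / 2.0) * j / steps
            if s + r > 0.0:          # skip the single point lying in X*
                best = min(best, growth_ratio(s, r))
    return best

delta = 2.0
estimate = mu_sharp_estimate(delta)
closed_form = 2.0 / (2.0 + delta / 2.0)
```

The minimum over the grid is attained at $s=2$ and $r=\delta/2$ , which matches the closed-form value stated in the example.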

5 Convergence of first-order methods

This section details linear convergence results for the mirror descent algorithm, Frank-Wolfe algorithm, and Frank-Wolfe algorithm with away steps for problem (2). The linear convergence statements for the three algorithms are strikingly similar. They are stated in terms of the relative constants $L_{f,X,D}$ and $\mu^{\star}_{f,X,D},\mu^{\sharp}_{f,X,D}$ for suitable choices of distance-like functions $D$ .

5.1 Mirror descent algorithm

Suppose $h:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}\cup\{\infty\}$ is convex and differentiable on $X\subseteq{\mathbb{R}}^{n}$ and the Bregman proximal map

[TABLE]

is computable for $x\in X$ and $L>0$ . The mirror descent algorithm for problem (2) is based on the following update for $x\in X$ :

[TABLE]

Algorithm 1 gives a description of the mirror descent algorithm for (2).
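For intuition, the following Python sketch runs the simplest instance of this scheme. It assumes the standard form of the update, $x_{+}=\operatorname*{argmin}_{y\in X}\{\left\langle\nabla f(x),y\right\rangle+L\,D_{h}(y,x)\}$ , with $h=\frac{1}{2}\|\cdot\|_{2}^{2}$ and $X={\mathbb{R}}^{3}_{+}$ , in which case each step reduces to a projected gradient step with step size $1/L$ . The matrix is the one from Example 1(c) with the illustrative value $\epsilon=1/2$ ; the vector $b$ , the starting point, and all function names are assumptions made for the sketch and are not taken from Algorithm 1.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def objective(A, b, x):
    """f(x) = (1/2) * ||A x - b||_2^2."""
    r = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    return 0.5 * sum(v * v for v in r)

def gradient(A, b, x):
    """grad f(x) = A^T (A x - b)."""
    r = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]

def mirror_descent_euclidean(A, b, x0, L, iters):
    """Mirror descent with h = (1/2)||.||_2^2 on X = R^n_+: the Bregman
    proximal step is the projection of x - grad f(x) / L onto R^n_+."""
    x = list(x0)
    values = [objective(A, b, x)]
    for _ in range(iters):
        g = gradient(A, b, x)
        x = [max(0.0, xj - gj / L) for xj, gj in zip(x, g)]
        values.append(objective(A, b, x))
    return x, values

eps = 0.5
A = [[1.0, -1.0, 0.0], [-eps, -eps, 1.0]]
b = [1.0, -1.0]
L = 2.0  # sigma_max(A)^2, since A A^T = diag(2, 1 + 2 eps^2) for this A
x, values = mirror_descent_euclidean(A, b, [0.0, 0.0, 0.0], L, 200)
```

With the step size $1/L$ the objective values in `values` are nonincreasing and all iterates remain in $X$ .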

Proposition 6 and Proposition 7 show the linear convergence of Algorithm 1 provided that suitable relative smoothness and relative quasi-strong convexity or relative functional growth conditions hold. Throughout the remainder of this subsection we assume that $(f,X,D_{h})$ satisfies Assumption 1.

We should note that Proposition 6 and its proof are straightforward modifications of the linear convergence results in [20, 32]. However, Proposition 6 shows that the linear convergence of Algorithm 1 holds with a sharper rate and under more general assumptions than those in [20, 32]. In particular, the rate in Proposition 6 is stated in terms of a relative quasi-strong convexity constant, which is always at least as large and possibly much larger than the kind of relative strong convexity constant in [20, 32]. Furthermore, our results in Section 3 and Section 4 guarantee linear convergence when $f$ is of the form $g\circ A$ provided $g$ and $h$ satisfy smoothness and strong convexity assumptions. The linear convergence results in [20, 32] do not apply for functions of this form because they are not strictly convex and thus the kind of relative strong convexity constant in [20, 32] is typically zero.

The following lemma, which is a straightforward extension of results presented in [32], provides the crux of the proof of Proposition 6.

Lemma 1.

Suppose $L:=L_{f,X,D_{h}}<\infty$ and $\mu:=\mu_{f,X,D_{h}}^{\star}>0$ . If $x\in X$ and

[TABLE]

then

[TABLE]

Proof.

Since $L=L_{f,X,D_{h}}$ and $\mu=\mu_{f,X,D_{h}}^{\star}$ we have

[TABLE]

and

[TABLE]

In addition, the three-point property of $D_{h}$ [5, Lemma 3.1] yields

[TABLE]

By putting together (39), (40), and (41) we get

[TABLE]

We get (38) by observing that the optimality conditions for (37) imply

[TABLE]

∎

Proposition 6.

Suppose $L:=L_{f,X,D_{h}}<\infty$ and $\mu:=\mu_{f,X,D_{h}}^{\star}>0$ . If $L_{k}=L,\;k=0,1,\dots$ in Algorithm 1 then the iterates generated by Algorithm 1 satisfy

[TABLE]

and

[TABLE]

Proof.

Lemma 1 applied to $x=x_{k}$ implies that

[TABLE]

Therefore

[TABLE]

Thus (42) readily follows. Inequality (43) also yields

[TABLE]

∎

Proposition 6 implies that if $L:=L_{f,X,D_{h}}<\infty$ and $\mu:=\mu_{f,X,D_{h}}^{\star}>0$ then Algorithm 1 yields $x_{k}\in X$ such that $f(x_{k})-f^{\star}<\epsilon$ in at most

[TABLE]

iterations.

Proposition 7 below shows that the same kind of iteration bound holds under a relative functional growth assumption instead of the quasi strong convexity assumption in Proposition 6. We note that although Proposition 7 is similar in flavor to Proposition 6, it is stated in terms of the novel concept of relative functional growth. Furthermore, neither Proposition 6 nor Proposition 7 implies the other since neither $\mu_{f,X,D_{h}}^{\star}$ nor $\mu_{f,X,D_{h}}^{\sharp}$ necessarily bounds the other. (See Example 2 and Example 7.)

Proposition 7.

Suppose $L:=L_{f,X,D_{h}}<\infty$ and $\mu:=\mu_{f,X,D_{h}}^{\sharp}>0$ . If $L_{k}=L,\;k=0,1,\dots$ in Algorithm 1 then for $K=\lceil 2L/\mu\rceil$ the iterates generated by Algorithm 1 satisfy

[TABLE]

In addition, Algorithm 1 yields $x_{k}\in X$ such that $f(x_{k})-f^{\star}<\epsilon$ in at most

[TABLE]

iterations.

Proof.

Since $L_{k}=L=L_{f,X,D_{h}}$ , it follows from [20, Theorem 3.1] that the $(k+K)$ -th iterate generated by Algorithm 1 satisfies

[TABLE]

Therefore, since $\mu:=\mu_{f,X,D_{h}}^{\sharp}>0$ ,

[TABLE]

Thus (44) follows. It also follows that for $k=mK,\;m=1,2,\dots$

[TABLE]

and thus (45) follows as well. ∎

To ease our exposition, in Proposition 6 and Proposition 7 we assumed $L_{k}=L$ is known and used in Step 3 of Algorithm 1. However, it is easy to see that these two results also hold if the assumption $L_{k}=L$ is relaxed to the assumption $L_{k}\leq L$ and $f(x_{k+1})\leq\min_{y\in X}\left\{f(x_{k})+\left\langle\nabla f(x_{k}),y-x_{k}\right\rangle+L_{k}D_{h}(y,x_{k})\right\}$ . The latter condition is easier to implement via a standard backtracking procedure. We also assume knowledge of suitable relative smoothness constants for the choice of stepsize $\alpha_{k}$ in Step 4 of Algorithm 2 and in Step 9 of Algorithm 3 below. As in Algorithm 1, this assumption can be relaxed via a standard backtracking procedure.
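The backtracking idea mentioned above can be sketched as follows for the mirror descent step, again in the illustrative setting where $X$ is the standard simplex and $D_{h}$ is the Kullback-Leibler divergence. Since $x_{k+1}$ is the minimizer of the model, the test $f(x_{k+1})\leq f(x_{k})+\left\langle\nabla f(x_{k}),x_{k+1}-x_{k}\right\rangle+L_{k}D_{h}(x_{k+1},x_{k})$ is equivalent to the condition stated in the text. The doubling factor, the starting value `L_init`, and the test problem are assumptions made for illustration only.

```python
import numpy as np

def kl(y, x):
    """Bregman distance of the negative entropy on the simplex (KL divergence)."""
    mask = y > 0
    return float(np.sum(y[mask] * np.log(y[mask] / x[mask])))

def backtracking_step(f, grad_f, x, L_init, growth=2.0):
    """Increase L until the prox step passes the descent test; return (step, L)."""
    g = grad_f(x)
    L = L_init
    while True:
        z = -g / L
        y = x * np.exp(z - z.max())
        y /= y.sum()                      # Bregman proximal step for this L
        model = f(x) + g @ (y - x) + L * kl(y, x)
        if f(y) <= model + 1e-15:         # descent test from the text
            return y, L
        L *= growth

# Illustrative problem (an assumption): f(x) = 0.5 * ||x - c||^2 on the simplex.
c = np.array([0.2, 0.3, 0.5])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c
x = np.ones(3) / 3
y_new, L_used = backtracking_step(f, grad_f, x, L_init=1.0 / 16)
```

In this sketch the loop terminates because the test is passed by every $L$ at least as large as the relative smoothness constant. In general, multiplying by `growth` may overshoot that constant by a factor of at most `growth`.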

5.2 Frank-Wolfe algorithm

Suppose $X\subseteq{\mathbb{R}}^{n}$ is a compact convex set and a linear oracle for $X$ is available, that is, the map

[TABLE]

is computable.
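For several common sets the linear oracle has a closed form. The following sketch shows three standard cases, namely the standard simplex, an $\ell_{1}$ ball, and a box. The function names are hypothetical and the choice of sets is an assumption made for illustration.

```python
import numpy as np

def lmo_simplex(g):
    """Vertex e_i of the standard simplex minimizing <g, y>."""
    u = np.zeros_like(g, dtype=float)
    u[np.argmin(g)] = 1.0
    return u

def lmo_l1_ball(g, radius=1.0):
    """Vertex of the l1 ball of the given radius minimizing <g, y>."""
    i = np.argmax(np.abs(g))
    u = np.zeros_like(g, dtype=float)
    u[i] = -radius * np.sign(g[i])
    return u

def lmo_box(g, lower, upper):
    """Minimizer of <g, y> over the box lower <= y <= upper."""
    return np.where(g > 0, lower, upper).astype(float)

g = np.array([0.5, -2.0, 1.0])
u_simplex = lmo_simplex(g)
u_l1 = lmo_l1_ball(g)
u_box = lmo_box(g, -np.ones(3), np.ones(3))
```

Each oracle costs a single pass over $g$, which is the main practical appeal of methods based on linear oracles.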

The Frank-Wolfe algorithm, also known as the conditional gradient algorithm, for (2) is based on the following update for $x\in X$:

[TABLE]

Algorithm 2 gives a description of the Frank-Wolfe algorithm for (2).
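A minimal sketch of a Frank-Wolfe loop is given below. It uses the common short-step rule $\alpha=\min\{1,\text{gap}/C\}$, where $C$ is an upper bound on the curvature of $f$ along segments in $X$. This rule, the test problem, and all names are assumptions made for illustration and need not coincide with the stepsize used in Algorithm 2.

```python
import numpy as np

def lmo_simplex(g):
    """Vertex of the standard simplex minimizing <g, y>."""
    u = np.zeros_like(g, dtype=float)
    u[np.argmin(g)] = 1.0
    return u

def frank_wolfe(grad_f, lmo, x0, curvature, num_iters):
    """Frank-Wolfe with the short-step rule alpha = min(1, gap / curvature)."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(num_iters):
        g = grad_f(x)
        u = lmo(g)                       # linear oracle call
        gap = g @ (x - u)                # Frank-Wolfe gap, nonnegative on X
        if gap <= 1e-14:
            break
        alpha = min(1.0, gap / curvature)
        x = x + alpha * (u - x)          # convex combination stays in X
        history.append(x.copy())
    return history

# Illustrative problem (an assumption): f(x) = 0.5 * ||x - c||^2 on the simplex.
# Its curvature along segments is at most the squared diameter, which is 2.
c = np.array([0.2, 0.3, 0.5])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c
history = frank_wolfe(grad_f, lmo_simplex, np.array([1.0, 0.0, 0.0]), 2.0, 400)
values = [f(x) for x in history]
```

With this stepsize each iteration decreases the quadratic upper model along the segment from $x$ to $u$, so `values` should be nonincreasing.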

Let $\mathfrak{R}:=\frac{\mathfrak{r}^{2}}{2}$ where $\mathfrak{r}:X\times X\rightarrow{\mathbb{R}}_{+}$ is the radial distance defined as follows: for $x,y\in X$

[TABLE]

Hence the relative smoothness constant $L_{f,X,\mathfrak{R}}$ is the smallest $L>0$ such that for all $x,u\in X$ and $\alpha\in[0,1]$

[TABLE]

Observe that the relative smoothness constant $L_{f,X,\mathfrak{R}}$ is precisely the curvature constant of $f$ on $X$ defined by Jaggi [16].
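To make the radial distance concrete, the sketch below evaluates it on the standard simplex. It assumes that $\mathfrak{r}(y,x)$ is the smallest $\rho\geq 0$ such that $y-x=\rho(u-x)$ for some $u\in X$, which is consistent with the identity $\|x^{\star}-x\|=\mathfrak{r}(x^{\star},x)\|u-x\|$ used later in this subsection. Under that assumption, the point $u=x+(y-x)/\rho$ stays in the simplex exactly when no coordinate becomes negative, which gives a closed form. The function name is hypothetical.

```python
import numpy as np

def radial_distance_simplex(y, x):
    """Smallest rho >= 0 with x + (y - x) / rho in the simplex (0 if y == x).

    Both y and x are assumed to lie in the standard simplex.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    shrinking = y < x                     # only these coordinates can reach zero
    if not np.any(shrinking):
        return 0.0                        # y >= x with equal sums forces y == x
    return float(np.max((x[shrinking] - y[shrinking]) / x[shrinking]))

x = np.ones(3) / 3
e1 = np.array([1.0, 0.0, 0.0])
r_same = radial_distance_simplex(x, x)
r_vertex = radial_distance_simplex(e1, x)
r_half = radial_distance_simplex(x + 0.5 * (e1 - x), x)
r_from_vertex = radial_distance_simplex(x, e1)
```

In this example a step of length $\alpha=0.5$ from $x$ toward the vertex $e_{1}$ has radial distance $0.5$, in line with the role of $\alpha$ in the smoothness condition above.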

The relative quasi strong convexity constant $\mu_{f,X,\mathfrak{R}}^{\star}$ is the largest $\mu\geq 0$ such that for all $x\in X$

[TABLE]

Similarly, the relative functional growth constant $\mu_{f,X,\mathfrak{R}}^{\sharp}$ is the largest $\mu\geq 0$ such that for all $x\in X$

[TABLE]

The next result shows the linear convergence of Algorithm 2 when $L_{f,X,\mathfrak{R}}/\mu_{f,X,\mathfrak{R}}^{\star}$ or $L_{f,X,\mathfrak{R}}/\mu_{f,X,\mathfrak{R}}^{\sharp}$ is finite. As we note below, Proposition 8 is at least as sharp as the linear convergence rates established in [13, 3].

Proposition 8.

Suppose $L:=L_{f,X,\mathfrak{R}}<\infty$ and $\mu:=\max\{\mu^{\star}_{f,X,\mathfrak{R}},\mu^{\sharp}_{f,X,\mathfrak{R}}/4\}>0.$ If each stepsize $\alpha_{k}$ in Step 4 of Algorithm 2 is chosen via

[TABLE]

then the iterates generated by Algorithm 2 satisfy

[TABLE]

Proof.

It suffices to show that at iteration $k$

[TABLE]

Indeed, inequality (47), the choice of $\alpha_{k}$ , and (48) imply that

[TABLE]

We next show (48). The construction of the radial distance and the choice of $u$ in Algorithm 2 imply that

[TABLE]

We next consider the two possible values of $\mu=\max\{\mu^{\star}_{f,X,\mathfrak{R}},\mu^{\sharp}_{f,X,\mathfrak{R}}/4\}$ separately.

Case 1: $\mu=\mu_{f,X,\mathfrak{R}}^{\star}$ . In this case we have

[TABLE]

Rearranging and applying the arithmetic-mean geometric-mean inequality we get

[TABLE]

Case 2: $\mu=\mu_{f,X,\mathfrak{R}}^{\sharp}/4$ . In this case we have

[TABLE]

Therefore the last term is at least as large as the geometric mean of the first two and we get

[TABLE]

∎

To conclude this subsection, we discuss some natural bounds on $L_{f,X,\mathfrak{R}}$ and $\mu_{f,X,\mathfrak{R}}^{\star}$ . Recall that $\mathsf{ri}(X)$ denotes the relative interior of $X$ . Similarly, let $\mathsf{rbd}(X)$ denote the relative boundary of $X$ . As previously discussed in [16], it readily follows from (47) that if $f$ is $L_{f}$-smooth on $X$ for some norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ then

[TABLE]

On the other hand, if $f$ is $\mu_{f}$ -strongly convex on $X$ for some norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ and the single element $x^{\star}\in X^{\star}$ satisfies $x^{\star}\in\mathsf{ri}(X)$ then for all $x\in X$ we have $\|x^{\star}-x\|=\mathfrak{r}(x^{\star},x)\|u-x\|\geq\mathfrak{r}(x^{\star},x)\|u-x^{\star}\|$ for some $u\in\mathsf{rbd}(X)$ . The strong convexity of $f$ thus implies both

[TABLE]

Therefore when $f$ is both $L_{f}$ -smooth and $\mu_{f}$ -strongly convex and $x^{\star}=\operatorname*{argmin}_{x\in X}f(x)\in\mathsf{ri}(X)$ we have

[TABLE]

Observe that the right-hand side in both inequalities is an interesting combination of the usual condition number of $f$ and a kind of condition number of the set $X$ around the point $x^{\star}$ . The first bound above and Proposition 8 yield a linear convergence result similar to [13, Theorem 2] but with a sharper rate.

The above bounds can be extended to a broader context. Suppose $f=g\circ A$ for some strongly convex function $g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}\cup\{\infty\}$ and $A\in{\mathbb{R}}^{m\times n}$ . Then for all $x\in X,x^{\star}\in X^{\star}$ we have

[TABLE]

Consequently, if $X^{\star}\cap\mathsf{ri}(X)\neq\emptyset$ then for all $x^{\star}\in X^{\star}\cap\mathsf{ri}(X)$

[TABLE]

Observe that $\mathsf{dist}(Ax^{\star},\mathsf{rbd}(A(X)))$ can in turn be bounded below as follows

[TABLE]

Therefore, when $f=g\circ A$ where $g$ is $L_{g}$-smooth and $\mu_{g}$-strongly convex, for all $x^{\star}\in X^{\star}\cap\mathsf{ri}(X)$ both $L_{f,X,\mathfrak{R}}/\mu_{f,X,\mathfrak{R}}^{\star}$ and $L_{f,X,\mathfrak{R}}/\mu_{f,X,\mathfrak{R}}^{\sharp}$ are bounded above by

[TABLE]

This bound and Proposition 8 yield a linear convergence result similar to [3, Proposition 3.2] but with a sharper rate.

5.3 Frank-Wolfe with away steps algorithm

Suppose $X\subseteq{\mathbb{R}}^{n}$ is a polytope and a vertex linear oracle for $X$ is available, that is, the map

[TABLE]

is computable and outputs a vertex of $X$ for all $g\in{\mathbb{R}}^{n}$ .

For this kind of linear oracle, each step of the Frank-Wolfe algorithm adds weight to some vertex $u$ . The basic idea of the Frank-Wolfe with away steps algorithm is to combine regular steps of the Frank-Wolfe algorithm with away steps that reduce weight from some vertex $a$ . To that end, the algorithm requires an additional vertex representation of $x\in X$ . More precisely, let $S(x)\subseteq\mathsf{vertices}(X)$ and $\lambda(x)\in\Delta(S(x)):=\{z\in{\mathbb{R}}^{S(x)}_{+}:\|z\|_{1}=1\}$ be such that

[TABLE]

Algorithm 3 describes a Frank-Wolfe with away steps algorithm. We should highlight that although the set $\mathsf{vertices}(X)$ could be immense, the algorithm does not require it explicitly. Instead the algorithm only maintains $S(x)$ and $\lambda(x)$, which are far more manageable. Indeed, by using the IRR procedure in [2] or its modification described in [14], Step 10 in Algorithm 3 can guarantee that the sets $S(x_{k})$ have size at most $n+1$ for $k=0,1,\dots$ .
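The interplay between regular steps and away steps can be sketched on the standard simplex, where the vertices are the unit vectors, the weights $\lambda(x)$ coincide with $x$, and $S(x)$ is the support of $x$. The sketch below is a standard textbook variant and is not a transcription of Algorithm 3. In particular, the rule for choosing between the two directions, the Euclidean smoothness constant `L` in the stepsize, and the test problem are assumptions.

```python
import numpy as np

def away_step_frank_wolfe(grad_f, x0, L, num_iters, tol=1e-12):
    """Away-step Frank-Wolfe on the standard simplex (lambda(x) = x)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        g = grad_f(x)
        support = np.flatnonzero(x > 0)               # active set S(x)
        u = int(np.argmin(g))                         # Frank-Wolfe vertex
        a = int(support[np.argmax(g[support])])       # away vertex in S(x)
        fw_gap = g @ x - g[u]
        away_gap = g[a] - g @ x
        if fw_gap <= tol:
            break
        is_away = away_gap > fw_gap
        if is_away:                                   # move away from e_a
            d = x.copy()
            d[a] -= 1.0
            alpha_max = x[a] / (1.0 - x[a])
        else:                                         # move toward e_u
            d = -x.copy()
            d[u] += 1.0
            alpha_max = 1.0
        alpha = min(alpha_max, -(g @ d) / (L * (d @ d)))
        x = x + alpha * d
        if is_away and alpha == alpha_max:
            x[a] = 0.0                                # drop step: remove e_a exactly
        x = np.maximum(x, 0.0)
        x /= x.sum()                                  # guard against rounding drift
    return x

# Illustrative problem (an assumption): project c onto the simplex.
c = np.array([0.6, 0.6, -0.2, 0.0])
x_star = np.array([0.5, 0.5, 0.0, 0.0])              # Euclidean projection of c
x_final = away_step_frank_wolfe(lambda x: x - c, np.array([0.0, 0.0, 1.0, 0.0]),
                                L=1.0, num_iters=400)
```

In this example the starting vertex $e_{3}$ is not in the support of the solution, so at least one away step with $\alpha_{k}=\alpha_{\max}$ is needed to remove its weight entirely.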

Proposition 9 below establishes the linear convergence of Algorithm 3 under suitable relative smoothness and quasi strong convexity or functional growth conditions. To that end, we consider two variants of the radial distance. Let $\mathfrak{D}:=\frac{\mathfrak{d}^{2}}{2}$ where $\mathfrak{d}:X\times X\rightarrow{\mathbb{R}}_{+}$ is the diametral distance defined via

[TABLE]

The relative smoothness constant $L_{f,X,\mathfrak{D}}$ is the smallest $L>0$ such that for all $x,u,w\in X$ and $\alpha\in[0,1]$ with $x+\alpha(u-w)\in X$

[TABLE]

The relative smoothness constant $L_{f,X,\mathfrak{D}}$ is precisely the away curvature constant of $f$ on $X$ defined by Lacoste-Julien and Jaggi [18].

To capture the appropriate relative strong convexity conditions, we rely on a more involved variant of the radial distance. For $x\in X$ , let $\mathbf{S}(x)$ denote the collection of all subsets $S(x)\subseteq\mathsf{vertices}(X)$ such that $x$ is a positive convex combination of the elements in $S(x)$ . Let $\mathfrak{G}:=\frac{\mathfrak{g}^{2}}{2}$ where $\mathfrak{g}:X\times X\rightarrow{\mathbb{R}}_{+}$ is defined via

[TABLE]

The relative strong convexity constant $\mu_{f,X,\mathfrak{G}}$ is at least as large as

[TABLE]

The latter quantity is precisely the geometric strong convexity constant defined by Lacoste-Julien and Jaggi [18, Appendix C]. Notice that it matches $\mu_{f,X,\mathfrak{G}}$ when $f$ is strictly convex because in that case $Z_{f,X}(y)=\{y\}$ for all $y\in X$ . Otherwise, $\mu_{f,X,\mathfrak{G}}$ could be larger.

The relative quasi strong convexity constant $\mu_{f,X,\mathfrak{G}}^{\star}$ is the largest $\mu\geq 0$ such that for all $x\in X$

[TABLE]

Similarly, the relative functional growth constant $\mu_{f,X,\mathfrak{G}}^{\sharp}$ is the largest $\mu\geq 0$ such that for all $x\in X$

[TABLE]

Since $\mu_{f,X,\mathfrak{G}}\leq\mu_{f,X,\mathfrak{G}}^{\star}$ and $\mu_{f,X,\mathfrak{G}}$ is at least as large as the geometric strong convexity constant in [18, Appendix C], the following linear convergence result is at least as sharp as the one given in [18, Theorem 8] for the Frank-Wolfe with away steps algorithm.

Proposition 9.

Suppose $L:=L_{f,X,\mathfrak{D}}<\infty$ and $\mu:=\max\{\mu^{\star}_{f,X,\mathfrak{G}},\mu^{\sharp}_{f,X,\mathfrak{G}}/4\}>0.$ If each stepsize $\alpha_{k}$ in Step 9 of Algorithm 3 is chosen via

[TABLE]

then the iterates generated by Algorithm 3 satisfy

[TABLE]

Proof.

This proof follows reasoning similar to that in the proof of Proposition 8. First we claim that at iteration $k$

[TABLE]

To show this claim, consider the two possible values of $\mu:=\max\{\mu^{\star}_{f,X,\mathfrak{G}},\mu^{\sharp}_{f,X,\mathfrak{G}}/4\}$ separately.

Case 1: $\mu=\mu^{\star}_{f,X,\mathfrak{G}}$ . In this case we have

[TABLE]

Rearranging and applying the arithmetic-mean geometric-mean inequality we get

[TABLE]

Case 2: $\mu=\mu_{f,X,\mathfrak{G}}^{\sharp}/4$ . In this case we have

[TABLE]

Therefore the last term is at least as large as the geometric mean of the first two and we get

[TABLE]

To finish the proof, we next show (52) by relying on (53). To do so, we replicate some of the main ideas previously introduced in [2, 18, 28].

The choice of $v$ at iteration $k$ and (53) imply that

[TABLE]

We consider separately the three possible cases that can occur for $\alpha_{k}$ at iteration $k$ , namely $\alpha_{k}<\alpha_{\max}$ , $\alpha_{k}=\alpha_{\max}\geq 1,$ and $\alpha_{k}=\alpha_{\max}<1.$

Case 1: $\alpha_{k}<\alpha_{\max}$ . In this case $|S(x_{k+1})|\leq|S(x_{k})|+1$ . In addition, inequalities (50) and (54), and the choice of $\alpha_{k}$ imply that

[TABLE]

Case 2: $\alpha_{k}=\alpha_{\max}\geq 1$ . In this case $|S(x_{k+1})|\leq|S(x_{k})|$ . In addition, inequality (50), the choice of $v$ , and the convexity of $f$ imply that

[TABLE]

Case 3: $\alpha_{k}=\alpha_{\max}<1$ . In this case $|S(x_{k+1})|\leq|S(x_{k})|-1$ . In addition, (50) and the choice of $\alpha_{k}$ imply that

[TABLE]

We next show that in the first $k$ iterations Case 3 can occur at most $k/2$ times by using the argument introduced by Lacoste-Julien and Jaggi in [18]. Since $|S(x_{0})|=1$ and $|S(x_{i})|\geq 1$ for $i=1,2,\dots,$ it follows that for each iteration when Case 3 occurred there must have been at least one previous iteration when Case 1 occurred. Hence in the first $k$ iterations Case 3 could occur at most $k/2$ times.

To finish the proof, observe that at every iteration $k$ when Case 1 or Case 2 occurs, inequalities (55) and (56) yield

[TABLE]

We note that the minimum in the last expression is necessary because $\mu_{f,X,\mathfrak{G}}^{\sharp}>2L_{f,X,\mathfrak{D}}$ may indeed occur. For a concrete example, see [28, Example 6].

∎

We next discuss some bounds on $L_{f,X,\mathfrak{D}}$ and on $\mu_{f,X,\mathfrak{G}},\mu_{f,X,\mathfrak{G}}^{\star},\mu_{f,X,\mathfrak{G}}^{\sharp}$ in terms of the set $A:=\mathsf{vertices}(X)$ . We should note that the bounds below on $L_{f,X,\mathfrak{D}}$ and on $\mu_{f,X,\mathfrak{G}}$ have also been derived, albeit following a different approach, in [18, Appendix C].

From (50) it readily follows that if $f$ is $L_{f}$ -smooth on $X$ for some norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ then

[TABLE]

On the other hand, from [28, Theorem 1] it follows that for all $x,y\in X$

[TABLE]

where $\Phi(A)=\displaystyle\min_{F\in\mathsf{faces}({\mathsf{conv}}(A))\atop\emptyset\neq F\neq{\mathsf{conv}}(A)}\mathsf{dist}(F,{\mathsf{conv}}(A\setminus F))$ .

Hence if $f$ is $\mu_{f}$ -strongly convex on $X$ for some norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ then for all $y,x\in X$ we have

[TABLE]

and consequently

[TABLE]

Therefore when $f$ is both $L_{f}$ -smooth and $\mu_{f}$ -strongly convex on $X$ for some norm $\|\cdot\|$ in ${\mathbb{R}}^{n}$ we have

[TABLE]

Once again, the right-hand side is an interesting combination of the usual condition number of $f$ and a kind of condition number of $A=\mathsf{vertices}(X)$ . Furthermore, by proceeding as in Example 5 it follows that when $f$ is of the form $f(x)=\frac{1}{2}\|Bx-b\|_{2}^{2}$ for some $B\in{\mathbb{R}}^{m\times n}$ and $b\in{\mathbb{R}}^{m}$ we have $L_{f,X,\mathfrak{D}}=\mathsf{diam}(BA)^{2}$ and $\mu_{f,X,\mathfrak{G}}=\Phi(BA)^{2}$ . Thus for $f(x)=\frac{1}{2}\|Bx-b\|_{2}^{2}$ we have

[TABLE]

This illustrates how the condition number of $f$ relative to $X$ depends on how the shape of $X$ and $f$ fit together.
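As a small numerical illustration, take $B$ to be the identity and $A$ the set of vertices of the standard simplex in ${\mathbb{R}}^{n}$, so that by the identities above the ratio $L_{f,X,\mathfrak{D}}/\mu_{f,X,\mathfrak{G}}$ equals $\mathsf{diam}(A)^{2}/\Phi(A)^{2}$. Every proper face of the simplex is the convex hull of a nonempty proper subset of the unit vectors, and the closest points between such a face and the convex hull of the remaining vertices are the two centroids. The sketch below uses this closed form, which is specific to the simplex. The function names are hypothetical.

```python
import math

def facial_distance_simplex(n):
    """Phi(A) for A = {e_1, ..., e_n}.

    A proper face with k vertices is at distance sqrt(1/k + 1/(n - k)) from
    the convex hull of the other n - k vertices (distance between centroids).
    By symmetry only the face size k matters, so we minimize over k.
    """
    return min(math.sqrt(1.0 / k + 1.0 / (n - k)) for k in range(1, n))

def condition_ratio_simplex(n):
    """diam(A)^2 / Phi(A)^2 for the standard simplex, where diam(A)^2 = 2."""
    return 2.0 / facial_distance_simplex(n) ** 2

phi_2 = facial_distance_simplex(2)
phi_4 = facial_distance_simplex(4)
ratio_4 = condition_ratio_simplex(4)
ratio_100 = condition_ratio_simplex(100)
```

For even $n$ the minimum is attained at $k=n/2$, so this ratio equals $n/2$ and grows with the dimension even though $f$ itself has condition number one in the classical sense.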

We also have the following sharper lower bound on $\mu_{f,X,\mathfrak{G}}^{\star}$ . From [28, Theorem 3] it follows that

[TABLE]

where $G\in\mathsf{faces}({\mathsf{conv}}(A))$ is the smallest face of ${\mathsf{conv}}(A)=X$ that contains $X^{\star}$ . It thus follows that if $f$ is $\mu_{f}$ -strongly convex on $X$ for some norm $\|\cdot\|$ then

[TABLE]

Finally, we note that Theorem 4 implies that $\mu_{f,X,\mathfrak{G}}^{\sharp}>0$ when $f$ is of the form $f(x)=g(Ex)+\left\langle b,x\right\rangle$ for some strongly convex function $g$ . Indeed, with a slight abuse of notation, let $A\in{\mathbb{R}}^{n\times N}$ denote the matrix whose columns are the elements of $A$ and consider the function $\tilde{f}:{\mathbb{R}}^{N}\rightarrow{\mathbb{R}}$ defined via $\tilde{f}:=f\circ A$ . Observe that for $u,v\in\Delta_{N-1}$

[TABLE]

Consequently,

[TABLE]

for the distance function $D(v,u):=\frac{1}{2}\|v-u\|_{1}^{2}$ . The functional growth constant $\mu_{\tilde{f},\Delta_{N-1},D}^{\sharp}$ in turn can be bounded below as detailed in Theorem 4 since $\tilde{f}$ can be written as $\tilde{f}(u)=g(EAu)+\left\langle b,Au\right\rangle$ and $g$ is strongly convex.

The linear convergence bounds in Proposition 9 are tight modulo some small constants. This can be readily inferred from [28, Example 3 and Example 4].

Appendix A Proof of Proposition 5

The construction of $T_{X}(x;A,S)$ implies $T_{X}(x;A,S)\subseteq T_{X}(x)$ and $\|(A|T_{X}(x;A,S))^{-1}\|\leq\|(A|T_{X}(x))^{-1}\|$ for all $x\in X$ . Hence

[TABLE]

where the last step follows from [26, Lemma 1]. This proves the second inequality in (33).

Let $H:=\sup_{C\in{\mathcal{T}}(A|X,S)}\|(A|C)^{-1}\|.$ The first inequality in (33) can be stated as follows: for all $y\in S$ and $x\in X$

[TABLE]

We prove (57) by contradiction. Suppose that there exist $y\in S$ and $x\in X\setminus Z_{A,X}(y)$ such that $\|Z_{A,X}(y)-x\|>H\cdot\|Ay-Ax\|.$ That is,

[TABLE]

Let $v:=(Ay-Ax)/\|Ay-Ax\|$ and consider the convex optimization problem

[TABLE]

Observe that $v\in A(T_{X}(x;A,S))$ since $y-x\in T_{X}(x;A,S)$ . Thus there exists $u\in T_{X}(x;A,S)$ such that $Au=v$ and

[TABLE]

Therefore there exists $(u,t)$ feasible for (59) with $t>0$ . On the other hand, (58) implies that there does not exist any $(u,t)$ feasible for (59) with $t=\|Ay-Ax\|$ . It thus follows that (59) has an optimal solution $(\hat{u},\hat{t})$ with $0<\hat{t}<\|Ay-Ax\|$ . Now consider the modification of (59) obtained by replacing $x$ with $x+\hat{u}\in X$ :

[TABLE]

Proceeding as above with $x+\hat{u}$ in lieu of $x$ it follows that (60) has an optimal solution $(u^{\prime},t^{\prime})$ with $0<t^{\prime}<\|Ay-Ax\|-\hat{t}$ . In particular, $(\hat{u}+u^{\prime},\hat{t}+t^{\prime})$ is a feasible solution to (59) with $\hat{t}+t^{\prime}>\hat{t}$ which contradicts the optimality of $(\hat{u},\hat{t}).$ We therefore conclude that (57) must hold and thus (33) is proven.

We next prove (34) when $A(S)$ is convex. To that end, suppose $C\in{\mathcal{T}}(A|X,S)$ and $0<\epsilon<\|(A|C)^{-1}\|$ . Then $C=T_{X}(\hat{x};A,S)$ for some $\hat{x}\in X$ . Let $\hat{v}\in C$ be such that $A\hat{v}\neq 0$ and $\|v\|\geq(\|(A|C)^{-1}\|-\epsilon)\cdot\|A\hat{v}\|$ for all $v\in C$ with $Av=A\hat{v}$ . By scaling $\hat{v}$ if necessary we can assume that $A(\hat{x}+\hat{v})\in{{\mathsf{conv}}}(A(S))=A(S)$ and thus $A(\hat{x}+\hat{v})=A\hat{y}$ for some $\hat{y}\in S$ . Observe that $\hat{x}+v\in Z_{A,X}(\hat{y})$ implies both $v\in C$ and $Av=A\hat{v}$ . It thus follows that

[TABLE]

Since this holds for all $C\in{\mathcal{T}}(A|X,S)$ and $0<\epsilon<\|(A|C)^{-1}\|$ identity (34) follows. ∎

Bibliography


[1] H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2016.
[2] A. Beck and S. Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 164:1–27, 2017.
[3] A. Beck and M. Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Meth. of Oper. Res., 59(2):235–247, 2004.
[4] S. Bubeck, Y. Lee, and M. Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
[5] G. Chen and M. Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993.
[6] D. Cheung and F. Cucker. A new condition number for linear programming. Math. Prog., 91(2):163–174, 2001.
[7] A. L. Dontchev, A. S. Lewis, and R. T. Rockafellar. The radius of metric regularity. Trans. Amer. Math. Soc., 355(2):493–517 (electronic), 2003.
[8] D. Drusvyatskiy, M. Fazel, and S. Roy. An optimal first order method based on optimal quadratic averaging. SIAM Journal on Optimization, 28(1):251–271, 2018.
