A hybrid penalty method for a class of optimization problems with multiple rank constraints

Tianxiang Liu; Ivan Markovsky; Ting Kei Pong; Akiko Takeda

arXiv:1906.10396·math.OC·June 26, 2019·SIAM J. Matrix Anal. Appl.


TL;DR

This paper introduces a hybrid penalty method combining penalty and pseudo-projection techniques to efficiently solve optimization problems with multiple rank constraints on Hankel matrices, relevant in system identification and signal processing.

Contribution

The paper proposes a novel hybrid penalty approach that integrates pseudo-projection methods for handling multiple rank constraints, with convergence analysis and practical efficiency demonstrated.

Findings

01. Efficient computation of pseudo-projection onto low-rank Hankel matrices using existing software.

02. Successful application of the hybrid method to numerical examples showing improved performance.

03. Convergence results established for the proposed hybrid penalty method.

Abstract

In this paper, we consider the problem of minimizing a smooth objective over multiple rank constraints on Hankel-structured matrices. Problems of this kind arise in system identification, system theory and signal processing, where the rank constraints are typically "hard constraints". To solve these problems, we propose a hybrid penalty method that combines a penalty method with a post-processing scheme. Specifically, we solve the penalty subproblems until the penalty parameter reaches a given threshold, and then switch to a local alternating "pseudo-projection" method to further reduce constraint violation. Pseudo-projection is a generalization of the concept of projection. We show that a pseudo-projection onto a single low-rank Hankel-structured matrix constraint can be computed efficiently by existing software such as SLRA (Markovsky and Usevich, 2014), under mild…

Equations (216)

p_{0}y(t)+p_{1}y(t+1)+\cdots+p_{s}y(t+s)=0,\quad\text{for }t=1,\ldots,T-s.

p\,{\mathcal{H}}_{s+1}(y)=0,

{\mathcal{H}}_{s+1}(y):=\begin{bmatrix}y(1)&y(2)&y(3)&\cdots&y(T-s)\\ y(2)&y(3)&\iddots&&y(T-s+1)\\ y(3)&\iddots&&&\vdots\\ \vdots&&&&\\ y(s+1)&y(s+2)&\cdots&\cdots&y(T)\end{bmatrix}

\min_{y_{1},\ldots,y_{N}\in{\rm I\!R}^{n}}\quad{\rm s.t.}

\widehat{\partial}h(y):=\bigg\{u:\;\liminf_{v\to y,\,v\neq y}\frac{h(v)-h(y)-u^{\top}(v-y)}{\|v-y\|}\geq 0\bigg\},

\partial h(y):=\{u:\ \exists\,u^{t}\to u,\ y^{t}\stackrel{h}{\to}y\ \text{with}\ u^{t}\in\widehat{\partial}h(y^{t})\ \text{for each }t\},

{\rm dist}(X,\Omega):=\inf_{Y\in\Omega}\|X-Y\|_{F}\quad\text{and}\quad{\mathcal{P}}_{\Omega}(X):=\mathop{\rm arg\,min}_{Y\in\Omega}\|X-Y\|_{F}.

\langle v,y-x\rangle\leq\frac{\sigma}{2}\|y-x\|^{2}\quad\text{for all }y\in\Omega\text{ with }\|y-\widebar{x}\|<\epsilon.

{\mathcal{L}}_{i}(y):={\mathcal{H}}_{r_{i}+1}(y_{i}),\ \ i=1,\ldots,N,\qquad{\mathcal{L}}(y):=[{\mathcal{H}}_{r+1}(y_{1})\ {\mathcal{H}}_{r+1}(y_{2})\ \cdots\ {\mathcal{H}}_{r+1}(y_{N})],

{\mathcal{H}}_{r+1}^{*}(Y)=\bigg[Y(1,1)\ \cdots\ \overbrace{\sum_{i+j=k+1}Y(i,j)}^{{\rm the}\ k{\rm th\ element}}\ \cdots\ Y(r+1,n-r)\bigg]^{\top}\in{\rm I\!R}^{n}.

{\mathcal{L}}^{*}[W_{1}\ W_{2}\ \cdots\ W_{N}]=vec\left({\mathcal{H}}_{r+1}^{*}(W_{1})\ {\mathcal{H}}_{r+1}^{*}(W_{2})\ \cdots\ {\mathcal{H}}_{r+1}^{*}(W_{N})\right).

\langle{\mathcal{L}}^{*}[W_{1}\ W_{2}\ \cdots\ W_{N}],y\rangle

\min_{y\in{\rm I\!R}^{Nn}}\quad{\rm s.t.}

\min_{y\in{\rm I\!R}^{Nn}}\ F(y):=f(y)+\delta_{\Omega}(y)+\sum_{i=1}^{k}\delta_{C_{i}}({\mathcal{A}}_{i}(y)),

\Omega=\{y:\ {\rm rank}({\mathcal{L}}_{i}(y))\leq r_{i},\ i=1,\ldots,N\},\quad C_{1}:=\{Y:\ {\rm rank}(Y)\leq r\}.

\Omega=\{y:\ {\rm rank}({\mathcal{L}}(y))\leq r\},\quad C_{i}=\{Y:\ {\rm rank}(Y)\leq r_{i}\},\ i=1,\ldots,N.

\Omega={\rm I\!R}^{Nn},\quad C_{i}=\{Y:\ {\rm rank}(Y)\leq r_{i}\},\ i=1,\ldots,N,\quad C_{N+1}=\{Y:\ {\rm rank}(Y)\leq r\}.

F_{\lambda}(y)=f(y)+\delta_{\Omega}(y)+\sum_{i=1}^{k}\frac{1}{2\lambda}{\rm dist}^{2}({\mathcal{A}}_{i}(y),C_{i}),

F_{\lambda}(y)

u_{i}^{l}\in{\mathcal{P}}_{\Omega}^{s}\left(y^{l}-\frac{1}{L_{l,i}}(\nabla h(y^{l})-\xi^{l});\ y^{l}\right)

F_{\lambda}(u_{i}^{l})\leq\max_{[l-M]_{+}\leq j\leq l}F_{\lambda}(y^{j})-\frac{c}{2}\|u_{i}^{l}-y^{l}\|^{2}.

\Omega_{1}:

\Omega_{2}:

\min_{y=vec(y_{1}\cdots y_{N})\in{\rm I\!R}^{Nn}}\ \frac{1}{2}\|y-\widetilde{y}\|^{2}

z^{t+1}\in{\mathcal{P}}_{\Omega_{1}}^{s}(x^{t};z^{t})\quad\text{and}\quad x^{t+1}\in{\mathcal{P}}_{\Omega_{2}}^{s}(z^{t+1};x^{t}),\quad t=0,1,\ldots

\|y^{t,l_{t}+1}-y^{t,l_{t}}\|\leq\epsilon_{t},\quad F_{\lambda_{t}}(y^{t,l_{t}})\leq F_{\lambda_{t}}(y^{t,0}),

{\rm dist}\bigg(0,\ \nabla f(y^{t,l_{t}})+N_{\Omega}(y^{t,l_{t}+1})+\sum_{i=1}^{k}\frac{1}{\lambda_{t}}{\mathcal{A}}_{i}^{*}\left({\mathcal{A}}_{i}(y^{t,l_{t}})-{\mathcal{P}}_{C_{i}}({\mathcal{A}}_{i}(y^{t,l_{t}}))\right)\bigg)\leq\epsilon_{t}.

\left\|u_{i}^{l}-\left(y^{l}-\frac{1}{L_{l,i}}(\nabla h(y^{l})-\xi^{l})\right)\right\|^{2}\leq\left\|y^{l}-\left(y^{l}-\frac{1}{L_{l,i}}(\nabla h(y^{l})-\xi^{l})\right)\right\|^{2},

\langle\nabla h(y^{l})-\xi^{l},\ u_{i}^{l}-y^{l}\rangle\leq-\frac{L_{l,i}}{2}\|u_{i}^{l}-y^{l}\|^{2}.


Taxonomy

Topics: Statistical and numerical algorithms · Image and Signal Denoising Methods · Sparse and Compressive Sensing Techniques

Full text


Tianxiang Liu: RIKEN Center for Advanced Intelligence Project, 1-4-1, Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan.

Ivan Markovsky: Department ELEC, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium.

Ting Kei Pong: Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong. This author was supported partly by Hong Kong Research Grants Council PolyU153004/18p.

Akiko Takeda: Department of Creative Informatics, Graduate School of Information Science and Technology, the University of Tokyo, Tokyo, Japan, and RIKEN Center for Advanced Intelligence Project, 1-4-1, Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan.

Abstract

In this paper, we consider the problem of minimizing a smooth objective over multiple rank constraints on Hankel-structured matrices. Problems of this kind arise in system identification, system theory and signal processing, where the rank constraints are typically “hard constraints”. To solve these problems, we propose a hybrid penalty method that combines a penalty method with a post-processing scheme. Specifically, we solve the penalty subproblems until the penalty parameter reaches a given threshold, and then switch to a local alternating “pseudo-projection” method to further reduce constraint violation. Pseudo-projection is a generalization of the concept of projection. We show that a pseudo-projection onto a single low-rank Hankel-structured matrix constraint can be computed efficiently by existing software such as SLRA (Markovsky and Usevich, 2014), under mild assumptions. We also demonstrate how the penalty subproblems in the hybrid penalty method can be solved by pseudo-projection-based optimization methods, and then present some convergence results for our hybrid penalty method. Finally, the efficiency of our method is illustrated by numerical examples.

Keywords: Hankel-structure, system identification, hybrid penalty method, pseudo-projection.

AMS subject classifications: 15B05, 90C30.

1 Introduction

Many data modeling problems can be posed and solved as structured low-rank approximation problems, i.e., problems of approximating matrices by preserving the structure but reducing the rank [12]. The to-be-approximated matrices are constructed from data and the model’s complexity is related to the rank of the approximation—the lower the rank, the simpler the model. However, the simpler the model is, the higher the approximation error is. One way to deal with this fundamental trade-off between model complexity and model accuracy is to solve a sequence of low-rank approximation problems with increasing bounds on the rank.

In static linear data modeling problems, i.e., models defined by linear algebraic equations, the data matrices are unstructured. All spectral and Frobenius norm optimal unstructured low-rank approximations can be obtained by truncating the singular value decomposition [4]. This result, known as the Eckart–Young–Mirsky theorem [2], is at the heart of dimensionality reduction methods in machine learning [21]. Unstructured low-rank approximation is equivalent to principal component analysis in statistics and total least squares in numerical linear algebra [17].
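The truncation step behind the Eckart–Young–Mirsky theorem can be sketched in a few lines of NumPy. This is only an illustration: the function name `truncated_svd`, the random test matrix and the chosen rank are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def truncated_svd(A, r):
    """Rank-r approximation of A obtained by keeping the r largest singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first r left singular vectors by the singular values, then recombine.
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 8))   # illustrative data matrix
r = 2                             # illustrative rank bound

A_r = truncated_svd(A, r)
s = np.linalg.svd(A, compute_uv=False)

# Approximation errors in the two norms covered by the theorem.
frob_error = np.linalg.norm(A - A_r, "fro")
spec_error = np.linalg.norm(A - A_r, 2)
```

Under the theorem, `frob_error` should match the root of the sum of the squared discarded singular values, and `spec_error` should match the first discarded singular value `s[r]`.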

The objects of study in system theory, control, and signal processing are dynamical models. In linear time-invariant data modeling problems, i.e., for models defined by linear constant-coefficient difference equations, the data matrix is Hankel structured [1, 5, 10, 16]. To see this, consider a system defined by the equation

$$p_{0}y(t)+p_{1}y(t+1)+\cdots+p_{s}y(t+s)=0,\quad\text{for }t=1,\ldots,T-s.$$

By definition, the time series $y=[y(1),\ldots,y(T)]^{\top}\in{\rm I\!R}^{T}$ is a trajectory of the system if

$$p\,{\mathcal{H}}_{s+1}(y)=0,$$

where $p:=[p_{0}\ p_{1}\ \cdots\ p_{s}]\neq 0$ is the parameter vector of the system and

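$${\mathcal{H}}_{s+1}(y):=\begin{bmatrix}y(1)&y(2)&y(3)&\cdots&y(T-s)\\ y(2)&y(3)&\iddots&&y(T-s+1)\\ y(3)&\iddots&&&\vdots\\ \vdots&&&&\\ y(s+1)&y(s+2)&\cdots&\cdots&y(T)\end{bmatrix}$$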

is a Hankel matrix, constructed from the time series. Since the nonzero row vector $p$ lies in the left null space of this matrix, which has $s+1$ rows, we have ${\rm rank}({\mathcal{H}}_{s+1}(y))\leq s$. The resulting Hankel structured low-rank approximation problem does not admit an analytic solution in terms of the singular value decomposition. For this reason, numerous local optimization [11] as well as convex relaxation [3] methods have been proposed for solving it.
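The link between a difference equation and the rank of the Hankel matrix can be checked numerically. The sketch below simulates an illustrative second-order recursion (the coefficients, initial values and the helper name `hankel` are assumptions chosen for the example) and verifies that the parameter vector annihilates the Hankel matrix from the left.

```python
import numpy as np

def hankel(y, rows):
    """Hankel matrix with `rows` rows whose (i, j) entry (0-based) is y[i + j]."""
    y = np.asarray(y, dtype=float)
    cols = y.size - rows + 1
    return np.array([y[i:i + cols] for i in range(rows)])

# Illustrative system: y(t+2) = 1.5 y(t+1) - 0.7 y(t), i.e. p = [0.7, -1.5, 1] and s = 2.
p = np.array([0.7, -1.5, 1.0])
s, T = 2, 12
y = np.zeros(T)
y[0], y[1] = 1.0, 0.5
for t in range(T - s):
    y[t + 2] = 1.5 * y[t + 1] - 0.7 * y[t]

H = hankel(y, s + 1)                          # (s + 1) x (T - s) matrix
residual = np.linalg.norm(p @ H)              # p H_{s+1}(y), zero up to rounding
rank_H = np.linalg.matrix_rank(H, tol=1e-10)  # numerical rank, at most s
```

Because `p` is a nonzero left null vector of `H`, the numerical rank cannot exceed `s`; for this trajectory both modes of the recursion are excited, so the rank equals `s`.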

In this paper, we consider a generalization of the Hankel structured low-rank approximation problem to multiple rank constraints. An application that motivates this generalization is the common dynamics estimation problem in multi-channel signal processing [13, 14, 18]. Modeling each channel separately requires an individual rank constraint of a Hankel matrix in the optimization problem. Imposing the assumption that the channels have common dynamics then leads to an additional (coupling) rank constraint. The problem of common dynamics estimation is closely related to the problem of approximate common factor computation of multiple polynomials in computer algebra [6, 23]. Specifically, we consider the following optimization problem with multiple rank constraints:

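$$\begin{array}{cl}\min\limits_{y_{1},\ldots,y_{N}\in{\rm I\!R}^{n}}&f(y)\\ {\rm s.t.}&{\rm rank}({\mathcal{H}}_{r_{i}+1}(y_{i}))\leq r_{i},\ \ i=1,\ldots,N,\\ &{\rm rank}([{\mathcal{H}}_{r+1}(y_{1})\ {\mathcal{H}}_{r+1}(y_{2})\ \cdots\ {\mathcal{H}}_{r+1}(y_{N})])\leq r,\end{array}\tag{1}$$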

where $y=vec(y_{1}\cdots y_{N})$ (see Section 2 for notation), $r_{i}$ and $r$ are positive integers satisfying $r_{i}\leq r\leq\lfloor\frac{n-1}{2}\rfloor$ ( $i=1,\ldots,N$ ), and $f$ represents the loss function, which is nonnegative, level-bounded and smooth with Lipschitz continuous gradient. For example, $f(y)=\frac{1}{2}\|y-\widebar{y}\|^{2}$ , where $\widebar{y}\in{\rm I\!R}^{Nn}$ is the noisy observation signal.

For constrained problems such as (1) with smooth objectives, a classical solution method is the gradient projection algorithm, whose iterations require projections onto the feasible set. However, the coupling structure of the last constraint in (1) makes projection onto the feasible set a challenging problem: indeed, even the projection onto the set defined by each single constraint in (1) does not admit a closed-form solution. Thus, variants of proximal gradient algorithms cannot be directly applied to solving (1). Fortunately, we can show that one can obtain a so-called “pseudo-projection” (see Definition 2.2) onto the set defined by each single constraint by some existing solvers such as SLRA [15], under mild assumptions.

Motivated by this, we adopt a penalty approach and construct penalty subproblems whose feasible regions are either ${\rm I\!R}^{Nn}$, or defined by either the first $N$ constraints or the last constraint in (1): the pseudo-projections are easy to compute in all these cases. We then propose an algorithm vNPGmajor for the penalty subproblems, making explicit use of the difference-of-convex (DC) structure of the penalty functions. The algorithm vNPGmajor is a variant of NPGmajor in [8, Algorithm 2] and is based on computing pseudo-projections, which can be done efficiently for the feasible region of the penalty subproblems.

While approximate solutions to (1) can now be obtained by our penalty method, such solutions are typically not feasible for (1). This is not ideal for applications such as system identification in which solution feasibility is an important concern [10]. Even though constraint violation can theoretically be reduced via solving a sequence of penalty subproblems with increasing weights in the penalty functions, in practice this strategy results in high computational cost and numerical instability. To resolve this issue, we shift to a post-processing method after obtaining a moderately accurate solution by our penalty method. Specifically, starting from such a solution obtained from the penalty method, we apply an alternating pseudo-projection method, alternating between the set defined by the first $N$ constraints in (1) and that defined by the last constraint there, to reduce constraint violation.

Our main contributions are highlighted as follows:

• We propose a hybrid penalty method (Algorithm 2) for solving (1): A penalty scheme allowing three different kinds of penalty subproblems, followed by an alternating pseudo-projection method for post-processing. An algorithm, vNPGmajor (Algorithm 1), is proposed for the penalty subproblems.

• We prove some convergence results for the hybrid penalty method, including an error bound for the penalty method (Theorem 3.3) and the convergence rate for the alternating pseudo-projection method (Theorem 3.7).

• We demonstrate how a pseudo-projection can be obtained by the solver SLRA [15] in Section 4, under mild assumptions.

The rest of this paper is organized as follows. In Section 2, we introduce notation and some basic properties of Hankel operators. The hybrid penalty method and the corresponding convergence analysis are presented in Section 3. In Section 4, we demonstrate how to compute pseudo-projections. Numerical simulation results are presented in Section 5. Finally, we give some concluding remarks in Section 6.

2 Notation and preliminaries

Throughout this paper, we let ${\rm I\!R}^{n}$ denote the $n$-dimensional Euclidean space and $\|\cdot\|$ denote the Euclidean norm induced by the vector inner product $\langle\cdot,\cdot\rangle$. For an $x\in{\rm I\!R}^{n}$, we let $x(i)$ denote its $i$th entry. For vectors $y_{1},\ldots,y_{N}\in{\rm I\!R}^{n}$, we let $vec\left(y_{1}\cdots\,y_{N}\right):=[y_{1}^{\top}\cdots\ y_{N}^{\top}]^{\top}\in{\rm I\!R}^{Nn}$. Given a matrix $A\in{\rm I\!R}^{m\times n}$, we let $\|A\|_{F}$ denote its Frobenius norm, $\|A\|_{2}$ denote its spectral norm, $A^{\top}$ denote its transpose and $A(i,j)$ denote its $(i,j)$th entry. For $A$, $B\in{\rm I\!R}^{m\times n}$, we denote the matrix inner product by $\langle A,B\rangle:=\sum_{i=1}^{m}\sum_{j=1}^{n}A(i,j)B(i,j)$. For a linear operator ${\mathcal{A}}$, we use ${\mathcal{A}}^{*}$, ${\rm Range}({\mathcal{A}})$ and ${\rm ker}({\mathcal{A}})$ to denote its adjoint, range and kernel, respectively.

For an extended-real-valued function $h:{\rm I\!R}^{n}\rightarrow{\rm I\!R}\cup\{\infty\}$ , we say that $h$ is proper if ${\rm dom}\,h:=\{x:h(x)<\infty\}\neq\emptyset$ , and is closed if it is lower semi-continuous. Following [20, Definition 8.3], for a proper closed function $h:{\rm I\!R}^{n}\rightarrow{\rm I\!R}\cup\{\infty\}$ , the regular subdifferential of $h$ at $y\in{\rm dom}\,h$ is defined as

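$$\widehat{\partial}h(y):=\bigg\{u:\;\liminf_{v\to y,\,v\neq y}\frac{h(v)-h(y)-u^{\top}(v-y)}{\|v-y\|}\geq 0\bigg\},$$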

and the (limiting) subdifferential of $h$ at $y\in{\rm dom}\,h$ is defined as

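$$\partial h(y):=\{u:\ \exists\,u^{t}\to u,\ y^{t}\stackrel{h}{\to}y\ \text{with}\ u^{t}\in\widehat{\partial}h(y^{t})\ \text{for each }t\},$$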

where $y^{t}\stackrel{{\scriptstyle h}}{{\to}}y$ means both $h(y^{t})\to h(y)$ and $y^{t}\to y$ . We say that $\bar{y}$ is a stationary point of $h$ if $0\in\partial h(\bar{y})$ . It is known from [20, Theorem 10.1] that any local minimizer of $h$ is a stationary point.

For a nonempty closed set $\Omega\subseteq{\rm I\!R}^{n}$, we let $\delta_{\Omega}$ denote the indicator function of $\Omega$, which is zero in $\Omega$ and is infinity otherwise. The regular normal cone and (limiting) normal cone of $\Omega$ at $y\in\Omega$ are defined by $\widehat{N}_{\Omega}(y):=\widehat{\partial}\delta_{\Omega}(y)$ and $N_{\Omega}(y):=\partial\delta_{\Omega}(y)$ respectively. We use ${\rm dist}(x,\Omega)$ to denote the distance from an $x\in{\rm I\!R}^{n}$ to $\Omega$ and ${\mathcal{P}}_{\Omega}(x)$ to denote the projection, i.e., ${\rm dist}(x,\Omega):=\inf_{y\in\Omega}\|x-y\|$ and ${\mathcal{P}}_{\Omega}(x):=\mathop{\rm arg\,min}_{y\in\Omega}\|x-y\|$. For a nonempty closed set $\Omega\subseteq{\rm I\!R}^{m\times n}$, the distance from an $X\in{\rm I\!R}^{m\times n}$ to $\Omega$ and its projection are defined with respect to the Frobenius norm:

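$${\rm dist}(X,\Omega):=\inf_{Y\in\Omega}\|X-Y\|_{F}\quad\text{and}\quad{\mathcal{P}}_{\Omega}(X):=\mathop{\rm arg\,min}_{Y\in\Omega}\|X-Y\|_{F}.$$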

We next recall the definition of prox-regular sets; see [20, Exercise 13.31].

Definition 2.1 (Prox-regular sets).

A closed set $\Omega$ is prox-regular at $\widebar{x}\in\Omega$ for $\widebar{v}\in N_{\Omega}(\widebar{x})$ if there exist $\epsilon>0$ and $\sigma\geq 0$ such that whenever $x\in\Omega$ and $v\in N_{\Omega}(x)$ with $\|x-\widebar{x}\|<\epsilon$ and $\|v-\widebar{v}\|<\epsilon$ , it holds that

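$$\langle v,y-x\rangle\leq\frac{\sigma}{2}\|y-x\|^{2}\quad\text{for all }y\in\Omega\text{ with }\|y-\widebar{x}\|<\epsilon.$$

Furthermore, $\Omega$ is prox-regular at $\widebar{x}$ if it is prox-regular at $\widebar{x}$ for all $\widebar{v}\in N_{\Omega}(\widebar{x})$.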

We now define the notion of pseudo-projection, which will be used in our subsequent discussions.

Definition 2.2 (Pseudo-projection).

Let $\Omega\subseteq{\rm I\!R}^{n}$ be a nonempty closed set, $u\in\Omega$ and $x\in{\rm I\!R}^{n}$ . The pseudo-projection ${\mathcal{P}}^{s}_{\Omega}(x;u)$ of $x$ onto $\Omega$ with respect to $u$ is the collection of all $y\in\Omega$ satisfying:

(a) (Stationarity) $x-y\in N_{\Omega}(y)$; and

(b) (Function value improvement) $\|y-x\|\leq\|u-x\|$.

Notice that any element of the pseudo-projection is a stationary point of the corresponding projection problem, i.e., it is a stationary point of the function $w\mapsto\frac{1}{2}\|w-x\|^{2}+\delta_{\Omega}(w)$ . Also, each such element improves the function value of the corresponding projection problem relative to a given point $u\in\Omega$ . Pseudo-projection onto a nonempty closed set is always nonempty: indeed, in view of [20, Example 6.16] and [20, Proposition 6.5], we have ${\mathcal{P}}_{\Omega}(x)\subseteq{\mathcal{P}}^{s}_{\Omega}(x;u)$ for all $x\in{\rm I\!R}^{n}$ and all $u\in\Omega$ .
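To make Definition 2.2 concrete, here is a toy sketch on a set that is not from the paper: the unit circle in ${\rm I\!R}^{2}$. For this set the normal cone at $y$ is the line spanned by $y$, so for $x\neq 0$ the stationarity condition holds exactly at $y=\pm x/\|x\|$, and condition (b) then filters these two candidates by their distance to $x$. The function name and the test points are assumptions made for illustration.

```python
import numpy as np

def pseudo_projection_circle(x, u):
    """Elements of the pseudo-projection of x (x != 0) onto the unit circle w.r.t. u.

    Stationarity: x - y must lie on the line spanned by y, giving y = +-x/||x||.
    Function value improvement: keep candidates with ||y - x|| <= ||u - x||.
    """
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    candidates = [x / np.linalg.norm(x), -x / np.linalg.norm(x)]
    bound = np.linalg.norm(u - x)
    # Small tolerance so that a candidate equal to u is not lost to rounding.
    return [y for y in candidates if np.linalg.norm(y - x) <= bound + 1e-12]

x = np.array([2.0, 0.0])
# Reference point close enough to x: only the true projection (1, 0) qualifies.
near = pseudo_projection_circle(x, u=np.array([0.0, 1.0]))
# Reference point at the farthest stationary point: both candidates qualify.
far = pseudo_projection_circle(x, u=np.array([-1.0, 0.0]))
```

The example shows that the pseudo-projection always contains the true projection and may also contain other stationary points, but only those that are no worse than the reference point $u$.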

For notational simplicity, we define linear operators ${\mathcal{L}}_{i}:{\rm I\!R}^{Nn}\to{\rm I\!R}^{(r_{i}+1)\times(n-r_{i})}$ ( $i=1,\ldots,N$ ) and ${\mathcal{L}}:{\rm I\!R}^{Nn}\to{\rm I\!R}^{(r+1)\times N(n-r)}$ as

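$${\mathcal{L}}_{i}(y):={\mathcal{H}}_{r_{i}+1}(y_{i}),\ \ i=1,\ldots,N,\qquad{\mathcal{L}}(y):=[{\mathcal{H}}_{r+1}(y_{1})\ {\mathcal{H}}_{r+1}(y_{2})\ \cdots\ {\mathcal{H}}_{r+1}(y_{N})],\tag{2}$$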

where $y=vec\left(y_{1}\cdots\,y_{N}\right)\in{\rm I\!R}^{Nn}$ , and $r_{i}$ ( $i=1,\ldots,N$ ) and $r$ are defined in (1). We now present some properties of the linear operators ${\mathcal{H}}_{l}(\cdot)$ and ${\mathcal{L}}^{*}$ .

Lemma 2.3.

For any $Y\in{\rm I\!R}^{(r+1)\times(n-r)}$ ,

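$${\mathcal{H}}_{r+1}^{*}(Y)=\bigg[Y(1,1)\ \cdots\ \overbrace{\sum_{i+j=k+1}Y(i,j)}^{{\rm the}\ k{\rm th\ element}}\ \cdots\ Y(r+1,n-r)\bigg]^{\top}\in{\rm I\!R}^{n}.$$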

Lemma 2.4.

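For any $W_{i}\in{\rm I\!R}^{(r+1)\times(n-r)}$, $i=1,\ldots,N$, it holds that

$${\mathcal{L}}^{*}[W_{1}\ W_{2}\ \cdots\ W_{N}]=vec\left({\mathcal{H}}_{r+1}^{*}(W_{1})\ {\mathcal{H}}_{r+1}^{*}(W_{2})\ \cdots\ {\mathcal{H}}_{r+1}^{*}(W_{N})\right).$$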

Proof 2.5.

Fix any $W_{i}\in{\rm I\!R}^{(r+1)\times(n-r)}$ , $i=1,\ldots,N$ . According to the definition of adjoint, for any $y=vec\left(y_{1}\ \cdots\ y_{N}\right)\in{\rm I\!R}^{Nn}$ , we have

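$$\langle{\mathcal{L}}^{*}[W_{1}\ W_{2}\ \cdots\ W_{N}],y\rangle=\langle[W_{1}\ W_{2}\ \cdots\ W_{N}],{\mathcal{L}}(y)\rangle=\sum_{i=1}^{N}\langle W_{i},{\mathcal{H}}_{r+1}(y_{i})\rangle=\sum_{i=1}^{N}\langle{\mathcal{H}}_{r+1}^{*}(W_{i}),y_{i}\rangle=\left\langle vec\left({\mathcal{H}}_{r+1}^{*}(W_{1})\ \cdots\ {\mathcal{H}}_{r+1}^{*}(W_{N})\right),y\right\rangle.$$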

Then the conclusion follows from this and the arbitrariness of $y$. This completes the proof.
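The adjoint formulas in Lemmas 2.3 and 2.4 can be sanity-checked numerically through the defining identity $\langle{\mathcal{L}}(y),W\rangle=\langle y,{\mathcal{L}}^{*}(W)\rangle$. The sketch below uses hypothetical helper names (`hankel`, `hankel_adjoint`, `L_op`, `L_adjoint`) and arbitrary small dimensions; it is a check of the formulas, not code from the paper.

```python
import numpy as np

def hankel(y, rows):
    """Hankel matrix with `rows` rows whose (i, j) entry (0-based) is y[i + j]."""
    y = np.asarray(y, dtype=float)
    cols = y.size - rows + 1
    return np.array([y[i:i + cols] for i in range(rows)])

def hankel_adjoint(Y):
    """Adjoint of the Hankel operator: sum Y along each anti-diagonal (Lemma 2.3)."""
    rows, cols = Y.shape
    out = np.zeros(rows + cols - 1)
    for i in range(rows):
        for j in range(cols):
            out[i + j] += Y[i, j]   # 0-based entry (i, j) feeds element i + j
    return out

def L_op(y, N, r):
    """L(y) = [H_{r+1}(y_1) ... H_{r+1}(y_N)] for y = vec(y_1 ... y_N)."""
    blocks = np.split(np.asarray(y, dtype=float), N)
    return np.hstack([hankel(b, r + 1) for b in blocks])

def L_adjoint(W, N):
    """L^*[W_1 ... W_N] = vec(H^*(W_1) ... H^*(W_N)) (Lemma 2.4)."""
    return np.concatenate([hankel_adjoint(Wi) for Wi in np.hsplit(W, N)])

rng = np.random.default_rng(1)
N, n, r = 3, 9, 2                               # illustrative sizes
y = rng.standard_normal(N * n)
W = rng.standard_normal((r + 1, N * (n - r)))

lhs = np.sum(L_op(y, N, r) * W)                 # <L(y), W>
rhs = np.dot(y, L_adjoint(W, N))                # <y, L^*(W)>
```

If the two lemmas are implemented as stated, `lhs` and `rhs` agree up to rounding for any choice of `y` and `W`.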

3 A hybrid penalty method

Notice that there are multiple rank constraints in (1), making it difficult to compute the projection onto the feasible set. To handle these constraints, one intuitive idea is to use a penalty method to “reduce” the number of constraints. Specifically, we replace some or all constraints by penalty functions which consist of penalty parameters and measures of constraint violation. However, approximate solutions returned by penalty methods are typically not feasible for (1). Although we can theoretically reduce constraint violation by increasing the weights in the penalty functions when feasibility is important (e.g., in applications such as system identification [10]), this strategy leads to high computational cost and numerical instability in practice. One way out would be to shift to a local refinement method after obtaining a moderately accurate solution by the penalty method.

Based on these intuitive ideas, our solution method will then consist of two stages: a penalty method, followed by a post-processing scheme. We will describe the penalty method in Section 3.1, the post-processing scheme in Section 3.2 and the hybrid penalty method and its convergence analysis in Section 3.3.

3.1 Stage 1: A penalty method

To describe the penalty method, we first rewrite (1) as follows, using notation in (2):

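$$\begin{array}{cl}\min\limits_{y\in{\rm I\!R}^{Nn}}&f(y)\\ {\rm s.t.}&{\rm rank}({\mathcal{L}}_{i}(y))\leq r_{i},\ \ i=1,\ldots,N,\\ &{\rm rank}({\mathcal{L}}(y))\leq r.\end{array}$$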

This can be further equivalently rewritten as

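$$\min_{y\in{\rm I\!R}^{Nn}}\ F(y):=f(y)+\delta_{\Omega}(y)+\sum_{i=1}^{k}\delta_{C_{i}}({\mathcal{A}}_{i}(y)),\tag{3}$$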

with three ways of setting $k$ , ${\mathcal{A}}_{i}$ , $\Omega$ and $C_{i}$ :

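• Variant I: $k=1$, ${\mathcal{A}}_{1}={\mathcal{L}}$ and

$$\Omega=\{y:\ {\rm rank}({\mathcal{L}}_{i}(y))\leq r_{i},\ i=1,\ldots,N\},\quad C_{1}:=\{Y:\ {\rm rank}(Y)\leq r\}.$$

• Variant II: $k=N$, ${\mathcal{A}}_{i}={\mathcal{L}}_{i}$ ($i=1,\ldots,N$) and

$$\Omega=\{y:\ {\rm rank}({\mathcal{L}}(y))\leq r\},\quad C_{i}=\{Y:\ {\rm rank}(Y)\leq r_{i}\},\ i=1,\ldots,N.$$

• Variant III: $k=N+1$, ${\mathcal{A}}_{i}={\mathcal{L}}_{i}$ ($i=1,\ldots,N$), ${\mathcal{A}}_{N+1}={\mathcal{L}}$ and

$$\Omega={\rm I\!R}^{Nn},\quad C_{i}=\{Y:\ {\rm rank}(Y)\leq r_{i}\},\ i=1,\ldots,N,\quad C_{N+1}=\{Y:\ {\rm rank}(Y)\leq r\}.$$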

Notice that for the above three variants, the projection onto $C_{i}$ has a closed-form solution. On the other hand, while the projection onto $\Omega$ does not in general admit a closed-form solution, some kinds of stationary points of this projection problem can be approximately and efficiently obtained by some existing solvers such as SLRA [15], as we will show in Section 4, under mild assumptions.

Now we are ready to describe our penalty method. We first replace the constraints ${\mathcal{A}}_{i}(y)\in C_{i}$ ( $i=1,\ldots,k$ ) in (3) by a penalty for violating the constraints to obtain the auxiliary function

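$$F_{\lambda}(y)=f(y)+\delta_{\Omega}(y)+\sum_{i=1}^{k}\frac{1}{2\lambda}{\rm dist}^{2}({\mathcal{A}}_{i}(y),C_{i}),\tag{4}$$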

where $\lambda>0$ is the penalty parameter. Then we approximately minimize the auxiliary function $F_{\lambda}(y)$ and update $y$ while decreasing $\lambda$ .
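As a rough illustration of the auxiliary function, the sketch below evaluates $F_{\lambda}$ for Variant III, where $\Omega$ is the whole space and every rank constraint is penalized. It assumes the example loss $f(y)=\frac{1}{2}\|y-\widebar{y}\|^{2}$ and uses the fact that the squared Frobenius distance to a rank-bounded set is the sum of the squared discarded singular values. Helper names, signal choices and parameter values are assumptions for the example only.

```python
import numpy as np

def hankel(y, rows):
    """Hankel matrix with `rows` rows built from the 1-D signal y."""
    y = np.asarray(y, dtype=float)
    cols = y.size - rows + 1
    return np.array([y[i:i + cols] for i in range(rows)])

def dist2_rank(X, r):
    """Squared Frobenius distance from X to {Y : rank(Y) <= r}."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s[r:] ** 2))

def F_lambda(y, y_bar, N, ranks, r, lam):
    """Variant III penalty function with f(y) = 0.5 * ||y - y_bar||^2."""
    blocks = np.split(np.asarray(y, dtype=float), N)
    # Penalties for the N individual constraints rank(L_i(y)) <= r_i.
    penalty = sum(dist2_rank(hankel(b, ri + 1), ri) for b, ri in zip(blocks, ranks))
    # Penalty for the coupling constraint rank(L(y)) <= r.
    penalty += dist2_rank(np.hstack([hankel(b, r + 1) for b in blocks]), r)
    return 0.5 * np.sum((y - y_bar) ** 2) + penalty / (2.0 * lam)

N, n, ranks, r = 2, 8, [1, 1], 2
t = np.arange(n)
y_feas = np.concatenate([0.9 ** t, 0.5 ** t])   # geometric signals: rank-1 Hankel blocks
rng = np.random.default_rng(2)
y_bar = y_feas + 0.05 * rng.standard_normal(N * n)

value_feasible = F_lambda(y_feas, y_bar, N, ranks, r, lam=1.0)
value_noisy_loose = F_lambda(y_bar, y_bar, N, ranks, r, lam=1.0)
value_noisy_tight = F_lambda(y_bar, y_bar, N, ranks, r, lam=0.1)
```

At the feasible point the penalty vanishes and only the loss remains, while at the noisy observation the loss is zero and the penalty scales like $1/\lambda$, which is why decreasing $\lambda$ pushes the minimizers toward feasibility.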

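Note that each term of the penalty function in (4) can be written as the Moreau envelope of the indicator function $\delta_{C_{i}}(\cdot)$. Using the DC decomposition of the Moreau envelope as in [8, Equation 6], we see that

$$F_{\lambda}(y)=\underbrace{f(y)+\frac{1}{2\lambda}\sum_{i=1}^{k}\|{\mathcal{A}}_{i}(y)\|_{F}^{2}}_{h(y)}+\delta_{\Omega}(y)-\underbrace{\sum_{i=1}^{k}\sup_{Y\in C_{i}}\left\{\frac{1}{\lambda}\langle{\mathcal{A}}_{i}(y),Y\rangle-\frac{1}{2\lambda}\|Y\|_{F}^{2}\right\}}_{g(y)},\tag{5}$$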

where $h$ is a smooth function and $g$ is a convex function with $\sum_{i=1}^{k}\frac{1}{\lambda}{\mathcal{A}}_{i}^{*}\left({\mathcal{P}}_{C_{i}}({\mathcal{A}}_{i}(y))\right)\subseteq\partial{g(y)}$ ; see [8, Equation 7]. Recall that the projection onto $C_{i}$ is easy to compute. Thus, for Variant III, in which $\Omega={\rm I\!R}^{Nn}$ , $F_{\lambda}$ can be minimized via NPGmajor in [8, Algorithm 2]. However, for Variants I and II, the projection onto $\Omega$ is not easy to compute. Fortunately, one can obtain some kind of stationary points for the corresponding projection problems via specific solvers: as we shall see in Section 4, such a point belongs to the set of pseudo-projection (see Definition 2.2) under mild assumptions. Thus, we propose a variant of NPGmajor as Algorithm 1, which we call vNPGmajor, where we replace the projection in the subproblem by pseudo-projection.
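The role of the DC structure can be illustrated with a deliberately simplified sketch. It is not vNPGmajor: it uses one channel, one penalized constraint with ${\mathcal{A}}={\mathcal{H}}_{r+1}$, $\Omega={\rm I\!R}^{n}$ (so the pseudo-projection step is the identity), a fixed step size $1/L$ and no line search. It assumes $f(y)=\frac{1}{2}\|y-\widebar{y}\|^{2}$, the splitting $h(y)=f(y)+\frac{1}{2\lambda}\|{\mathcal{A}}(y)\|_{F}^{2}$ with $g$ the convex remainder, and the subgradient $\xi=\frac{1}{\lambda}{\mathcal{A}}^{*}({\mathcal{P}}_{C}({\mathcal{A}}(y)))$. All names and parameter values are illustrative.

```python
import numpy as np

def hankel(y, rows):
    y = np.asarray(y, dtype=float)
    cols = y.size - rows + 1
    return np.array([y[i:i + cols] for i in range(rows)])

def hankel_adjoint(Y):
    """Sum Y along anti-diagonals (adjoint of the Hankel operator)."""
    rows, cols = Y.shape
    out = np.zeros(rows + cols - 1)
    for i in range(rows):
        out[i:i + cols] += Y[i]
    return out

def project_rank(X, r):
    """One element of the projection onto {Y : rank(Y) <= r} via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def F_lambda(y, y_bar, r, lam):
    X = hankel(y, r + 1)
    gap = X - project_rank(X, r)
    return 0.5 * np.sum((y - y_bar) ** 2) + np.sum(gap ** 2) / (2.0 * lam)

def majorized_step(y, y_bar, r, lam):
    """Minimize the quadratic-plus-linear majorant of h - g built at y."""
    X = hankel(y, r + 1)
    grad_h = (y - y_bar) + hankel_adjoint(X) / lam      # gradient of the smooth part h
    xi = hankel_adjoint(project_rank(X, r)) / lam       # a subgradient of the convex part g
    # A^*A is diagonal with entries at most min(r + 1, n - r), which bounds Lip(grad h).
    L = 1.0 + min(r + 1, y.size - r) / lam
    return y - (grad_h - xi) / L

rng = np.random.default_rng(3)
n, r, lam = 20, 2, 0.1
t = np.arange(n)
y_bar = 0.9 ** t + 0.6 ** t + 0.05 * rng.standard_normal(n)   # noisy rank-2 signal
y = y_bar.copy()
values = [F_lambda(y, y_bar, r, lam)]
for _ in range(200):
    y = majorized_step(y, y_bar, r, lam)
    values.append(F_lambda(y, y_bar, r, lam))
```

Because $h$ is majorized by its quadratic upper bound and $g$ is minorized by its linearization, each step cannot increase $F_{\lambda}$. The nonmonotone line search and the pseudo-projection onto a nontrivial $\Omega$ used in Algorithm 1 are omitted here.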

The well-definedness of (7), i.e., whether the line-search loop terminates after a finite number of iterations, will be discussed in Section 3.3.

3.2 Stage 2: Post-processing scheme

After we obtain an approximate solution by the penalty method, we shift to a post-processing method. A natural and simple choice for post-processing is the alternating projection method. Let

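$$\Omega_{1}:=\{y\in{\rm I\!R}^{Nn}:\ {\rm rank}({\mathcal{L}}_{i}(y))\leq r_{i},\ i=1,\ldots,N\},\qquad\Omega_{2}:=\{y\in{\rm I\!R}^{Nn}:\ {\rm rank}({\mathcal{L}}(y))\leq r\}.\tag{8}$$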

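In the classical alternating projection method, one has to find the global minimizers of the following problems in each iteration, for some $\widetilde{y}$:

$$\begin{array}{cl}\min\limits_{y=vec(y_{1}\cdots y_{N})\in{\rm I\!R}^{Nn}}&\frac{1}{2}\|y-\widetilde{y}\|^{2}\\ {\rm s.t.}&y\in\Omega_{1},\end{array}\qquad\text{and}\qquad\begin{array}{cl}\min\limits_{y=vec(y_{1}\cdots y_{N})\in{\rm I\!R}^{Nn}}&\frac{1}{2}\|y-\widetilde{y}\|^{2}\\ {\rm s.t.}&y\in\Omega_{2}.\end{array}$$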

However, these problems are in general difficult to solve globally. Fortunately, as mentioned in Section 3.1, we can obtain some point in the set of pseudo-projection efficiently, under mild assumptions. Thus, we adopt the following alternating pseudo-projection method for post-processing: start at some $x^{0}\in\Omega_{2}$ and $z^{0}\in\Omega_{1}$ , let

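$$z^{t+1}\in{\mathcal{P}}_{\Omega_{1}}^{s}(x^{t};z^{t})\quad\text{and}\quad x^{t+1}\in{\mathcal{P}}_{\Omega_{2}}^{s}(z^{t+1};x^{t}),\quad t=0,1,\ldots$$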
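The alternating scheme can be illustrated on toy sets where exact projections are available in closed form; since a projection is always an element of the pseudo-projection, alternating exact projections is one admissible instance of the iteration. The sets below (a unit circle playing the role of $\Omega_{1}$ and a horizontal line playing the role of $\Omega_{2}$) are assumptions chosen for the example and have nothing to do with Hankel matrices; computing pseudo-projections for the actual rank-constrained sets requires a solver such as SLRA.

```python
import numpy as np

def project_circle(x):
    """Projection onto the unit circle (assumes x != 0)."""
    return x / np.linalg.norm(x)

def project_line(z, height=0.5):
    """Projection onto the horizontal line {(a, b) : b = height}."""
    return np.array([z[0], height])

x = np.array([2.0, 0.5])        # x^0 in the toy Omega_2 (the line)
for t in range(30):
    z = project_circle(x)       # z^{t+1}: projection of x^t onto the toy Omega_1
    x = project_line(z)         # x^{t+1}: projection of z^{t+1} onto the toy Omega_2

gap = np.linalg.norm(x - z)     # distance between the two iterates
```

For this toy pair the iterates approach the intersection point $(\sqrt{0.75},\,0.5)$, and `gap` measures the remaining violation of the constraint that is not enforced exactly by the last update.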

3.3 Hybrid penalty method for (1) and convergence analysis

The hybrid penalty method for solving (1), which consists of the penalty method discussed in Section 3.1 and the post-processing method discussed in Section 3.2, is presented as Algorithm 2.

For the rest of the section, we will analyze the convergence of the hybrid penalty method, including the convergence analysis for the penalty method in Section 3.3.2 and the convergence rate for the post-processing method in Section 3.3.3. Before proceeding, we first show that the criteria (7) and (12) are well-defined.

3.3.1 Well-definedness of (7) and (12)

The following theorem is about the well-definedness of the line-search criterion (7) and the termination criterion (12), i.e., they can be satisfied after finitely many inner iterations. The proof is similar to that in [8, Proposition 1].

Theorem 3.1.

The line-search criterion (7) is well-defined. Moreover, $\{\widebar{L}_{l}\}$ is bounded. Furthermore, the termination criterion (12) for Algorithm 1 is well-defined.

Proof 3.2.

We start by discussing the line-search criterion. First, we observe from (6) and Definition 2.2 that

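$$\left\|u_{i}^{l}-\left(y^{l}-\frac{1}{L_{l,i}}(\nabla h(y^{l})-\xi^{l})\right)\right\|^{2}\leq\left\|y^{l}-\left(y^{l}-\frac{1}{L_{l,i}}(\nabla h(y^{l})-\xi^{l})\right)\right\|^{2},$$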

which is equivalent to

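$$\langle\nabla h(y^{l})-\xi^{l},\ u_{i}^{l}-y^{l}\rangle\leq-\frac{L_{l,i}}{2}\|u_{i}^{l}-y^{l}\|^{2}.\tag{14}$$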

Next, recall from the definition of $\xi^{l}$ and [8, Equation 7] that

[TABLE]

Using (14) and (15) together with $u_{i}^{l}\in\Omega$ , the $L$ -smoothness of $h$ and the convexity of $g$ gives (here, we let $L$ denote the Lipschitz continuity modulus of $\nabla h$ ):

[TABLE]

Thus, we see that (7) is satisfied whenever $L_{l,i}\geq L+c$. From the definition of $L_{l,i}$, this latter inequality must hold when $i$ satisfies $\tau^{i}L_{\min}\geq L+c$, implying that the line-search criterion (7) is well-defined. Now, the boundedness of $\{\bar{L}_{l}\}$ can be argued as in [8, Proposition 1].
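The finite-termination argument amounts to simple arithmetic: once $\tau^{i}L_{\min}\geq L+c$, the trial constant is large enough. The sketch below computes the smallest such $i$ for hypothetical values of $L$, $c$, $L_{\min}$ and $\tau$; these numbers are assumptions for illustration and are not the parameter settings used in the paper.

```python
def backtracking_trials(L, c, L_min, tau):
    """Smallest i >= 0 with tau**i * L_min >= L + c (requires tau > 1 and L_min > 0)."""
    i = 0
    while tau ** i * L_min < L + c:
        i += 1
    return i

# Hypothetical values: Lipschitz modulus 50, c = 1e-4, L_min = 1e-8, growth factor 2.
i_max = backtracking_trials(L=50.0, c=1e-4, L_min=1e-8, tau=2.0)
```

The returned index is a worst-case bound on the number of increases of the trial constant within one outer iteration, since the criterion may already hold for a smaller trial value.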

Next, let $\{y^{l}\}$ be generated by Algorithm 1 starting at a $y^{t,0}$ in Step 2 of Algorithm 2. We show that the termination criteria (12) hold after finitely many iterations in Algorithm 1 (with $y^{l}$ in place of $y^{t,l_{t}}$ and $y^{l+1}$ in place of $y^{t,l_{t}+1}$ in (12)). First, from (7), it is easy to see that the second inequality in (12) holds. Moreover, using a similar line of arguments as in [24, Lemma 4], we can show that

[TABLE]

Thus, the first inequality in (12) also holds after a finite number of iterations in Algorithm 1. Finally, we note from (6) and Definition 2.2 that

[TABLE]

Using this together with the definition of $h$ in (5), we further obtain

[TABLE]

Combining this relation with (15) gives

[TABLE]

This inequality together with (16) and the boundedness of $\{\bar{L}_{l}\}$ shows that the third inequality in (12) holds after a finite number of iterations. This completes the proof.

3.3.2 Convergence analysis for the penalty method in Algorithm 2

Notice that when $\widebar{\lambda}=0$ , the penalty method in Algorithm 2 is exactly the same as [8, Algorithm 1]. Thus, we know from [8, Theorem 2] that the sequence $\{y^{t}\}$ is bounded and that, under some constraint qualifications, any accumulation point of sequence $\{y^{t}\}$ is a stationary point of (3).

We next estimate the violation of the constraints for the solution given by the penalty method in Algorithm 2 in the following theorem. It implies that the constraint violation can be suppressed by terminating the algorithm at a small $\lambda_{t}$ .

Theorem 3.3.

Let $\{y^{t}\}$ be the sequence generated by the penalty method in Algorithm 2 for solving (3). Then, for $t\geq 1$ and $i=1,\ldots,k$, we have

[TABLE]

Proof 3.4.

Note from the nonnegativity of $f$ , the definition of $y^{t}$ , the second inequality in (12) and the choice of $y^{t,0}$ and $y^{\rm feas}$ that for $i=1,\ldots,k$ ,

[TABLE]

This completes the proof.

3.3.3 Convergence analysis of the post-processing method in Algorithm 2

First, we present the following theorem which will be used later for the convergence analysis of the post-processing method in Algorithm 2.

Theorem 3.5.

Let $\Omega_{2}$ be defined as in (8). Then $\Omega_{2}$ is prox-regular at any $\widebar{y}\in\Omega_{2}$ that satisfies ${\rm rank}({\mathcal{L}}(\widebar{y}))=r$.

Proof 3.6.

First, we can rewrite $\Omega_{2}$ as

[TABLE]

By [19, Corollary 2.3], we see that $\Omega_{2}$ is prox-regular at $\widebar{y}\in\Omega_{2}$ if the following conditions hold:

(a) there is no $z\neq 0$ in $N_{C}({\mathcal{L}}(\widebar{y}))$ with ${\mathcal{L}}^{*}z=0$;

(b) for every $\widebar{v}\in N_{\Omega_{2}}(\widebar{y})$, the set $C$ is prox-regular at ${\mathcal{L}}(\widebar{y})$ for every $z\in N_{C}({\mathcal{L}}(\widebar{y}))$ with ${\mathcal{L}}^{*}z=\widebar{v}$.

We will prove that the above two statements hold. First, we prove (a). Using ${\rm rank}({\mathcal{L}}(\widebar{y}))=r$ and noting that by assumption, we have $r\leq\frac{n-1}{2}$ and hence $N(n-r)\geq r+1$ , we see from [9, Proposition 3.6] that

[TABLE]

On the other hand, we see from Lemma 2.4 that for any $W=[W_{1}\ W_{2}\ \cdots\ W_{N}]$ with $W_{\ell}\in{\rm I\!R}^{(r+1)\times(n-r)}$ ( $\ell=1,\ldots,N$ ), we have

[TABLE]

Suppose that there exists some $\widehat{W}=[\widehat{W}_{1}\ \cdots\ \widehat{W}_{N}]\in N_{C}({\mathcal{L}}(\widebar{y}))\cap{\rm ker}({\mathcal{L}}^{*})$ with $\widehat{W}_{\ell}\in{\rm I\!R}^{(r+1)\times(n-r)}$ ( $\ell=1,\ldots,N$ ). We then know from (17) and (18) that

[TABLE]

Now we fix any $\ell$ . Note from (19) and Lemma 2.3 that

[TABLE]

We claim that $\widehat{W}_{\ell}=0$ . To prove this, we establish the following equivalent statement: for each $k=1,\ldots,n$ , all elements in the following set equal 0:

[TABLE]

First, it is easy to see from the equality in (20) that all elements in $S_{1}$ and $S_{n}$ are zero. Now we prove that every element in $S_{k}$ is zero by induction for each $k=1,2,\ldots,n-1$ .

Suppose that there exists some $K\geq 1$ so that every element in $\bigcup_{k=1}^{K}S_{k}$ is zero. Let $\widehat{W}_{\ell}(\widebar{i},\widebar{j})$ and $\widehat{W}_{{\ell}}(\widehat{i},\widehat{j})$ be any two elements in $S_{K+1}$ with $\widebar{i}<\widehat{i}$. We then know from the first inequality in (20) that the $2\times 2$ submatrix formed by $\widehat{W}_{\ell}(\widebar{i},\widehat{j})$, $\widehat{W}_{\ell}(\widebar{i},\widebar{j})$, $\widehat{W}_{\ell}(\widehat{i},\widehat{j})$ and $\widehat{W}_{\ell}(\widehat{i},\widebar{j})$ is singular. Since $\widebar{i}+\widehat{j}<\widehat{i}+\widehat{j}=K+2$, we conclude that $\widehat{W}_{\ell}(\widebar{i},\widehat{j})=0$ by the induction hypothesis. Consequently, there is at least one 0 in $\{\widehat{W}_{\ell}(\widebar{i},\widebar{j}),\widehat{W}_{{\ell}}(\widehat{i},\widehat{j})\}$. By the arbitrariness of these two elements in $S_{K+1}$, we see that there is at most one nonzero element in $S_{K+1}$. This together with the equality in (20) implies that every element in $S_{K+1}$ equals 0. Thus, we have $\widehat{W}_{\ell}=0$ by induction. Since $\ell$ is arbitrary, we see further that $\widehat{W}=0$. This proves that $N_{C}({\mathcal{L}}(\widebar{y}))\cap{\rm ker}({\mathcal{L}}^{*})=\{0\}$, which is equivalent to statement (a).

Now we prove (b). Using ${\rm rank}({\mathcal{L}}(\widebar{y}))=r$, we know from [9, Proposition 3.8] that $C$ is prox-regular at ${\mathcal{L}}(\widebar{y})$. Then by the definition of prox-regularity, we see that (b) holds. This completes the proof.

Since (13) involves the pseudo-projection instead of the actual projection, the post-processing method in Algorithm 2 is different from the classical alternating projection method. Nevertheless, we can still show that the post-processing method in Algorithm 2 has local linear convergence under commonly used assumptions for establishing local linear convergence of the alternating projection method (see, for example, the assumptions used in [7, Theorem 5.16] and [9, Theorem 4.2]). The proof follows the same line of arguments as in [7, Theorem 5.2]. We include the proof in the Appendix for the convenience of the readers.

Theorem 3.7.

Let $\Omega_{1}$ and $\Omega_{2}$ be defined as in (8) and suppose that there exists some $\widebar{y}\in\Omega_{1}\cap\Omega_{2}$ such that ${\rm rank}({\mathcal{L}}(\widebar{y}))=r$ and $N_{\Omega_{1}}(\widebar{y})\cap-N_{\Omega_{2}}(\widebar{y})=\{0\}$. Then for any initial points $x^{0}\in\Omega_{2}$ and $z^{0}\in\Omega_{1}$ near $\widebar{y}$, any sequence generated by the following iterations converges $R$-linearly to a point in $\Omega_{1}\cap\Omega_{2}$:

[TABLE]
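As an informal illustration of the kind of local linear behaviour described in Theorem 3.7, the following Python sketch runs the classical alternating projection method (a Cadzow-type iteration) between the set of matrices of rank at most $2$ and the subspace of Hankel matrices. It is only a stand-in: it uses exact projections rather than pseudo-projections, the two sets are not the $\Omega_{1}$ and $\Omega_{2}$ of (8), and the test signal, the sizes and the helper names `project_rank` and `project_hankel` are assumptions made for the example.

```python
import numpy as np

def hankel(y, rows):
    """rows x (len(y) - rows + 1) Hankel matrix with entries y[i + j]."""
    cols = len(y) - rows + 1
    return np.array([[y[i + j] for j in range(cols)] for i in range(rows)])

def project_rank(X, r):
    """Nearest matrix of rank <= r in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def project_hankel(X):
    """Nearest Hankel matrix: average the entries on each anti-diagonal."""
    rows, cols = X.shape
    sums = np.zeros(rows + cols - 1)
    counts = np.zeros(rows + cols - 1)
    for i in range(rows):
        for j in range(cols):
            sums[i + j] += X[i, j]
            counts[i + j] += 1
    return hankel(sums / counts, rows)

rng = np.random.default_rng(0)
k = np.arange(30)
y = np.cos(0.4 * k) + 0.05 * rng.standard_normal(30)  # noisy order-2 signal
X = hankel(y, 3)                                      # start in the Hankel set
gaps = []
for _ in range(50):
    Z = project_rank(X, 2)      # z-update: project onto {rank <= 2}
    X = project_hankel(Z)       # x-update: project onto Hankel matrices
    gaps.append(np.linalg.norm(X - Z))
```

For exact projections the gaps $\|x^{t}-z^{t}\|$ stored in `gaps` are nonincreasing; plotting them on a logarithmic scale is a quick way to inspect the empirical rate.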

4 Subproblem: pseudo-projection

In this section, we consider the pseudo-projection subproblems (6) in Algorithm 1 and (13) in Algorithm 2. Recall that their corresponding projection problems can be put in the following general form:

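$$\min_{y\in{\rm I\!R}^{d}}\ \frac{1}{2}\|y-\widehat{y}\|^{2}\quad{\rm s.t.}\quad{\rm rank}({\mathcal{A}}(y))\leq m; \tag{22}$$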

here, ${\mathcal{A}}(y)\in{\rm I\!R}^{p\times q}$ , and $d$ , $m$ , $p$ , $q$ and ${\mathcal{A}}$ are given as in (23) or (24) below, corresponding to (9) and (10) respectively:

[TABLE]

The pseudo-projection problem corresponding to (22) can now be stated as follows: given $\widehat{y}\in{\rm I\!R}^{d}$ and some reference point $y_{b}\in{\rm I\!R}^{d}$ satisfying ${\rm rank}({\mathcal{A}}(y_{b}))\leq m$ , compute

[TABLE]

In what follows, we will describe how such a $y_{s}$ can be obtained by the solver SLRA in [15]. Recall that SLRA was developed based on the following key observation:

[TABLE]

In view of this, algorithms were developed in [15] to approximately solve the following equivalent formulation of (22):

$$\min_{R\in{\rm I\!R}^{(p-m)\times p}}\ \Psi(R)\quad{\rm s.t.}\quad RR^{\top}=I_{p-m}, \tag{25}$$

where

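$$\Psi(R):=\inf_{y\in{\rm I\!R}^{d}}\left\{\frac{1}{2}\|y-\widehat{y}\|^{2}:\;R{\mathcal{A}}(y)=0\right\}. \tag{26}$$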

Notice that under the settings in (23) or (24), we have $p-m=1$ and hence (25) is an optimization problem in ${\rm I\!R}^{1\times p}$ and the feasible set reduces to $\{R\in{\rm I\!R}^{1\times p}:\;RR^{T}=1\}$ . We will show below in Section 4.1 that $\Psi$ in (26) is smooth on ${\rm I\!R}^{1\times p}\backslash\{0\}$ . Thus, when gradient-based optimization methods such as those described in [15] are applied to solving (25), one obtains a stationary point of the following function:

$$\widetilde{\Psi}(R):=\Psi(R)+\delta_{\{R:\;RR^{\top}=1\}}(R). \tag{27}$$

We will then discuss in Section 4.2 how an element of ${\mathcal{P}}^{s}_{\{y:\;{\rm rank}({\mathcal{A}}(y))\leq m\}}(\widehat{y};y_{b})$ can be obtained from such a stationary point under mild assumptions.
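To make the role of $\Psi$ concrete in the single-Hankel setting, the sketch below evaluates it numerically. It assumes that $\Psi(R)$ is the optimal value of minimizing $\frac{1}{2}\|y-\widehat{y}\|^{2}$ subject to $R{\mathcal{A}}(y)=0$, which is what the bound $\Psi(R^{0})\leq\frac{1}{2}\|y_{b}-\widehat{y}\|^{2}$ in the proof of Theorem 4.7 suggests, and that ${\mathcal{A}}(y)$ is the Hankel matrix with $p$ rows. The function names and the random data are ours; this is not how SLRA evaluates its cost function.

```python
import numpy as np

def hankel(y, rows):
    """rows x (len(y) - rows + 1) Hankel matrix with entries y[i + j]."""
    cols = len(y) - rows + 1
    return np.array([[y[i + j] for j in range(cols)] for i in range(rows)])

def constraint_matrix(R, n):
    """Matrix C with C @ y == R @ hankel(y, len(R)) for every y in R^n."""
    p = len(R)
    C = np.zeros((n - p + 1, n))
    for j in range(n - p + 1):
        C[j, j:j + p] = R
    return C

def psi(R, y_hat):
    """Return (Psi(R), y_R): value and minimizer of the inner problem."""
    C = constraint_matrix(R, len(y_hat))
    # y_R is the orthogonal projection of y_hat onto the null space of C;
    # C @ C.T is invertible for R != 0 because C has full row rank.
    correction = C.T @ np.linalg.solve(C @ C.T, C @ y_hat)
    return 0.5 * np.sum(correction ** 2), y_hat - correction

rng = np.random.default_rng(0)
y_hat = rng.standard_normal(12)
R = rng.standard_normal(3)                 # p = 3, hence m = 2
value, y_R = psi(R, y_hat)
residual = np.abs(R @ hankel(y_R, 3)).max()  # should be at rounding level
```

Since the constraint $R{\mathcal{A}}(y)=0$ is unchanged when $R$ is rescaled, `psi(2 * R, y_hat)` returns the same value, which is why a normalization such as $RR^{\top}=1$ can be imposed without loss of generality.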

4.1 Smoothness of $\Psi$

In this subsection, we will prove that $\Psi$ is smooth on ${\rm I\!R}^{1\times p}\backslash\{0\}$ . We start with an auxiliary lemma.

Lemma 4.1.

Consider (22) with setting (23) or (24). For any $U\in{\rm I\!R}^{1\times q}$ and any $R\in{\rm I\!R}^{1\times p}\backslash\{0\}$, if ${\mathcal{A}}^{*}({R}^{\top}U)=0$, then $U=0$.

Proof 4.2.

Assume that $U\in{\rm I\!R}^{1\times q}$ and $R\in{\rm I\!R}^{1\times p}\backslash\{0\}$ satisfy ${\mathcal{A}}^{*}({R}^{\top}U)=0$ . We need to show that $U=0$ .

We first consider (22) with setting (23). In this case, we have $m=r_{i}$ , $p=r_{i}+1$ , $q=n-r_{i}$ and ${\mathcal{A}}(y)={\mathcal{H}}_{r_{i}+1}(y)$ . Notice that $R^{\top}\in{\rm I\!R}^{p\times 1}={\rm I\!R}^{r_{i}+1}$ and $U^{\top}\in{\rm I\!R}^{q\times 1}={\rm I\!R}^{n-r_{i}}$ . Write

[TABLE]

and $W=R^{\top}U$. Using Lemma 2.3, we obtain

[TABLE]

Since ${\mathcal{A}}^{*}({R}^{\top}U)=0$ , to show that $U=0$ , it suffices to show that the $\widehat{R}\in{\rm I\!R}^{n\times(n-r_{i})}$ above has full column rank. To this end, we first note from $R\in{\rm I\!R}^{1\times(r_{i}+1)}\backslash\{0\}$ that there is at least one nonzero element in $R$ . Let $\widebar{i}$ be the first integer in $1,\ldots,r_{i}+1$ with $R(\widebar{i})\neq 0$ . Then the $(n-r_{i})\times(n-r_{i})$ submatrix of $\widehat{R}$ starting from the $\widebar{i}$ th row is lower triangular with all diagonal entries being $R(\widebar{i})\neq 0$ . Consequently, this submatrix is nonsingular and thus $\widehat{R}$ has full column rank. This completes the proof for this case.

Now we consider (22) with setting (24). In this case, we have $m=r$, $p=r+1$, $q=N(n-r)$ and ${\mathcal{A}}(y)={\mathcal{L}}(y)=[{\mathcal{H}}_{r+1}(y_{1})\cdots{\mathcal{H}}_{r+1}(y_{N})]$ with $y={\rm vec}(y_{1}\cdots y_{N})$. Notice that $R^{\top}\in{\rm I\!R}^{p\times 1}={\rm I\!R}^{r+1}$ and $U^{\top}\in{\rm I\!R}^{q\times 1}={\rm I\!R}^{N(n-r)}$. Write

[TABLE]

where $U_{i}^{\top}\in{\rm I\!R}^{n-r}$ ( $i=1,\ldots,N$ ). We then see from Lemma 2.4 that

[TABLE]

Similar to the proof in setting (23), we can write the $k$th block of ${\mathcal{A}}^{*}(R^{\top}U)$ as

[TABLE]

Consequently, we have

[TABLE]

Since ${\mathcal{A}}^{*}({R}^{\top}U)=0$, to prove that $U=0$, we only need to show that the block diagonal matrix on the right-hand side of (28) has full column rank. But then it suffices to show that $\widebar{R}$ has full column rank, and this latter claim can be established by following a similar line of arguments as in the proof for setting (23). This completes the proof.
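The argument for setting (23) can be checked numerically. Under the Frobenius inner product, the adjoint of the Hankel map sums a matrix over its anti-diagonals, so ${\mathcal{A}}^{*}(R^{\top}U)$ is the discrete convolution of $R$ and $U$, and the matrix $\widehat{R}$ is the corresponding banded convolution matrix. The sketch below uses small sizes and random data of our choosing, with a deliberately zero leading entry in $R$ to mimic the case $\widebar{i}>1$ in the proof.

```python
import numpy as np

def hankel_adjoint(W, n):
    """Adjoint of y -> H(y): sum W over each anti-diagonal i + j = k."""
    out = np.zeros(n)
    rows, cols = W.shape
    for i in range(rows):
        for j in range(cols):
            out[i + j] += W[i, j]
    return out

def conv_matrix(R, q):
    """(len(R) + q - 1) x q matrix R_hat with R_hat @ u == np.convolve(R, u)."""
    p = len(R)
    R_hat = np.zeros((p + q - 1, q))
    for j in range(q):
        R_hat[j:j + p, j] = R
    return R_hat

n, r = 10, 3
rng = np.random.default_rng(1)
R = rng.standard_normal(r + 1)
R[0] = 0.0                       # first nonzero entry of R is R[1]
U = rng.standard_normal(n - r)

lhs = hankel_adjoint(np.outer(R, U), n)   # A^*(R^T U)
R_hat = conv_matrix(R, n - r)             # n x (n - r)
rhs = R_hat @ U
rank = np.linalg.matrix_rank(R_hat)       # full column rank: n - r
```

Because `R_hat` has full column rank, `R_hat @ U == 0` forces `U == 0`, which is the statement of Lemma 4.1 in this setting.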

Theorem 4.3.

Consider (22) with setting (23) or (24). Then the function $\Psi$ defined in (26) is smooth on ${\rm I\!R}^{1\times p}\backslash\{0\}$.

Proof 4.4.

In view of [22, Equation 5] and recall that $p-m=1$ (in both cases (23) and (24)), we only need to show that for any $R\in{\rm I\!R}^{1\times p}\backslash\{0\}$ , the linear map $G_{R}:{\rm I\!R}^{d}\longrightarrow{\rm I\!R}^{q}$ defined as $G_{R}(y):=(R{\mathcal{A}}(y))^{\top}$ is surjective, or equivalently, $G_{R}^{*}$ is injective. To proceed, fix any $R\in{\rm I\!R}^{1\times p}\backslash\{0\}$ and consider any $z\in{\rm I\!R}^{q}$ with $G_{R}^{*}(z)=0$ . Then we have for any $y\in{\rm I\!R}^{d}$ that

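$$0=\langle G_{R}^{*}(z),y\rangle=\langle z,(R{\mathcal{A}}(y))^{\top}\rangle=\langle R^{\top}z^{\top},{\mathcal{A}}(y)\rangle=\langle{\mathcal{A}}^{*}(R^{\top}z^{\top}),y\rangle.$$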

Thus we have ${\mathcal{A}}^{*}(R^{\top}z^{\top})=0$, which together with Lemma 4.1 implies that $z=0$. This completes the proof.

Since $\Psi$ is smooth on ${\rm I\!R}^{1\times p}\backslash\{0\}$ , we can then apply standard gradient-based optimization methods to solving (25) and obtain a stationary point of $\widetilde{\Psi}$ in (27). We next discuss how one can obtain a pseudo-projection from such a stationary point.

4.2 Stationarity and improvement of function value

We discuss in this subsection how to obtain a pseudo-projection from a suitable stationary point $R^{*}$ of $\widetilde{\Psi}$ in (27), under mild assumptions. We start by showing how one can construct from $R^{*}$ a point satisfying the stationarity condition in Definition 2.2.

Theorem 4.5.

Consider (22) with setting (23) or (24). Let $R^{*}$ be a stationary point of $\widetilde{\Psi}$ in (27) and let $y^{*}$ achieve the infimum in (26) when $R=R^{*}$ . Then

[TABLE]

If in addition ${\rm rank}({\mathcal{A}}(y^{*}))=m$ , then we have

[TABLE]

Proof 4.6.

First, we define

[TABLE]

Then we see from (27) and the definition of $y^{*}$ that

[TABLE]

On the other hand, we also have from the stationarity of $R^{*}$ that $0\in\partial\widetilde{\Psi}(R^{*})=\partial\left(\Psi+\delta_{\{R:\;RR^{\top}=1\}}\right)(R^{*})$ . Using this, (32) and [20, Theorem 10.13], we see further that

[TABLE]

Next, notice from Lemma 4.1 that for any $U\in{\rm I\!R}^{1\times q}$ , $y\in{\rm I\!R}^{d}$ , $\lambda\in{\rm I\!R}$ and $R\in{\rm I\!R}^{1\times p}\backslash\{0\}$ , the following implication holds:

[TABLE]

This corresponds to the linear independence constraint qualification for the following optimization problem:

[TABLE]

Using this, the definition of $\Phi$ in (31), (33) and [20, Example 10.8], we deduce that there exist $V^{*}\in{\rm I\!R}^{1\times q}$ and a scalar $\lambda^{*}$ such that the following Karush-Kuhn-Tucker conditions hold:

[TABLE]

Multiplying both sides of the second equation in (34) from the right by ${R^{*}}^{\top}$ , and using the two equations in (35), we obtain $\lambda^{*}=0$ and thus

[TABLE]

We now show that

[TABLE]

To proceed, recall that $R^{*}\in{\rm I\!R}^{1\times p}$ , which implies ${\rm rank}({R^{*}}^{\top}V^{*})\leq 1$ . According to [9, Proposition 3.6], in order to establish (37), it now remains to show that

[TABLE]

To this end, take any $z\in[{\rm ker}({R^{*}}^{\top}V^{*})]^{\perp}\cap[{\rm ker}({\mathcal{A}}(y^{*}))]^{\perp}$ . Then we have in particular that $z\in[{\rm ker}({R^{*}}^{\top}V^{*})]^{\perp}={\rm Range}({V^{*}}^{\top}R^{*})$ . This together with (36) implies that ${\mathcal{A}}(y^{*})z\in{\mathcal{A}}(y^{*}){\rm Range}({V^{*}}^{\top}R^{*})=\{0\}$ . Thus, we must have $z\in{\rm ker}\left({\mathcal{A}}(y^{*})\right)\cap\left[{\rm ker}\left({\mathcal{A}}(y^{*})\right)\right]^{\perp}$ and consequently $z=0$ . This proves (38) and hence (37). The desired relation (29) now follows immediately from (34) and (37).

Suppose in addition that ${\rm rank}({\mathcal{A}}(y^{*}))=m$ . Then we have

[TABLE]

where (a) follows from [9, Proposition 3.6] and the fact that proximal normal vectors are regular normal vectors [20, Example 6.16], (b) follows from [20, Theorem 10.6] and (c) follows from [20, Proposition 6.5]. This together with (29) proves (30). This completes the proof.

We next show that if the stationary point $R^{*}$ of $\widetilde{\Psi}$ in (27) is obtained via a gradient-based descent optimization method with a suitably chosen initial point, then the $y^{*}$ that attains the infimum in (26) will satisfy the condition on function value improvement in Definition 2.2.

Theorem 4.7.

Consider (22) with setting (23) or (24). Let $y_{b}\in{\rm I\!R}^{d}$ satisfy ${\rm rank}\left({\mathcal{A}}(y_{b})\right)\leq m$ and let $R^{0}\in{\rm I\!R}^{1\times p}\backslash\{0\}$ satisfy $R^{0}{\mathcal{A}}(y_{b})=0$ . Then for any $\widetilde{R}\in{\rm I\!R}^{1\times p}\backslash\{0\}$ with $\Psi(\widetilde{R})\leq\Psi(R^{0})$ , we have

[TABLE]

where $y_{\widetilde{R}}$ attains the infimum in (26) when $R=\widetilde{R}$.

Proof 4.8.

First, we see from $R^{0}{\mathcal{A}}(y_{b})=0$ and the definition of $\Psi$ in (26) that $\Psi(R^{0})\leq\frac{1}{2}\|y_{b}-\widehat{y}\|^{2}$ . This together with the assumption $\Psi(\widetilde{R})\leq\Psi(R^{0})$ and the fact that $y_{\widetilde{R}}$ attains the infimum in (26) when $R=\widetilde{R}$ shows that

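$$\frac{1}{2}\|y_{\widetilde{R}}-\widehat{y}\|^{2}=\Psi(\widetilde{R})\leq\Psi(R^{0})\leq\frac{1}{2}\|y_{b}-\widehat{y}\|^{2}.$$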

This completes the proof.

Remark 4.9 (Obtaining pseudo-projection in cases (23) or (24)).

Let $y_{b}\in{\rm I\!R}^{d}$ satisfy ${\rm rank}({\mathcal{A}}(y_{b}))\leq m$ and let $R^{0}\in{\rm I\!R}^{1\times p}\backslash\{0\}$ satisfy $R^{0}{\mathcal{A}}(y_{b})=0$. Then one can apply some standard gradient-based descent methods such as those implemented in SLRA [15] for solving (25) with $R^{0}$ as the initialization: these methods typically generate a sequence $\{R^{k}\}$ so that any accumulation point, say $R^{*}$, is stationary for $\widetilde{\Psi}$ in (27) and satisfies $\Psi(R^{*})\leq\Psi(R^{0})$. Suppose $y_{R^{*}}$ achieves the infimum in (26) when $R=R^{*}$. Then we know from (30) in Theorem 4.5 and (39) in Theorem 4.7 that if ${\rm rank}({\mathcal{A}}(y_{R^{*}}))=m$ holds, then $y_{R^{*}}\in{\mathcal{P}}^{s}_{{\rm rank}({\mathcal{A}}(y))\leq m}(\widehat{y};y_{b})$.
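One way to produce an initialization $R^{0}$ with $R^{0}{\mathcal{A}}(y_{b})=0$ in the single-Hankel setting is to take a unit left singular vector of ${\mathcal{A}}(y_{b})$ associated with its smallest singular value, which is zero when ${\rm rank}({\mathcal{A}}(y_{b}))\leq m=p-1$. The sketch below does this and checks the inequality $\Psi(R^{0})\leq\frac{1}{2}\|y_{b}-\widehat{y}\|^{2}$ from the proof of Theorem 4.7. The reference signal, the noise level and the helper `psi` (which solves the inner problem by an explicit projection, under the same reading of $\Psi$ as the value of that constrained least-squares problem) are assumptions for illustration and do not reflect how SLRA is called.

```python
import numpy as np

def hankel(y, rows):
    """rows x (len(y) - rows + 1) Hankel matrix with entries y[i + j]."""
    cols = len(y) - rows + 1
    return np.array([[y[i + j] for j in range(cols)] for i in range(rows)])

def psi(R, y_hat):
    """Value and minimizer of min 0.5*||y - y_hat||^2 s.t. R @ H(y) = 0."""
    n, p = len(y_hat), len(R)
    C = np.zeros((n - p + 1, n))
    for j in range(n - p + 1):
        C[j, j:j + p] = R
    correction = C.T @ np.linalg.solve(C @ C.T, C @ y_hat)
    return 0.5 * np.sum(correction ** 2), y_hat - correction

n, m = 15, 2
k = np.arange(n)
y_b = 1.5 * 0.9 ** k - 0.7 * (-0.8) ** k     # two modes: rank(H_3(y_b)) <= 2
rng = np.random.default_rng(2)
y_hat = y_b + 0.1 * rng.standard_normal(n)

# R0: unit left singular vector for the smallest singular value of A(y_b)
U, s, _ = np.linalg.svd(hankel(y_b, m + 1))
R0 = U[:, -1]
psi0, y_R0 = psi(R0, y_hat)
bound = 0.5 * np.sum((y_b - y_hat) ** 2)      # psi0 should not exceed this
```

Any descent method started from `R0` can only decrease `psi0`, so the point it returns keeps the function-value improvement relative to `y_b`.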

4.3 Conjecture related to Theorem 4.5

In this subsection, we revisit the assumption ${\rm rank}({\mathcal{A}}(y^{*}))=m$ in Theorem 4.5. We would like to understand how likely such a condition is fulfilled by the $y^{*}$ that achieves the infimum in (26), with $R=R^{*}$ being a stationary point of $\widetilde{\Psi}$ in (27). Notice that if $R^{*}$ is indeed an optimal solution of $\widetilde{\Psi}$ , such a $y^{*}$ is an optimal solution of (22). Thus, we will first study whether ${\rm rank}({\mathcal{A}}(y^{*}))=m$ when $y^{*}$ is an optimal solution of (22). Specifically, we make the following conjecture:

Conjecture 4.10.

Let $s$ be a positive integer. Suppose that $\widehat{y}\in{\rm I\!R}^{n}$ satisfies the condition ${\rm rank}\left({\mathcal{H}}_{s+1}(\widehat{y})\right)=s+1$ and let $y^{*}$ solve the following optimization problem:

[TABLE]

Then we have ${\rm rank}\left({\mathcal{H}}_{s+1}(y^{*})\right)=s$.

We do not know whether Conjecture 4.10 holds true for all positive integers $s$. However, we are able to prove that it holds true when $s=1$.

Proposition 4.11.

Conjecture 4.10 holds true when $s=1$.

Proof 4.12.

Since $s=1$ , we only need to show that there exists $\widebar{y}\in{\rm I\!R}^{n}$ with ${\rm rank}\left({\mathcal{H}}_{2}(\widebar{y})\right)=1$ and $\|\widebar{y}-\widehat{y}\|^{2}<\|\widehat{y}\|^{2}$ . First of all, since ${\rm rank}({\mathcal{H}}_{2}(\widehat{y}))=2$ , we must have $n\geq 3$ . We consider two cases:

[TABLE]

For case (i), we let $\widebar{y}=[\widehat{y}(1)\ 0\cdots 0]^{\top}$ when $\widehat{y}(1)\neq 0$ , and $\widebar{y}=[0\cdots 0\ \widehat{y}(n)]^{\top}$ when $\widehat{y}(n)\neq 0$ . Then ${\rm rank}\left({\mathcal{H}}_{2}(\widebar{y})\right)=1$ and

[TABLE]

Now we consider case (ii). Notice that there exists at least one nonzero element in $\{\widehat{y}(2),\cdots,\widehat{y}(n-1)\}$ because ${\rm rank}\left({\mathcal{H}}_{2}(\widehat{y})\right)=2$. Hence, there are at most $n-2$ distinct real roots for the polynomial equation $\sum_{i=2}^{n-1}\widehat{y}(i)(z)^{i-1}=0$. Let $\widebar{z}\neq 0$ be a real number different from these roots. Then we have $\sum_{i=0}^{n-1}(\widebar{z})^{2i}>0$. Let

[TABLE]

Then $\widebar{c}\neq 0$ and ${\rm rank}\left({\mathcal{H}}_{2}(\widebar{y})\right)=1$ . Consequently,

[TABLE]

This completes the proof.
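The idea behind the case $\widehat{y}(1)=\widehat{y}(n)=0$ can be illustrated numerically: a vector whose $2$-row Hankel matrix has rank one is a nonzero multiple of a geometric sequence $(1,z,\ldots,z^{n-1})$, and fitting the best multiple of such a sequence to $\widehat{y}$ strictly decreases the distance compared with the zero vector whenever the inner product is nonzero. In the sketch below, the least-squares choice of the scalar `c_bar`, the value `z_bar = 0.5` and the random data are our assumptions; the construction in the proof may differ in detail.

```python
import numpy as np

def hankel(y, rows):
    """rows x (len(y) - rows + 1) Hankel matrix with entries y[i + j]."""
    cols = len(y) - rows + 1
    return np.array([[y[i + j] for j in range(cols)] for i in range(rows)])

n = 8
rng = np.random.default_rng(3)
y_hat = rng.standard_normal(n)
y_hat[0] = y_hat[-1] = 0.0            # first and last entries vanish

z_bar = 0.5
v = z_bar ** np.arange(n)             # (1, z, z^2, ..., z^(n-1))
while abs(y_hat @ v) < 1e-6:          # z_bar is (numerically) a root: move it
    z_bar += 0.1
    v = z_bar ** np.arange(n)

inner = y_hat @ v                     # equals sum_{i=2}^{n-1} y_hat(i) z^(i-1)
c_bar = inner / (v @ v)               # least-squares scalar
y_bar = c_bar * v                     # rank(H_2(y_bar)) = 1

dist_sq = np.sum((y_bar - y_hat) ** 2)   # = ||y_hat||^2 - c_bar^2 * ||v||^2
```

Since `dist_sq` is strictly smaller than `np.sum(y_hat ** 2)`, the zero vector cannot be optimal among vectors whose Hankel matrix has rank at most one, so an optimal solution must have rank exactly one.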

5 Numerical experiments

In this section, we will conduct numerical experiments for our hybrid penalty method, i.e., Algorithm 2. All numerical experiments are performed in Matlab R2019a on a 64-bit PC with 3.8 GHz Intel Core i5 Quad-Core and 8GB of DDR4 RAM.

We consider the following problem with two rank constraints:

[TABLE]

where $\|y\|_{W}:=\sqrt{y^{\top}Wy}$ , $W$ is the $n\times n$ diagonal matrix so that $W(i,i)$ equals $1$ when $i$ is odd, and equals $10$ when $i$ is even, $n_{1}$ , $n_{2}$ and $n_{c}$ are given positive integers, and $\widebar{y}_{1}\in{\rm I\!R}^{n}$ and $\widebar{y}_{2}\in{\rm I\!R}^{n}$ are known noisy signals.

Let HB_1, HB_2 and HB_3 represent the three hybrid penalty methods which solve (5) by Algorithm 2 via the reformulation (3) with Variant I, Variant II and Variant III discussed in Section 3.1 respectively. Let AP represent the alternating pseudo-projection algorithm (11) applied directly to the sets $\Omega_{1}$ and $\Omega_{2}$ defined in (8), constructed based on the data from (5).

Data generation: We set $n=50$ and consider two 3-tuples $(n_{1},n_{2},n_{c})=(2,2,2)$ and $(n_{1},n_{2},n_{c})=(2,6,4)$ . For each 3-tuple, we first randomly generate two signals $y_{1}$ and $y_{2}$ from two marginally stable linear time-invariant systems of order at most $n_{1}+n_{c}$ and $n_{2}+n_{c}$ respectively, which have $n_{c}$ common poles. Then we let $\widebar{y}_{1}=y_{1}+\sigma\cdot W^{-1/2}\xi_{1}$ and $\widebar{y}_{2}=y_{2}+\sigma\cdot W^{-1/2}\xi_{2}$ , where $\sigma=0.1$ is the noise factor, and $\xi_{1}$ and $\xi_{2}$ are random vectors with i.i.d. standard Gaussian entries.
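A possible way to generate data of this type is sketched below. It assumes that each marginally stable system is realized as a sum of undamped sinusoids, so that every frequency contributes a conjugate pair of poles on the unit circle, and that $n_{1}$, $n_{2}$ and $n_{c}$ are even. The frequency range, amplitudes, phases and function names are our choices and need not match the generator used for the reported experiments.

```python
import numpy as np

def hankel(y, rows):
    """rows x (len(y) - rows + 1) Hankel matrix with entries y[i + j]."""
    cols = len(y) - rows + 1
    return np.array([[y[i + j] for j in range(cols)] for i in range(rows)])

def sinusoid_signal(freqs, n, rng):
    """Sum of sinusoids; each frequency w gives the pole pair exp(+-1j*w)."""
    t = np.arange(n)
    y = np.zeros(n)
    for w in freqs:
        amp = rng.uniform(0.5, 1.5)
        phase = rng.uniform(0.0, 2.0 * np.pi)
        y += amp * np.cos(w * t + phase)
    return y

n, n1, n2, nc, sigma = 50, 2, 6, 4, 0.1
rng = np.random.default_rng(0)

def draw(count):
    """Random frequencies away from 0 and pi."""
    return rng.uniform(0.2, np.pi - 0.2, count)

common = draw(nc // 2)                               # shared pole pairs
y1 = sinusoid_signal(np.concatenate([common, draw(n1 // 2)]), n, rng)
y2 = sinusoid_signal(np.concatenate([common, draw(n2 // 2)]), n, rng)

# W(i, i) = 1 for odd i and 10 for even i, with i counted from 1
w = np.where(np.arange(1, n + 1) % 2 == 1, 1.0, 10.0)
y1_noisy = y1 + sigma * rng.standard_normal(n) / np.sqrt(w)   # sigma * W^(-1/2) xi_1
y2_noisy = y2 + sigma * rng.standard_normal(n) / np.sqrt(w)   # sigma * W^(-1/2) xi_2
```

By construction, the Hankel matrices of the clean signals `y1` and `y2` with $n_{1}+n_{c}+1$ and $n_{2}+n_{c}+1$ rows are rank deficient, which is the structure the rank constraints are meant to recover from the noisy data.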

HB_1, HB_2 and HB_3: In Algorithm 1, we set $L_{\max}=10^{8}$ , $L_{\min}=10^{-8}$ , $\tau=2$ , $c=10^{-4}$ , $M=4$ , $L_{0}^{0}=1$ and for $l\geq 1$ ,

[TABLE]

All pseudo-projection subproblems that arise are approximately solved by calling SLRA [15] with default setting (except that the $R^{0}$ is specified as in Remark 4.9). We terminate Algorithm 1 when the number of iterations exceeds 108 or

[TABLE]

For the penalty method in Algorithm 2, we set $y^{{\rm feas}}=0$, $\lambda_{t}=\lambda_{t-1}/5$ with initial $\lambda_{0}=0.1$, $\widebar{\lambda}=10^{-4}$ and $\epsilon_{t}=\max\left\{\epsilon_{t-1}/1.5,10^{-6}\right\}$ with initial $\epsilon_{0}=10^{-5}$. Let $\widebar{y}={\rm vec}\left(\widebar{y}_{1}\ \widebar{y}_{2}\right)$. We set the initial point $y^{0}$ for HB_1 and HB_2 as a pseudo-projection of $\widebar{y}$ onto $\Omega_{1}$ and $\Omega_{2}$ respectively, obtained by calling SLRA in [15] with default setting (the reference point is the origin). For HB_3, we set $y^{0}=\widebar{y}$.
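The parameter choices above determine a short schedule for the penalty stage. The sketch below lists it under the assumption that the penalty stage keeps running while $\lambda_{t}>\widebar{\lambda}$; the precise switching test is the one stated in Algorithm 2, and the variable names are ours.

```python
lam, lam_bar = 0.1, 1e-4      # lambda_0 and the threshold lambda_bar
eps, eps_floor = 1e-5, 1e-6   # epsilon_0 and its lower bound

schedule = []
while lam > lam_bar:          # assumed test for staying in the penalty stage
    schedule.append((lam, eps))
    lam = lam / 5
    eps = max(eps / 1.5, eps_floor)

for t, (lam_t, eps_t) in enumerate(schedule):
    print(f"t={t}: lambda_t={lam_t:.3e}, epsilon_t={eps_t:.3e}")
```

Under this assumption only a handful of penalty subproblems are solved before the post-processing stage takes over, and the tolerance $\epsilon_{t}$ stays above its floor of $10^{-6}$ throughout.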

For the post-processing method in Algorithm 2, we also call SLRA in [15] with default settings to approximately compute a pseudo-projection (except that the $R^{0}$ is specified as in Remark 4.9), and terminate it when the number of iterations exceeds 105 or

[TABLE]

We output $z^{t}$ as the approximate solution.

AP: In this method, we start at $\widebar{y}={\rm vec}\left(\widebar{y}_{1}\ \widebar{y}_{2}\right)$ and call SLRA in [15] with default setting (except that the $R^{0}$ is specified as in Remark 4.9) to approximately compute a pseudo-projection onto $\Omega_{1}$ and $\Omega_{2}$ defined in (8) (the initial reference points are the origin). We also output $z^{t}$ as the approximate solution.

Numerical results: In Figure 1, we compare the four methods AP, HB_1, HB_2 and HB_3 in terms of terminating function values over 100 random instances for $(n_{1},n_{2},n_{c})=(2,2,2)$ and over 30 random instances for $(n_{1},n_{2},n_{c})=(2,6,4)$. (For each 3-tuple, we first generate $y_{1}$ and $y_{2}$ as described above. For these two fixed signals, we generate 100 (and, resp., 30) random noisy signals $\widebar{y}_{1}$ and $\widebar{y}_{2}$ and solve the corresponding instances.) One can see that while the three hybrid penalty methods HB_1, HB_2 and HB_3 have comparable performance, they always outperform AP.

In Figure 2, we compare the three hybrid penalty methods HB_1, HB_2 and HB_3 in terms of constraint violation (before and after post-processing) and CPU time over 30 random instances for $(n_{1},n_{2},n_{c})=(2,6,4)$ . We measure constraint violation by ${\rm log}_{10}(vio)$ , with vio given by

[TABLE]

where $y_{1}^{*}$ and $y_{2}^{*}$ are computed solutions, $m_{1}=n_{1}+n_{c}$ , $m_{2}=n_{2}+n_{c}$ , $m=n_{1}+n_{2}+n_{c}$ and $\Xi_{s}:=\{Y:{\rm rank}(Y)\leq s\}$ . One can see that the post-processing scheme significantly reduces constraint violation. On the other hand, HB_2 is faster than HB_1 and HB_3.
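A violation measure of this kind is naturally built from distances to the sets $\Xi_{s}$. By the Eckart-Young theorem [2], the Frobenius-norm distance from a matrix to $\Xi_{s}$ is the Euclidean norm of its singular values beyond the $s$th. The sketch below computes this building block; the function name and the use of the Frobenius norm are assumptions, and the way the individual distances are combined into vio is not reproduced.

```python
import numpy as np

def dist_to_rank_set(Y, s):
    """Frobenius distance from Y to {rank <= s}: norm of the trailing singular values."""
    sv = np.linalg.svd(Y, compute_uv=False)
    return np.sqrt(np.sum(sv[s:] ** 2))

# Diagonal example with singular values 3, 2, 1
Y = np.diag([3.0, 2.0, 1.0])
d1 = dist_to_rank_set(Y, 1)   # sqrt(2**2 + 1**2)
d3 = dist_to_rank_set(Y, 3)   # Y already has rank <= 3
```

Applying such a function to the Hankel matrices of the computed $y_{1}^{*}$ and $y_{2}^{*}$ gives a direct numerical check of how far a returned point is from satisfying each rank constraint.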

6 Concluding remarks

In this paper, we propose a hybrid penalty method for solving (1). The hybrid penalty method consists of two parts: a penalty scheme which makes use of a special penalty function as in [8], and a post-processing method for reducing constraint violation. Both the penalty subproblems and the subproblems in the post-processing method involve the new concept of pseudo-projections: we discussed in Section 4 in detail how pseudo-projections can be computed efficiently by some existing software such as [15], under mild assumptions.

There are several open questions related to pseudo-projection computation. For instance, we still do not know how likely the condition ${\rm rank}({\mathcal{A}}(y^{*}))=m$ holds for the $y^{*}$ that achieves the infimum in (26) (with $R=R^{*}$ being a stationary point of $\widetilde{\Psi}$ in (27)). (In the numerical experiments in Section 5, the condition ${\rm rank}({\mathcal{A}}(y^{*}))=m$ almost never fails for the solution $y^{*}$ returned by SLRA: for over 99.9% of our calls to SLRA, the $m$th singular value of ${\mathcal{A}}(y^{*})$ is significantly larger than its next singular value.) Even assuming $y^{*}$ is a solution of (40), we can only establish ${\rm rank}({\mathcal{H}}_{s+1}(y^{*}))=s$ when $s=1$. The case for $s>1$ is still open.

Appendix A Proof of Theorem 3.7

Before proving Theorem 3.7, we first state two auxiliary lemmas without proofs. The proof of Lemma A.1 can be found in the first paragraph in the proof of [7, Theorem 5.16], and Lemma A.2 follows from Theorem 3.5 and the same argument as in the proof of [7, Theorem 5.16].

Lemma A.1.

Let $\Omega_{1}$ and $\Omega_{2}$ be defined as in (8), $\widebar{y}\in\Omega_{1}\cap\Omega_{2}$ and define

[TABLE]

where $B$ is the closed unit ball. Then $N_{\Omega_{1}}(\widebar{y})\cap-N_{\Omega_{2}}(\widebar{y})=\{0\}$ if and only if $\widebar{c}<1$.

Lemma A.2.

Let $\Omega_{1}$ and $\Omega_{2}$ be defined as in (8). Suppose that there exists some $\widebar{y}\in\Omega_{1}\cap\Omega_{2}$ such that ${\rm rank}({\mathcal{L}}(\widebar{y}))=r$ and $N_{\Omega_{1}}(\widebar{y})\cap-N_{\Omega_{2}}(\widebar{y})=\{0\}$ . Let $\widebar{c}$ be defined as in (42). Then for any $c\in(\widebar{c},1)$ , there exist some $\epsilon>0$ and $\delta\in[0,\frac{1-c}{2})$ such that

[TABLE]

where $B_{\epsilon}(\widebar{y})$ is the closed ball with center $\widebar{y}$ and radius $\epsilon$, and $B$ is the closed unit ball.

We now prove Theorem 3.7. The proof follows the same line of arguments as in [7, Theorem 5.2].

Proof A.3.

Fix any $c\in(\widebar{c},1)$ with $\widebar{c}$ defined as in (42), and let $\delta$ and $\epsilon$ be given as in Lemma A.2. We first claim that

[TABLE]

where $c_{0}:=c+2\delta$ . To prove this, note from (21) and Definition 2.2 that

[TABLE]

If $\|x^{t+1}-z^{t+1}\|=0$ or $\|x^{t}-z^{t+1}\|=0$ , we then see from the second inequality in (47) that (45) holds trivially. Now we assume that $\|x^{t+1}-z^{t+1}\|\neq 0$ and $\|x^{t}-z^{t+1}\|\neq 0$ . We first notice from (47), $\|z^{t+1}-\widebar{y}\|\leq\frac{\epsilon}{2}$ and $\|z^{t+1}-x^{t}\|\leq\frac{\epsilon}{2}$ that

[TABLE]

Using (46), (48) and $\|z^{t+1}-\bar{y}\|\leq\frac{\epsilon}{2}$ , we obtain further that

[TABLE]

Here, $B$ represents the closed unit ball and $B_{\epsilon}(\widebar{y})$ represents the closed ball with center $\widebar{y}$ and radius $\epsilon$ . Furthermore, we see from (43), (51) and (52) that

[TABLE]

On the other hand, in view of (48), (50) and (52), we can apply (44) with $x=x^{t}$ , $z=x^{t+1}$ and $v=\frac{z^{t+1}-x^{t+1}}{\|z^{t+1}-x^{t+1}\|}$ to obtain

[TABLE]

where the second inequality follows from (49). Adding (53) and (54), we obtain

[TABLE]

which proves (45).

Note from $c_{0}=c+2\delta$ with $c\in(\bar{c},1)$ and $\delta\in[0,\frac{1-c}{2})$ that $c_{0}\in(0,1)$ . Choose initial points $x^{0}$ and $z^{0}$ such that $\gamma:=\|x^{0}-\widebar{y}\|+\|z^{0}-x^{0}\|<\frac{(1-c_{0})\epsilon}{4}$ . Next, we prove the following inequalities by induction:

[TABLE]

First, we prove that the above three inequalities hold for $t=0$ . Note from $c_{0}\in(0,1)$ , the $z$ -update in (21) and the definition of $\gamma$ that

[TABLE]

which proves (55) and (56) for $t=0$ . Then we see from $\|z^{1}-x^{0}\|<\frac{\epsilon}{2}$ , $\|z^{1}-\widebar{y}\|<\frac{\epsilon}{2}$ and (45) that

[TABLE]

which proves (57) for $t=0$ . To prove by induction, we assume that (55), (56) and (57) hold for some $t\geq 0$ . We know from the $z$ -update, (55) and (57) that

[TABLE]

This together with (56) and (57) implies

[TABLE]

We then see from $\|z^{t+2}-x^{t+1}\|<\frac{\epsilon}{2}$ , $\|z^{t+2}-\widebar{y}\|<\frac{\epsilon}{2}$ and (45) that

[TABLE]

Thus, we proved (55), (56) and (57) for $t+1$ . This completes the induction.

Now we prove that the sequence $\{z^{0},x^{0},z^{1},x^{1}\cdots\}$ is a Cauchy sequence. For any $t$ and $k>s\geq t$ , we know from (55) and (57) that

[TABLE]

Furthermore, by using (57), we have

[TABLE]

These prove that the sequence $\{z^{0},x^{0},z^{1},x^{1}\cdots\}$ is a Cauchy sequence. Therefore, it converges to some $y^{*}\in\Omega_{1}\cap\Omega_{2}$ and we have for any $t$ that

[TABLE]

Thus the sequence $\{z^{0},x^{0},z^{1},x^{1}\cdots\}$ converges $R$-linearly. This completes the proof.

Bibliography

[1] M. Chu, N. Del Buono, L. Lopez and T. Politi, On the low-rank approximation of data on the unit sphere, SIAM J. Matrix Anal. Appl., 27 (2005), pp. 46–60.
[2] C. Eckart and G. Young, The approximation of one matrix by another of lower rank, Psychometrika, 1 (1936), pp. 211–218.
[3] M. Fazel, Matrix Rank Minimization with Applications, Ph.D. thesis, Elec. Eng. Dept., Stanford University, 2002.
[4] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1996.
[5] M. Ishteva, K. Usevich and I. Markovsky, Factorization approach to structured low-rank approximation with applications, SIAM J. Matrix Anal. Appl., 35 (2014), pp. 1180–1204.
[6] N. K. Karmarkar and Y. N. Lakshman, On approximate GCDs of univariate polynomials, J. Symbolic Comput., 26 (1998), pp. 653–666.
[7] A. S. Lewis, D. R. Luke and J. Malick, Local linear convergence for alternating and averaged nonconvex projections, Found. Comput. Math., 9 (2009), pp. 485–513.
[8] T. Liu, T. K. Pong and A. Takeda, A successive difference-of-convex approximation method for a class of nonconvex nonsmooth optimization problems, To appear in Math. Program., DOI:10.1007/s10107-018-1327-8.
