TL;DR
This paper introduces BPALM and A-BPALM algorithms using Bregman distances for solving structured nonconvex problems, with convergence analysis and application to orthogonal nonnegative matrix factorization.
Contribution
It proposes novel multi-block proximal algorithms with convergence guarantees for nonconvex problems, applied specifically to ONMF with closed-form solutions.
Findings
Sequences converge to critical points of the objective
Global convergence under KL inequality
Preliminary numerical results on ONMF
Abstract
We introduce and analyze BPALM and A-BPALM, two multi-block proximal alternating linearized minimization algorithms using Bregman distances for solving structured nonconvex problems. The objective function is the sum of a multi-block relatively smooth function (i.e., relatively smooth by fixing all the blocks except one) and block separable (nonsmooth) nonconvex functions. It turns out that the sequences generated by our algorithms are subsequentially convergent to critical points of the objective function, while they are globally convergent under KL inequality assumption. Further, the rate of convergence is further analyzed for functions satisfying the {\L}ojasiewicz's gradient inequality. We apply this framework to orthogonal nonnegative matrix factorization (ONMF) that satisfies all of our assumptions and the related subproblems are solved in closed forms, where some preliminary…
| Penalty par. | BPALM | A-BPALM1 | A-BPALM2 | |||
|---|---|---|---|---|---|---|
| 1 | ||||||
| 10 | ||||||
| 100 | ||||||
| 1000 | ||||||
| 10000 | ||||||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Multi-block Bregman proximal alternating linearized minimization and its application to orthogonal nonnegative matrix factorization
Masoud Ahookhosh1, Le Thi Khanh Hien2. Nicolas Gillis2, and Panagiotis Patrinos1
Abstract.
We introduce and analyze BPALM and A-BPALM, two multi-block proximal alternating linearized minimization algorithms using Bregman distances for solving structured nonconvex problems. The objective function is the sum of a multi-block relatively smooth function (i.e., relatively smooth by fixing all the blocks except one) and block separable (nonsmooth) nonconvex functions. It turns out that the sequences generated by our algorithms are subsequentially convergent to critical points of the objective function, while they are globally convergent under KL inequality assumption. Further, the rate of convergence is further analyzed for functions satisfying the Łojasiewicz’s gradient inequality. We apply this framework to orthogonal nonnegative matrix factorization (ONMF) that satisfies all of our assumptions and the related subproblems are solved in closed forms, where some preliminary numerical results is reported.
1Department of Electrical Engineering (ESAT-STADIUS) – KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
2Department of Mathematics and Operational Research, University of Mons, Belgium
MA and PP acknowledge the support by the Research Foundation Flanders (FWO) research projects G086518N and G086318N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS project no 30468160 (SeLMA). NG also acknowledges the support by the European Research Council (ERC starting grant no 679515).
1. Introduction
Consider the structured nonsmooth nonconvex minimization problem
[TABLE]
where we will systematically assume the following hypotheses (see Section 2 for details):
Assumption I** (requirements for composite minimization (1.1)).**
- a1
* is proper and lower semicontinuous (lsc);* 2. a2
* is and -smooth relative to ; here ;* 3. a3
* is multi-block strictly convex, -coercive and essentially smooth;* 4. a4
* has a nonempty set of minimizers, i.e., , and ;* 5. a5
the first-order oracles of , , and are available.
Although, the problem (1.1) has a simple structure, it covers a broad range of optimization problems arising in signal and image processing, statistical and machine learning, control and system identification. Consequently, needless to say, there is a huge number of algorithmic studies around solving the optimization problems of the form (1.1). Among all of such methodologies, we are interested in the class of alternating minimization algorithms such as block coordinate descent [13, 16, 39, 45, 50, 51, 57, 58], block coordinate [29, 30, 38], and Gauss-Seidel methods [9, 17, 33], which assumes that all blocks are fixed except one and solves the corresponding auxiliary problem with respect to this block, update the latter block, and continue with the others. In particular, the proximal alternating minimization has received much attention in the last few years; see for example [4, 7, 8, 5, 6, 14]. Recently, the proximal alternating linearized minimization and its variation has been developed to handle (1.1); see for example [21, 48, 53].
Traditionally, the Lipschitz (Hölder) continuity of partial gradients of in (1.1) is a necessary tool for providing the convergence analysis of optimization algorithms; see, e.g., [21, 48]. It is, however, well-known that it is not the Lipschitz (Hölder) continuity of gradients playing a key role in such analysis, but one of its consequence: an upper estimation of including a Bregman distance called descent lemma ; cf. [11, 43]. This idea is central to convergence analysis of many optimization schemes requiring such an upper estimation; see, e.g., [2, 10, 11, 22, 55, 34, 35, 43]. In this paper, we propose a multi-block extension of the descent lemma given in [11, 43] and propose a Bregman proximal alternating linearized minimization (BPALM) algorithm and its adaptive version (A-BPALM) for (1.1).
1.1. Contribution
Our contribution is summarized as follows:
(Bregman proximal alternating linearized minimization) We introduce BPALM, a multi-block generalization of the proximal alternating linearized minimization (PALM) [21] using Bregman distances, and its adaptive version (A-BPALM). To do so, we extend the notion of relative smoothness [11, 43] to its multi-block counterpart to support a structured problem of the form (1.1). Owing to multi-block relative smoothness of , unlike PALM, our algorithm does not need to know the local Lipschitz moduli of partial gradients () and their lower and upper bounds, which are hard to provide in practice. 2. 2)
(Efficient framework for ONMF) Exploiting a suitable kernel for Bregman distance, it turns out that the objective of orthogonal nonnegative matrix factorization (ONMF) is multi-block relatively smooth, and the subproblems of our algorithms are solved in closed forms making them suitable for large-scale data analysis problems. To the best of our knowledge, BPALM and A-BPALM are the first algorithms with rigorous convergence theory for ONMF.
1.2. Related works
Closly related to our framework, there are two papers [40, 60]. However, we notice that [60] uses a sum separable kernel function which is just a special case of our multi-block kernel functions (see Example 2.2), and the paper only provides a limited convergence theory. Regarding [40], an algorithm (named B-PALM) proposed that is just a special case of our BPALM when , , and , which restrict its applications. We stress that involving the block separable nonsmooth nonconvex functions and considering make our analysis different from those of [40].
1.3. Organization
This paper has four sections, besides this introductory section. In Section 2, we introduce the notion of multi-block relative smoothness, and verify the fundamental properties of Bregman proximal alternating linearized mapping. In Section 3, we introduce BPALM and A-BPALM and investigate their convergence analysis. In Section 4, we show that ONMF satisfies our assumptions, the related subproblems are solved in closed forms, and report our numerical results. Finally, Section 5 delivers some conclusions.
1.4. Notation
We denote by the extended-real line. For the identity matrix , we set such that . The open ball of radius centered in is denoted as . The set of cluster points of is denoted as . A function is proper if and , in which case its domain is defined as the set . For , is the -(sub)level set of ; and are defined similarly. We say that is level bounded if is bounded for all . A vector is a subgradient of at , and the set of all such vectors is called the subdifferential [52, Definition 8.3], i.e.
[TABLE]
2. Multi-block Bregman proximal alternating linearized mapping
We first establish the notion multi-block relative smoothness, which is an extension of the relative smoothness [11, 43] for problems of the form (1.1). We then introduce Bregman alternating linearized mapping and study some of its basic properties. For notation clarity, we will use bold lower-case letters (e.g., , , ) for vectors in and use normal lower-case letters (e.g., , , ) for vectors in .
In order to extend the definition of Bregman distances for the multi-block problem (1.1), we first need to introduce the notion of multi-block kernel functions, which will coincide with the standard one (cf. [2, Definition 2.1]) if .
Definition 2.1** (multi-block convexity and kernel function).**
Let be a proper and lsc function with and such that . For a fixed vector , we define the function given by
[TABLE]
Then, we say that is
multi-block (strongly/strictly) convex* if the function is (strongly/strictly) convex for all and ;* 2. 2)
multi-block locally strongly convex* around if, for , there exists and such that*
[TABLE] 3. 3)
a multi-block kernel function if is multi-block convex and is -coercive for all and , i.e., ; 4. 4)
multi-block essentially smooth*, if for every sequence converging to a boundary point of , we have for all ;* 5. 5)
of multi block Legendre type if it is multi-block essentially smooth and multi-block strictly convex.
Example 2.2** (popular kernel functions).**
There are many kernel functions satisfying Definition 2.1. For example, for , energy, Boltzmann-Shannon entropy, Fermi-Dirac entropy (cf. [12, Example 2.3]) and several examples in [43, Section 2]; and for see two examples in [40, Section 2]. Two important classes of multi-block kernels are sum separable kernels, i.e., , and product separable kernels, i.e., , see such a kernel for ONMF in Proposition 4.1. ∎
We now give the definition of Bregman distances (cf. [26]) for multi-block kernels.
Definition 2.3** (Bregman distance).**
For a kernel function , the Bregman distance is given by
[TABLE]
Fixing all blocks except the -th one, the Bregman distance with respect to this block is given by
[TABLE]
which measures the proximity between and with respect to the -th block of variables. Moreover, the kernel is multi-block convex if and only if for all and and . Note that if is multi-block strictly convex, then () if and only if .
We are now in a position to present the notion of multi-block relative smoothness, which is the central tool for our analysis in Section 3.
Definition 2.4** (multi-block relative smoothness).**
Let be a multi-block kernel and let be a proper and lower semicontinuous function. If there exists () such that the functions given by
[TABLE]
are convex for all and , then, is called -smooth relative to .
Note that if , the multi-block relative smoothness is reduced to standard relative smoothness, which was introduced only recently in [11, 43]. In this case, if is -Lipschitz continuous, then both and are convex, i.e., the relative smoothness of generalizes the notions of Lipschitz continuity using Bregman distances. If , this definition will be reduced to the relative bi-smoothness given in [40] for .
We next characterize the notion of multi-block relative smoothness.
Proposition 2.5** (characterization of multi-block relative smoothness).**
Let be a multi-block kernel and let be a proper lower semicontinuous function and . Then, the following statements are equivalent:
- (a)
-smooth relative to ; 2. (b)
for all and ,
[TABLE] 3. (c)
for all and ,
[TABLE] 4. (d)
if and for all , then
[TABLE]
for .
{proof}
Fixing all the blocks except one of them, the results can be concluded in the same way as [43, Proposition 1.1].
2.1. Bregman proximal alternating linearized mapping
Recall that if , for a kernel function and a proper lower semicontinuous function , the Bregman proximal mapping is given by
[TABLE]
which is a generalization of the classical one using the Bregman distance (2.2) in place of the Euclidean distance; see, e.g., [27] and references therein. We note that
[TABLE]
which implies . The function is -prox-bounded if there exists such that for some ; cf. [2]. We next extend this definition to our multi-block setting.
Definition 2.6** (multi-block -prox-boundedness).**
A function is multi-block -prox-bounded if for each there exists and such that
[TABLE]
The supremum of the set of all such is the threshold of the -prox-boundedness, i.e.,
[TABLE]
For the problem (1.1), we have leading to
[TABLE]
i.e., we therefore denote . If is multi-block -prox-bounded for , so is for all . We next present equivalent conditions to this notion.
Proposition 2.7** (characteristics of multi-block -prox-boundedness).**
For a multi-block kernel function and proper and lsc functions with , the following statements are equivalent:
- (a)
* is multi-block -prox-bounded;* 2. (b)
for all and given in (2.1), is bounded below on for some ; 3. (c)
for all , .
{proof}
Suppose and let . Then, for all , it holds that
[TABLE]
Notice that is strictly convex and coercive, and as such is lower bounded. Conversely, suppose that . Then, from (2.8), we obtain
[TABLE]
which is finite, owing to -coercivity of .
Suppose that . Since is -coercive, we have
[TABLE]
Conversely, suppose . Then, there exists such that whenever . In particular
[TABLE]
where the last inequality follows from coercivity of . Since owing to lower semicontinuity, we conclude that is lower bounded on .
Let us now define the function as
[TABLE]
and the set-valued Bregman proximal alternating linearized mapping as
[TABLE]
which reduces to the Bregman forward-backward splitting mapping if ; cf. [22, 2].
Remark 2.8** (majorization model).**
Note that invoking (b), the multi-block ()-relative smoothness assumption of entails a majorization model
[TABLE]
for . ∎
In the next lemma, we show that the cost function is monotonically decreasing by minimizing the model (2.9) with respect to each block of variables.
Lemma 2.9** (Bregman proximal alternating inequality).**
Let the conditions in Assumption I hold, and let with . Then,
[TABLE]
for all .
{proof}
For , (2.10) is simplified in the form
[TABLE]
Considering , we have
[TABLE]
Since is -smooth relative to , it follows from (b) for and that
[TABLE]
giving (2.11).
Recall that a function with values is level-bounded in locally uniformly in if for each and there is a neighborhood of along with a bounded set such that for all , cf. [52]. Using this definition, the fundamental properties of the mapping are investigated in the subsequent result.
Proposition 2.10** (properties of Bregman proximal alternating linearized mapping).**
Under conditions given in Assumption I for , the following statements are true:
* is nonempty, compact, and outer semicontinuous (osc) for all ;* 2. 2)
; 3. 3)
If , then ;
{proof}
For a fixed and a vector , let us define the function given by
[TABLE]
Since and are proper and lsc, so is on the set , for a constant . We show that is level-bounded in locally uniformly in . If it is not, then there exists , with , and such that with and . This guarantees that, for sufficiently large , , i.e., and
[TABLE]
Setting , (b) ensures that there exists a constant such that
[TABLE]
Subtracting the last two inequalities, it holds that
[TABLE]
Expanding , dividing both sides by , and taking limit from both sides of this inequality as , it can be deduced that
[TABLE]
This leads to the contradiction , which implies that is level-bounded. Therefore, all assumptions of the parametric minimization theorem [36, Theorem 2.2 and Corollary 2.2] are satisfied, i.e., Item 2). If , then Lemma 2.9 implies that for , i.e., , the second inclusion follows from a4.
Remark 2.11** (sum or product separable kernel).**
Let us observe the following.
If is an additive separable function, i.e., , then (2.10) can be written in the form
[TABLE] 2. 2)
If is product separable, i.e., , then
[TABLE]
where and .
∎
3. Multi-block Bregman proximal alternating linearized minimization
We here introduce a multi-block proximal alternating linearized minimization algorithm and investigate its subsequential and global convergence, along with its convergence rate.
For a given point , we set
[TABLE]
i.e., and . Using this notation and (2.10), we next introduce the multi-block Bregman proximal alternating linearized minimization (BPALM) algorithm.
We note that each iteration of BPALM requires one call of the first-order oracle for the information needed in (3.1), and the iteration (3.1) are well-defined by Proposition 2.10. In addition, notice that if , this algorithm reduces to the common (Bregman) proximal gradient (forward-backward) method [11, 15, 22]; if , , and , then it reduces to B-PALM [40]; if and , it reduces to PALM [21]; if , then this algorithm is reduced to C-PALM [53].
We begin with showing some basic properties of the sequence generated by BPALM, involving a sufficient decrease condition.
Proposition 3.1** (sufficient decrease condition).**
Let Assumption I hold, and let be generated by BPALM. Then, the following statements are true:
the sequence is nonincreasing and
[TABLE]
where ; 2. 2)
we have
[TABLE]
i.e., for .
{proof}
Plugging and into Lemma 2.9, it holds that
[TABLE]
Summing up both sides of (3.4) from to , it follows that
[TABLE]
giving (3.2). Let us sum up both sides of (3.2) from to :
[TABLE]
Taking the limit as , (3.3) holds true. Together with , this proves the claim.
Let us consider the condition
[TABLE]
as a stopping criterion, for the accuracy parameter . Then, the first main consequence of Proposition 3.1 will provide us the iteration complexity of BPALM, which is the number of iterations needed for the stopping criterion (3.5) to be satisfied.
Corollary 3.2** (iteration complexity).**
Let Assumption I hold, and let be generated by BPALM, and let (3.5) be the stopping criterion. Then, BPALM will be terminated within iterations.
{proof}
Summing both sides of (3.2) over the first iterations and telescoping the right hand side, it holds that
[TABLE]
Assuming that for all -th iterations the stopping criterion (3.5) is not satisfied, i.e., , which leads to , giving the desired result.
In order to show the subsequential convergence of the sequence generated by BPALM, the next proposition will provide a lower bound for iterations gap using the subdifferential of .
Proposition 3.3** (subgradient lower bound for iterations gap).**
Let Assumption I hold, and let be generated by BPALM that we assume to be bounded. For a fixed , we define
[TABLE]
Then, and
[TABLE]
with in which and are Lipschitz moduli of , () on bounded sets, and and are Lipschitz moduli of , on bounded sets, respectively.
{proof}
The optimality conditions for (3.1) ensures that there exists such that
[TABLE]
leading to
[TABLE]
On the other hand, owing to [5, Proposition 2.1], the subdifferential of is given by
[TABLE]
i.e., for ,
[TABLE]
which means . It follows from the Lipschitz continuity of , on bounded sets and the assumption of being bounded that there exist , , and such that
[TABLE]
for , and
[TABLE]
Invoking the last two inequalities, it can be concluded that
[TABLE]
as claimed.
Next, we proceed to derive the subsequential convergence of the sequence generated by BPALM: every cluster point of is a critical point of . Further, we explain some fundamental properties of the set of all cluster points of the sequence .
Theorem 3.4** (subsequential convergence and properties of ).**
Let Assumption I hold, let the kernel be locally multi-block strongly convex, and let be generated by BPALM that we assume to be bounded. Then the following statements are true:
; 2. 2)
; 3. 3)
* is a nonempty, compact, and connected set;* 4. 4)
the objective function is finite and constant on .
{proof}
For a limit point of the sequence , it follows from the boundedness of this sequence that there exists an infinite index set such that the subsequence converges to as . From the lower semicontinuity of () and for , it can be deduced that
[TABLE]
By (3.1), we get
[TABLE]
Using multi-block local strong convexity of around and invoking Item 2), there exist a neighborhood for , , and such that for and
[TABLE]
This indicates that the distance between two successive iterations goes to zero for large enough . Since the sequence is bounded, and are continuous, substituting for , taking the limit from both sides of the last inequality as , and (3.10), we come to
[TABLE]
and consequently,
[TABLE]
Further, Item 2) and Proposition 3.3 ensure and
[TABLE]
i.e., . Since the subdifferential mapping is closed, we have , giving Item 1).
Item 2) is a direct consequence of Item 1), and Item 3) and Item 4) can be proved in the same way as [21, Lemma 5(iii)-(iv)].
3.1. Global convergence under Kurdyka-Łojasiewicz inequality
This section is devoted to the global convergence of BPALM under Kurdyka-Łojasiewicz inequality.
Definition 3.5** (KL property).**
A proper and lsc function has the Kurdyka-Łojasiewicz property (KL property) at if there exist a concave desingularizing function (for some ) and neighborhood with , such that
; 2. 2)
* is of class with on ;* 3. 3)
for all such that it holds that
[TABLE]
The set of all functions satisfying these conditions is denoted by .
The first inequality of this type is given in the seminal work of Łojasiewicz [41, 42] for analytic functions, which we nowadays call Łojasiewicz’s gradient inequality. Later, Kurdyka [37] showed that this in equality is valid for functions whose graph belong to an -minimal structure (see its definition in [59]). The first extensions of the KL property to nonsmooth functions was given by Bolte et al. [19, 18, 20].
The following two facts constitutes the crucial steps toward the establishment of the global convergence of the sequence generated by BPALM.
Fact 3.6** (uniformized KL property).**
[21, Lemma 6]** Let be a compact set and be a proper and lower semicontinuous functions. Assume that is constant on and satisfies the KL property at each point of . Then, there exists a , , and such that for and all in the intersection
[TABLE]
we have
[TABLE]
Fact 3.7**.**
[23*, Lemma 2.3]**
Let and be the sequences in such that and for all in which . Then, .*
Our subsequent main result indicates that the sequence generated by BPALM converges to a critical point of if it satisfies the KL property; cf. Definition 3.5.
Theorem 3.8** (global convergence).**
Let Assumption I hold, let the kernels be multi-block globally strongly convex with modulus (), and let be generated by BPALM that we assume to be bounded. If is a KL function, then the following statements are true:
The sequence has finite length, i.e.,
[TABLE] 2. 2)
The sequence converges to a stationary point of .
{proof}
Let us define the sequence given by , which is decreasing by Item 1), i.e., . We now consider two cases: (i) there exists such that ; (ii) for all .
In Case (i), invoking Item 1) implies that for all . It follows from Item 2) and multi-block strong convexity of that
[TABLE]
implying for all , which leads to Item 1).
In Case (ii), it holds that for all . From Item 3), the set of limit points of is nonempty and compact and is finite and constant on due to Item 4). Moreover, the sequence is decreasing (Item 1)), i.e., for , there exists a such that for all . For , Item 2) implies that there exists such that for . Setting and according to Fact 3.6, there exist and a desingularization function such that for any element in
[TABLE]
the following inequality holds:
[TABLE]
Let us define . Then, it follows from the concavity of and Proposition 3.3 that
[TABLE]
with . Using the arithmetic and quadratic means inequality, and applying the arithmetic and geometric means inequality, it can be concluded that
[TABLE]
We now define the sequences and as
[TABLE]
where . According to Fact 3.7, we infer , which proves Item 1).
By (3.12), the sequence is a Cauchy sequence, i.e., it converges to a stationary point , giving the desired result.
Remark 3.9**.**
In Theorem 3.4 and Theorem 3.8, we implicitly assume that the sequence is bounded. This assumption is typical in convergence analysis of proximal-type algorithms for solving general non-convex non-smooth composite optimization problem, see e.g., [5, 22]. Proposition 3.1 shows that is non-increasing; hence, it is upper bounded by . Therefore, the sequence would be bounded if has bounded level sets and is bounded below. ∎
3.2. Convergence rate under Łojasiewicz-type inequality
We now investigate the convergence rate of the sequence generated by BPALM under KL inequality of Łojasiewicz type at ( with ), i.e., there exists such that
[TABLE]
The following fact plays a key role in studying the convergence rate of the sequence generated by BPALM, where its proof can be found in [3, Lemma 1] and [24, Lemma 12].
Fact 3.10** (convergence rate of a sequence with positive elements).**
Let be a sequence in and let and be some positive constants. Suppose that and that the sequence satisfies for all sufficiently large. Then, the following assertions hold:
If , the sequences converges to [math] in a finite number of steps; 2. 2)
If , the sequences converges linearly to [math] with rate , i.e., there exist and such that
[TABLE] 3. 3)
If , there exists such that for all sufficiently large
[TABLE]
We next derive the convergence rates of the sequences and under an additional assumption that the function satisfies the KL inequality of Łojasiewicz type.
Theorem 3.11** (convergence rate).**
Let Assumption I hold, let the kernel be multi-block globally strongly convex with modulus , and let the sequence generated by BPALM converges to . If satisfies KL inequality of Łojasiewicz type (3.15), then the following assertions hold:
If , then the sequences and converge in a finite number of steps to and , respectively; 2. 2)
If , then there exist , , , and such that
[TABLE] 3. 3)
If , then there exist , , and such that
[TABLE]
{proof}
The proof has two key parts.
In the first part, we show that there exists such that for all the following inequalities hold for :
[TABLE]
Let be as described in (3.15) and for all and . By the definitions of and in (3.14) and using (3.13), we get for all . Since is nonincreasing,
[TABLE]
Together with the arithmetic and quadratic means inequality, , and Item 1), this lead to
[TABLE]
On the other hand, for , we have
[TABLE]
This inequality, together with (3.17), yields
[TABLE]
leading to
[TABLE]
where and . Let us consider the nonlinear equation
[TABLE]
which has a solution at . For and , we assume that (3.18) holds and
[TABLE]
We now consider two cases: (a) ; (b) . In Case (a), if , then . If , then , i.e., . Therefore, it holds that . In Case (b), we have that
[TABLE]
i.e., . Then, it follows from (3.18) that (3.16) holds for all .
In the second part of the proof, we will show the assertions in the statement of the theorem. For as defined in Proposition 3.3, by Item 1), we infer
[TABLE]
with and for all . Hence, all assumptions of Fact 3.10 hold with . Therefore, our results follows from this fact and (3.16).
3.3. Adaptive BPALM
The tightness of the -th block upper estimation of the function given in (b) is dependent on the parameter ; however, in general, this parameter is a global information and it might not be tight locally, i.e., one may find a such that
[TABLE]
for all with a small enough . Consequently, the majorization model described by may not be tight enough, which will consequently lead to smaller stepsizes . In this case and in the case that are not available, one can retrieve them adaptively by applying a backtracking linesearch starting from a lower estimates; see, e.g., [1, 2, 44, 46, 56].
Putting together the above discussions, we propose an adaptive version of BPALM using a backtracking linesearch; see Algorithm 2.
We next provide an upper bound on the total number of calls of oracle after iterations of A-BPALM and those needed to satisfy (3.5).
Proposition 3.12** (worst-case oracle calls).**
Let be generated by A-BPALM. Then,
after at most iterations the linesearch (Lines 4 to 7 of A-BPALM) is terminated; 2. 2)
the number of oracle call after full cycle is bounded by
[TABLE] 3. 3)
the worst-case number of oracle calls to satisfy (3.5) is given by
[TABLE]
with .
{proof}
According to 7 and 10 of A-BPALM, we have , i.e.,
[TABLE]
giving Item 1). Hence, the total number of calls of oracle after iterations is given by
[TABLE]
giving Item 2).
Following the proof of Proposition 3.1 and since the sequence is increasing with respect to , it is easy to see that
[TABLE]
On the other hand, 7 implies that , , leading to
[TABLE]
Following the proof of Corollary 3.2, we have that BPALM will be terminated within iterations. Together with Item 2), this implies that Item 3) is true.
Choosing appropriate constants , Item 1) roughly speaking says that on average each full cycle of A-BPALM needs at most oracle calls. Furthermore, in light of (3.19), Proposition 3.1 holds true by replacing with . Considering this replacement, all the results of Proposition 3.3, Theorem 3.4, Theorem 3.8, and Theorem 3.11 remain valid for A-BPALM.
Remark 3.13** (A-BPALM variant).**
We here notice that one may change Line 5 of A-BPALM as “set ", which always start the backtracking procedure from and . It is easy to see the results of Proposition 3.12 are still valid for this variant of A-BPALM. ∎
4. Application to orthogonal nonnegative matrix factorization
A natural way of analyzing large data sets is finding an effective way to represent them using dimensionality reduction methodologies. Nonnegative matrix factorization (NMF) is one such technique that has received much attention in the last few years; see, e.g., [28, 31, 32] and the references therein. In order to extract hidden and important features from data, NMF decomposes the data matrix into two factor matrices (usually much smaller than the original data matrix) by imposing componentwise nonnegativity and (possibly) sparsity constraints on these factor matrices. More precisely, let the data matrix be where each represents some data point. NMF seeks a decomposition of into a nonnegative basis matrix and a nonnegative coefficient matrix such that
[TABLE]
where is the set of element-wise nonnegative matrices. Extensive research has been carried out on variants of NMF, and most studies in this area have focused on algorithmic developments, but with very limited convergence theory. This motivates us to study the application of BPALM and A-BPALM to a variant of NMF, namely orthogonal NMF (ONMF).
4.1. Orthogonal nonnegative matrix factorization
Besides the decomposition (4.1), the orthogonal nonnegative matrix factorization (ONMF) involves an additional orthogonality constraint leading to the constrained optimization problem
[TABLE]
where is the identity matrix. By imposing the matrix to be orthogonal (as well as nonnegative), ONMF imposes that each data points is only associated with one basis vector hence ONMF is closely related to clustering problems; see [49] and the references therein. Since the projection onto the set is costly, we here consider the penalized formulation
[TABLE]
for the penalty parameter . Introducing a product separable kernel, we next show that the objective function (4.3) is multi-block relatively smooth.
Proposition 4.1** (multi-block relative smoothness of ONMF objective).**
Let the function be a kernel given by
[TABLE]
Then the function given by is -smooth relative to with
[TABLE]
{proof}
Using partial derivatives , , and the Cauchy Schwarz inequality, it can be concluded that . On the other hand, and
[TABLE]
Together with (4.5), this yields
[TABLE]
which implies .
From and the definition of directional derivative, we obtain
[TABLE]
This, , basic properties of the trace, the Cauchy-Schwarz inequality, and the submultiplicative property of the Frobenius norm imply
[TABLE]
Plugging into the directional derivative definition, we come to
[TABLE]
implying
[TABLE]
Hence, it follows from (4.5) that
[TABLE]
i.e., , as claimed.
The unconstrained version of the ONMF problem (4.2) is given by
[TABLE]
where and are the indicator functions of the sets and , respectively. Comparing to (1.1), the next setting is recognized
[TABLE]
in which both and are nonsmooth and convex, and is -smooth relative to given in (4.4); cf. Proposition 4.1. For given and , applying BPALM and A-BPALM to (4.6), and should be computed efficiently, which we study next.
Theorem 4.2** (closed-form solutions of the subproblem (3.1) for ONMF).**
Let and be the kernel functions given by
[TABLE]
i.e., . For given and , the problem (4.6), and the subproblem (3.1), the following assertions hold:
For and , the iteration is given by
[TABLE] 2. 2)
For and , the iteration is given by
[TABLE]
with and
[TABLE]
where and \tau_{2}:=\nicefrac{{{\mathopen{}\left(-2\beta_{2}^{3}-27\alpha_{2}\big{\|}\max{\mathopen{}\left\{(\alpha_{2}\|V^{k}\|_{F}^{2}+\beta_{2})V^{k}-\mu_{2}{\nabla}_{V}f(U^{k+1},V^{k}),0{}\mid{}{}\right\}\mathclose{}}\big{\|}_{F}\right)\mathclose{}}}}{{27}}.
{proof}
Setting and , it follows from (2.12) that
[TABLE]
with , giving (4.7).
By setting and invoking (2.12), we infer
[TABLE]
Let us consider the normal cone (see [54, Corollary 3.5]), where denotes the Hadamard products given pointwise by for and . The first-order optimality conditions for the latter identity leads to with . Let us consider two cases: (i) ; (ii) . In Case (i), , i.e., . In Case (ii), if , then , which contradicts , i.e., . Combining both cases, we come to the equation
[TABLE]
i.e., there exists such that that eventually lead to
[TABLE]
which is a Cardano equation and its solution is given by (4.9).
4.2. Preliminary numerical experiment
In this section, we report preliminary numerical experiments with BPALM and two variants of A-BPALM, namely,
A-BPALM1: the algorithm A-BPALM; 2. 2)
A-BPALM2: the variant of A-BPALM as described in Remark 3.13.
Since the unconstrained ONMF problem (4.6) involves the penalty term , we also consider a “continuation" variant of these algorithm that starts from some , run one of the above-mentioned algorithms until some stopping criterion holds and save its best point, and then it increases the penalty parameter and run the algorithm with the starting point as the best point of the last call, and it continues the procedure until we stop the algorithm. We refer to this procedure as continuation, which we will describe next in more details.
In our Implementation all the codes were written in MATLAB (publicly available at https://github.com/MasoudAhoo/BPALM) and runs were performed on a MacBook Pro with 2,8 GHz Intel Core i7 CPU and 16 GB RAM. On the basis of our preliminary experiments, we here set to provide the relative smoothness constants as described in (4.5), and the related step-sizes are computed by , when is set as the machine precision. For A-BPALM1 and A-BPALM1, we set , and we also set for A-BPALM1 and for A-BPALM2. For the continuation version, we set .
We first report the experiment on a synthetic data set with . Our synthetic data set is generated as follows. We use the MATLAB command to generate random nonnegative matrices and , then we generate a random orthogonal nonnegative matrix . Next, we set to obtain the -by- orthogonal decomposable matrix , and finally add 5% of noise by . Now, we use SVD-based initialization for providing starting points for our algorithms, see [25]. We here run our algorithms with both fixed penalty parameter and with the continuation scheme. For fixed penalty versions, we set , and for continuation versions we started with and stopped the inner algorithms every 3 seconds and increased by factor . We stopped the algorithms after 15 seconds of the running time.
The results of our implementation are illustrated in Figure 1. In this figure, Subfigure (a) stands for fixed penalty versions while Subfigure (b) stands for continuation versions. Hence, on Subfigure (b), the penalty is progressively increased. We make two observations: (i) In both cases A-BPALM1 and A-BPALM2 outperform BPALM while A-BPALM1 is the best among them; (ii) The continuation schemes perform much better than the fixed penalty versions, especially for A-BPALM1. In fact, although the curve on Subfigure (b) corresponds to a larger value of , A-BPALM1 achieves a much lower function value; namely around 20 on Subfigure (a) vs. 0.2 on Subfigure (b). The reason is that increasing leads to a better solution where the factor is closer to orthogonality hence closer to the ground truth.
We next report the performance of our algorithms on the Hubble telescope data set which is taken from [47]. Since the continuation versions of our algorithms perform better, we here only apply the continuation versions of BPALM, A-BPALM1, and A-BPALM2. We use the SVD-based initialization as in [49]. In this problem, each row of the matrix is a vectorized image of the Hubble telescope at a given wavelength for a total of wavelengths. Each image contains pixels. Since each pixel in the image contains mostly a single material, it makes sense to use ONMF to cluster the pixel according to the material they contain (see Figure 2 for an illustration). For this application problem, we report the final relative fidelity and orthogonal errors, i.e.,
[TABLE]
with respect to several initial values for the penalty parameter in the continuation procedure Algorithm 3. The results of our implementations are reported in Table 1 and the final outputs of the algorithms, along with the ground true Hubble image, are illustrated in Figure 2.
From Table 1, we observe that A-BPALM1 attains the better error than BPALM and A-BPALM2 in the sense of both the relative fidelity and orthogonal errors. Further, we observe that the orthogonal errors produced by the algorithms are decreasing by increasing the initial penalty parameter . From Figure 2, A-BPALM1 provides slightly better quality image compared to BPALM and A-BPALM2 (look for example at the first basis image).
5. Conclusion
We have analysed two new alternating linearized minimization algorithms called BPALM and A-BPALM for solving the popular nonconvex nonsmooth optimization problem (1.1). Convergence analysis including the subsequential convergence, the global convergence and the convergence rate of the proposed algorithms is studied under the framework of multi-block relative smoothness and multi-block kernel functions. We emphasize that, to the best of our knowledge, BPALM and A-BPALM are the first algorithms with rigorous convergence guarantee for solving ONMF in the literature. We employ BPALM and A-BPALM to solve the orthogonal nonnegative matrix factorization problem. Some preliminary numerical tests are provided to illustrate the performance of our algorithms. A comprehensive numerical experiments with several data sets and comparison with state-of-the-art algorithms are out of the scope of the current paper, which we aim for future work.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Ahookhosh , Accelerated first-order methods for large-scale convex optimization: nearly optimal complexity under strong convexity , Mathematical Methods of Operations Research, 89 (2019), pp. 319–353.
- 2[2] M. Ahookhosh, A. Themelis, and P. Patrinos , Bregman forward-backward splitting for nonconvex composite optimization: superlinear convergence to nonisolated critical points , ar Xiv:1905.11904, (2019).
- 3[3] F. J. A. Artacho, R. M. Fleming, and P. T. Vuong , Accelerating the DC algorithm for smooth functions , Mathematical Programming, 169 (2018), pp. 95–118.
- 4[4] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran , Alternating proximal algorithms for weakly coupled convex minimization problems. applications to dynamical games and PDE’s , Journal of Convex Analysis, 15 (2008), p. 485.
- 5[5] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran , Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality , Mathematics of Operations Research, 35 (2010), pp. 438–457.
- 6[6] H. Attouch, J. Bolte, and B. F. Svaiter , Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods , Mathematical Programming, 137 (2013), pp. 91–129.
- 7[7] H. Attouch, P. Redont, and A. Soubeyran , A new class of alternating proximal minimization algorithms with costs-to-move , SIAM Journal on Optimization, 18 (2007), pp. 1061–1081.
- 8[8] H. Attouch and A. Soubeyran , Inertia and reactivity in decision making as cognitive variational inequalities , Journal of Convex Analysis, 13 (2006), p. 207.
