A learning-enhanced projection method for solving convex feasibility problems

Janosch Rieger

arXiv:1905.00196·math.OC·June 18, 2020


TL;DR

This paper introduces a generalized projection method for convex feasibility problems that learns from past projection steps to improve decision-making, with proven convergence and initial numerical results.

Contribution

It presents a novel learning-enhanced projection algorithm that adapts based on previous steps, extending traditional cyclic projection methods.

Findings

1. Proven convergence of the proposed algorithm.

2. Initial numerical experiments demonstrate its effectiveness.

3. The method adapts projections based on learned geometric information.

Abstract

We propose a generalization of the method of cyclic projections, which uses the lengths of projection steps carried out in the past to learn about the geometry of the problem and decides on this basis which projections to carry out in the future. We prove the convergence of this algorithm and illustrate its behavior in a first numerical study.


Equations

$$\mathrm{proj}(x,C):=\mathrm{argmin}_{z\in C}\|x-z\|$$

$$i_{1}=m,\quad i_{\ell}=n,\quad\text{and}\quad D_{i_{s},i_{s+1}}>0\quad\forall s\in\{1,\ldots,\ell-1\}.$$

$$D^{\leftrightarrow}_{m,n}=\begin{cases}1,&0<|n-m|\leq\omega\ \text{or}\ N+m-n\leq\omega\ \text{or}\ N+n-m\leq\omega,\\0,&\text{otherwise}\end{cases}$$

$$D^{\rightarrow}_{m,n}=\begin{cases}1,&0<n-m\leq\omega\ \text{or}\ N+n-m\leq\omega,\\0,&\text{otherwise}\end{cases}$$

$$D^{\angle}_{m,n}=\begin{cases}\gg 1,&\angle(C_{m},C_{n})\ \text{known to be large},\\1,&\angle(C_{m},C_{n})\ \text{unknown},\\\ll 1,&\angle(C_{m},C_{n})\ \text{known to be small},\\0,&m=n\end{cases}$$

$$\varphi:\{1,\ldots,N\}\times\mathbbm{R}^{N\times N}_{\geq 0}\to\mathbbm{R}_{\geq 0}$$

$$\varphi_{\min}(m,D):=\beta\min_{\{n:D_{m,n}>0\}}D_{m,n}$$

$$\varphi_{av}(m,D):=\frac{\beta}{\#\{n:D_{m,n}>0\}}\sum_{\{n:D_{m,n}>0\}}D_{m,n}$$

$$D^{k}_{m,n}>0\ \text{if and only if}\ D^{0}_{m,n}>0,$$

$$x_{k+1}=\mathrm{proj}(x_{k},C_{j_{k}})\quad\forall k\in\mathbbm{N}.$$

$$\|x_{k+1}-z\|^{2}\leq\|x_{k}-z\|^{2}-\|x_{k+1}-x_{k}\|^{2}\quad\forall k\in\mathbbm{N},$$

$$\|x_{k+1}-z\|\leq\|x_{k}-z\|\leq\|x_{0}-z\|\quad\forall k\in\mathbbm{N},$$

$$\#\{k\in\mathbbm{N}:j_{k}=j\}=\infty\quad\forall j\in\{1,\ldots,N\}.$$

$$\lim_{\ell\to\infty}\|x_{k_{\ell}}-x^{*}\|=0.$$

$$j_{k_{\ell_{m}}}=j^{*}\quad\forall m\in\mathbbm{N}.$$

$$J^{*}:=\big\{j\in\{1,\ldots,N\}:x^{*}\in C_{j}\big\},\quad J_{*}=\{1,\ldots,N\}\setminus J^{*}.$$

$$k^{\prime\prime}_{\ell}:=\min\{k\in\mathbbm{N}:k>k^{\prime}_{\ell},\ j_{k}\in J_{*}\},$$

$$k^{\prime}_{\ell+1}:=\min\{k_{\ell_{m}}:m\in\mathbbm{N},\ k_{\ell_{m}}>k^{\prime\prime}_{\ell}\}$$

$$k^{\prime}_{0}<k^{\prime\prime}_{0}<k^{\prime}_{1}<k^{\prime\prime}_{1}<\ldots$$

$$\mathrm{dist}(x^{*},C_{j})\geq 2\varepsilon\quad\forall j\in J_{*}.$$

$$\|x_{k^{\prime}_{\ell}}-x^{*}\|\leq\varepsilon\quad\forall\ell\geq\ell^{*}.$$

$$\|x_{k}-x^{*}\|\leq\|x_{k^{\prime}_{\ell}}-x^{*}\|\leq\varepsilon\quad\forall\ell\geq\ell^{*},\ \forall k\in[k^{\prime}_{\ell},k^{\prime\prime}_{\ell}).$$

$$\|x_{k^{\prime\prime}_{\ell}}-x_{k^{\prime\prime}_{\ell}-1}\|\geq\|x_{k^{\prime\prime}_{\ell}}-x^{*}\|-\|x^{*}-x_{k^{\prime\prime}_{\ell}-1}\|\geq\varepsilon\quad\forall\ell\geq\ell^{*}.$$

$$\lim_{k\to\infty}\|x_{k}-z\|^{2}\leq\|x_{0}-z\|^{2}-\lim_{k\to\infty}\sum_{j=0}^{k-1}\|x_{j+1}-x_{j}\|^{2}=-\infty,$$

$$\lim_{k\to\infty}\max_{m,n\in\{1,\ldots,N\}}D^{k}_{m,n}=0,$$

$$\#\{k\in\mathbbm{N}:j_{k}=m\}=\infty\quad\forall m\in\{1,\ldots,N\}.$$

$$0\leq\|x_{k}-z\|^{2}\leq\|x_{0}-z\|^{2}-\sum_{j=0}^{k-1}\|x_{j+1}-x_{j}\|^{2}\quad\forall k\in\mathbbm{N},$$

$$\lim_{k\to\infty}\|x_{k+1}-x_{k}\|=0.$$

$$J_{\infty}:=\{m\in\{1,\ldots,N\}:\#\{k\in\mathbbm{N}:j_{k}=m\}=\infty\}.$$

$$\|x_{k+1}-x_{k}\|\leq\varepsilon\quad\text{for all}\ k\geq k^{*}.$$

$$\max_{n}D^{k^{\prime}}_{m,n}\leq\max\{\varepsilon,\max_{n}D^{k}_{m,n}\}\quad\text{whenever}\ k^{*}\leq k\leq k^{\prime}.$$


Full text

A learning-enhanced projection method for solving convex feasibility problems

Janosch Rieger

Abstract

We propose a generalization of the method of cyclic projections, which uses the lengths of projection steps carried out in the past to learn about the geometry of the problem and decides on this basis which projections to carry out in the future. We prove the convergence of this algorithm and illustrate its behavior in a first numerical study.

MSC Codes: 65H20, 52B55, 37B20, 90C59

Keywords: Method of cyclic projections, acceleration of convergence, convex feasibility problem

1 Introduction

The method of cyclic projections, originally proposed in [7], is an established numerical algorithm, which computes a point in the intersection of finitely many closed convex subsets of a Hilbert space when this intersection is nonempty. A broad overview of the convergence properties of this method as well as the underlying theory is given in [3], [4], [10] and the references therein.

Estimates for the speed of convergence of the method of cyclic projections are well-known in the case when the sets are affine linear subspaces. For this situation, accelerated variants of the original scheme, which are often based on line-search ideas, have been developed, see e.g. [5] and [13]. Recently, a first result on the speed of convergence of the method of cyclic projections has been given in the case of semi-algebraic sets, see [6]. In general, however, the method can be arbitrarily slow, see [12] for a pathological example.

When the sets are affine linear subspaces with codimension 1, the method of cyclic projections reduces to the Kaczmarz method, see [16], which has gained popularity in the context of very large, but sparse consistent linear systems, see [8]. A probabilistic version of this algorithm, which converges exponentially in expectation, has been introduced in [17], and an accelerated version of this method has been proposed in [2]. The Kaczmarz method is frequently used in medical imaging, see [15], where block and column action strategies have become a topic of active research, see [1] and [11].

The numerical method presented in this paper is intended to accelerate the method of cyclic projections in settings where the above-mentioned refined algorithms for subspaces are not applicable. The guiding idea behind the method is to gather as much information as possible on the relative geometry of the closed convex sets from the lengths of the projection steps carried out in the past. This is motivated by the convergence proof in [7], which reveals that the performance of the algorithm is, in the worst case, determined by the lengths of the projection steps carried out.

We prove that our method converges, using techniques which are common in the dynamical systems community. The main challenge is to guarantee convergence for a reasonably broad class of strategies our basic algorithm can be equipped with. As it seems very hard to quantify a speed of convergence even in the subspace case, we provide several numerical studies performed on a toy example, which provide some insight as to why and how our method can outperform the standard methods of cyclic and random projections.

2 The algorithm

Given closed convex sets $C_{1},\ldots,C_{N}\subset\mathbbm{R}^{d}$ with $C:=\cap_{j=1}^{N}C_{j}\neq\emptyset$ , we wish to find a point $x^{*}\in C$ . We first present two common projection algorithms for solving this problem in Section 2.1. Then we propose a new projection algorithm in Section 2.2, which learns the geometry of the problem to some extent from the lengths of the projection steps carried out in the past and uses this knowledge to select favourable projections in the future.

The notation used in this paper is mostly standard. Given a point $x\in\mathbbm{R}^{d}$ and a closed convex set $C\subset\mathbbm{R}^{d}$ , it is well-known that the projection

$$\mathrm{proj}(x,C):=\mathrm{argmin}_{z\in C}\|x-z\|$$

of $x$ to $C$ exists and is a unique point.
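For some simple sets the projection has a closed form. The following minimal sketch illustrates this for a Euclidean ball and a half-space; the function names `proj_ball` and `proj_halfspace` are hypothetical and chosen only for this illustration.

```python
import math

def proj_ball(x, center, radius):
    """Project x onto the closed Euclidean ball {z : ||z - center|| <= radius}."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= radius:
        return list(x)  # x already lies in the set
    scale = radius / dist
    return [ci + scale * d for ci, d in zip(center, diff)]

def proj_halfspace(x, a, b):
    """Project x onto the closed half-space {z : <a, z> <= b}, with a != 0."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0:
        return list(x)  # constraint already satisfied
    norm_sq = sum(ai * ai for ai in a)
    return [xi - violation / norm_sq * ai for xi, ai in zip(x, a)]

p = proj_ball([3.0, 4.0], [0.0, 0.0], 1.0)       # point outside the unit ball
q = proj_halfspace([2.0, 2.0], [1.0, 0.0], 1.0)  # point violating z_1 <= 1
```

In both cases a point that already belongs to the set is returned unchanged, which is the defining property of the minimizer above.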

By $\mathrm{randperm}(1,\ldots,N)$ we denote a permutation of the numbers $1,\ldots,N$ which is sampled uniformly from the set of all such permutations, and by $\mathrm{urs}(I)$ , we denote a uniform random sample from an index set $I\subset\{1,\ldots,N\}$ .

2.1 The benchmark: MCP and MRP

The now classical method of cyclic projections, which was originally published in [7], approximates a point $x^{*}\in C$ by iteratively projecting to the sets $C_{1},\ldots,C_{N}$ in a cyclic fashion, see Algorithm 1.

Algorithm 1 may converge very slowly when many of the projection steps are small. This behavior may originate from an unfavorable ordering of the sets $C_{1},\ldots,C_{N}$ , which can be remedied by randomly shuffling the order of the sets in every cycle, see Algorithm 2.

The random Kaczmarz method proposed in [17] is a prominent variant of Algorithm 2 in the framework of row-action methods for solving linear systems, which is known to converge in expectation. Since MRP slightly outperformed the random Kaczmarz method in all examples we have studied, we use MRP as the benchmark for randomized algorithms.
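The pseudocode of Algorithms 1 and 2 is not reproduced in this text. As a rough illustration of the two benchmark strategies described above, the following sketch applies a fixed sweep (MCP) and a reshuffled sweep (MRP) to three lines through the origin in the plane. The helper `line_projection`, the choice of sets, and the number of cycles are assumptions made only for this example.

```python
import math
import random

def line_projection(angle):
    """Return the projection onto the line through 0 with direction `angle`."""
    u = (math.cos(angle), math.sin(angle))
    def project(x):
        t = x[0] * u[0] + x[1] * u[1]
        return [t * u[0], t * u[1]]
    return project

def mcp(x, projections, cycles):
    """Cyclic projections: sweep through the sets in a fixed order."""
    for _ in range(cycles):
        for project in projections:
            x = project(x)
    return x

def mrp(x, projections, cycles, rng):
    """Cyclic projections with the order reshuffled in every cycle."""
    for _ in range(cycles):
        order = list(range(len(projections)))
        rng.shuffle(order)  # plays the role of randperm(1, ..., N)
        for j in order:
            x = projections[j](x)
    return x

# Three lines through the origin; their intersection is {0}.
projections = [line_projection(a) for a in (0.0, math.pi / 3, 2 * math.pi / 3)]
x0 = [1.0, 2.0]
x_mcp = mcp(x0, projections, cycles=20)
x_mrp = mrp(x0, projections, cycles=20, rng=random.Random(0))
```

Both iterates end up close to the origin, the only common point of the three lines. This symmetric configuration does not separate the two methods in terms of speed; it only shows the control flow.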

2.2 A projection algorithm with learning ability

The idea behind Algorithm 3 (PAM) is to keep a record of the lengths of projection steps performed in the past and to give preference to operations that have led to large projection steps. This enables our algorithm to learn to some extent the geometry of the problem with manageable additional computational cost.

From a formalistic point of view, our approach resembles to some extent the techniques of loping and flagging introduced in [11] in the setting of row-action methods. These techniques suppress the effect of noise in the data on MCP by ignoring projections which had very small residuals in previous cycles. From a phenomenological perspective, however, these modifications of MCP do not have much in common with PAM.

In the following, we give an intuitive description of how some of the individual components interact in Algorithm 3. They will be treated with proper mathematical rigour in the next section.

i) The sequence of matrices $(D^{k})_{k\in\mathbbm{N}}$ records – up to the impact of the function $\varphi$ – the length of the $k$ -th projection step from set $C_{j_{k}}$ to $C_{j_{k+1}}$ in the component $D^{k+1}_{j_{k},j_{k+1}}$ .

ii) The input $D^{0}$ has three distinct effects.

a) If $D^{0}_{m,n}=0$ , a transition from $C_{m}$ to $C_{n}$ will not occur during the entire runtime of the algorithm, see Lemma 5. Thus, by choosing a sparse $D^{0}$ as in Example 2(i), one can limit the amount of information that needs to be stored and processed at runtime. For a graphic illustration, see Example 12, and for the impact on performance in the context of a toy model, see Example 13.

b) The choice of $D^{0}$ can incorporate a priori knowledge: The more likely a transition from $C_{m}$ to $C_{n}$ is to be beneficial, the larger the entry $D^{0}_{m,n}$ should be chosen, see Example 2(ii).

c) If the entries of $D^{0}$ are small relative to the first several lengths $\|x_{k+1}-x_{k}\|$ of steps to be carried out, the algorithm will not perform well in an initial stage, see Example 11(ii). If they are larger, the algorithm will initially behave like MRP, see Example 11(i).

iii) The function $\varphi$ modifies the step length before it is recorded in the matrix $D^{k+1}$ .

a) It ensures that only strictly positive values are written into $D^{k+1}$ .

b) It determines at what level of overall performance a transition from $C_{m}$ to $C_{n}$ will get reactivated after it generated a short step.

Finally, we would like to mention that we represent the recorded step-lengths in a matrix $D^{k}$ to keep the notation manageable. Depending on the size of the problem, the sparsity pattern of $D^{0}$ and the policy $\varphi$ , it can be beneficial to use a different data structure – such as one red-black tree per set $C_{j}$ – to store and search this data with moderate on-cost compared to MCP and MRP.
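Since the pseudocode of Algorithm 3 is not reproduced in this text, the following sketch is only a reconstruction from the description above and should not be read as the author's algorithm. Two rules are assumptions: the next set is drawn uniformly from the indices with the largest entry in the current row of $D^{k}$, and the recorded value is the maximum of the actual step length and the policy value $\varphi(j_{k},D^{k})$. The names `pam_sketch`, `phi_min` and `line_projection` are hypothetical.

```python
import math
import random

def line_projection(angle):
    """Return the projection onto the line through 0 with direction `angle`."""
    u = (math.cos(angle), math.sin(angle))
    def project(x):
        t = x[0] * u[0] + x[1] * u[1]
        return [t * u[0], t * u[1]]
    return project

def phi_min(m, D, beta):
    """beta times the smallest positive entry of row m (0.0 for a zero row)."""
    positive = [d for d in D[m] if d > 0]
    return beta * min(positive) if positive else 0.0

def pam_sketch(x, projections, D0, beta, steps, rng):
    """Projection loop steered by recorded step lengths (illustrative only)."""
    D = [row[:] for row in D0]  # work on a copy of the initial matrix
    j = 0                       # x is assumed to start in the first set
    for _ in range(steps):
        best = max(D[j])
        # assumed rule: uniform sample among the largest entries of row j
        candidates = [n for n, d in enumerate(D[j]) if d == best]
        n = rng.choice(candidates)
        x_new = projections[n](x)
        step = math.dist(x, x_new)
        # assumed update: record the step length, floored by the policy value
        D[j][n] = max(step, phi_min(j, D, beta))
        x, j = x_new, n
    return x, D

projections = [line_projection(a) for a in (0.0, math.pi / 3, 2 * math.pi / 3)]
N = len(projections)
D0 = [[0.0 if m == n else 1.0 for n in range(N)] for m in range(N)]
x_pam, D_final = pam_sketch([1.0, 0.0], projections, D0, beta=0.01,
                            steps=60, rng=random.Random(0))
```

In this sketch, entries that are zero in `D0` are never selected and never overwritten, while positive entries stay positive, which mirrors the role of $D^{0}$ described in item ii) a).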

3 Admissible input

For Algorithm 3 to converge, we require the inputs to have certain properties. The matrix $D^{0}$ is required to be irreducible in the following sense.

Definition 1 (admissible matrix).

A matrix $D\in\mathbbm{R}^{N\times N}_{\geq 0}$ is called admissible if it satisfies

i) $D_{m,m}=0$ for all $m\in\{1,\ldots,N\}$ , and

ii) for any indices $m,n\in\{1,\ldots,N\}$ with $m\neq n$ , there exist some $\ell\in\mathbbm{N}$ and indices $i_{1},\ldots,i_{\ell}\in\{1,\ldots,N\}$ such that

$$i_{1}=m,\quad i_{\ell}=n,\quad\text{and}\quad D_{i_{s},i_{s+1}}>0\quad\forall s\in\{1,\ldots,\ell-1\}.$$

We give a few examples of how the matrix $D^{0}$ can be chosen.

Example 2 (some admissible matrices).

For a good performance of Algorithm 3, it is helpful to multiply the matrices proposed below with a positive scalar to ensure that their respective nonzero entries are – at least on average and for small $k$ – similar to or larger than the length $\|x_{k+1}-x_{k}\|$ of the $k$ -th projection step from set $C_{j_{k}}$ to $C_{j_{k+1}}$ .

i) To limit the effective size of the matrices $D^{k}$ , one can choose $D^{0}$ to be a sparse matrix such as the banded matrices $D^{\leftrightarrow}\in\mathbbm{R}^{N\times N}$ given by

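$$D^{\leftrightarrow}_{m,n}=\begin{cases}1,&0<|n-m|\leq\omega\ \text{or}\ N+m-n\leq\omega\ \text{or}\ N+n-m\leq\omega,\\0,&\text{otherwise}\end{cases}$$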

and $D^{\rightarrow}\in\mathbbm{R}^{N\times N}$ given by

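$$D^{\rightarrow}_{m,n}=\begin{cases}1,&0<n-m\leq\omega\ \text{or}\ N+n-m\leq\omega,\\0,&\text{otherwise}\end{cases}$$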

with some $\omega\in\mathbbm{N}$ with $1\leq\omega\ll N$ .

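ii) In scenarios where the concept of an angle makes sense, it is reasonable to work with a matrix $D^{\angle}\in\mathbbm{R}^{N\times N}$ given by

$$D^{\angle}_{m,n}=\begin{cases}\gg 1,&\angle(C_{m},C_{n})\ \text{known to be large},\\1,&\angle(C_{m},C_{n})\ \text{unknown},\\\ll 1,&\angle(C_{m},C_{n})\ \text{known to be small},\\0,&m=n\end{cases}$$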

to introduce a bias in favour of transitions with large angles, which are more likely to result in large step-lengths.

It is easy to check that the above matrices are admissible. Please note that MCP is a special case of Algorithm 3, which can be realized by choosing $D^{0}$ to be the matrix $D^{\rightarrow}$ with $\omega=1$ .
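The admissibility check can also be carried out mechanically. The sketch below builds the two banded matrices with 0-based indices and tests Definition 1 by a graph search along positive entries. The function names are hypothetical, and the cyclic index arithmetic `(n - m) % N` is an assumption that agrees with the stated conditions for $1\leq\omega<N$.

```python
def band_forward(N, omega):
    """D^{->}: entry (m, n) is 1 if n lies 1..omega positions ahead of m, cyclically."""
    return [[1.0 if 1 <= (n - m) % N <= omega else 0.0 for n in range(N)]
            for m in range(N)]

def band_symmetric(N, omega):
    """D^{<->}: entry (m, n) is 1 if m and n are at cyclic distance 1..omega."""
    def cyclic_distance(m, n):
        return min((n - m) % N, (m - n) % N)
    return [[1.0 if 1 <= cyclic_distance(m, n) <= omega else 0.0
             for n in range(N)] for m in range(N)]

def is_admissible(D):
    """Definition 1: nonnegative entries, zero diagonal, and a chain of
    positive entries leading from any index to any other index."""
    N = len(D)
    if any(d < 0 for row in D for d in row):
        return False
    if any(D[m][m] != 0 for m in range(N)):
        return False
    for start in range(N):
        seen, stack = {start}, [start]
        while stack:  # depth-first search along positive entries
            m = stack.pop()
            for n in range(N):
                if D[m][n] > 0 and n not in seen:
                    seen.add(n)
                    stack.append(n)
        if len(seen) < N:
            return False
    return True

D_cyclic = band_forward(4, 1)  # the pattern that reproduces the cyclic order
```

For `band_forward(4, 1)` the only positive entry in row `m` sits at column `(m + 1) % 4`, so the only admissible transitions follow the cyclic order of the sets.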

The function $\varphi$ is required to be strictly positive on all meaningful input, and it must ensure a certain decay of the entries of $D$ .

Definition 3 (admissible policies).

A function

$$\varphi:\{1,\ldots,N\}\times\mathbbm{R}^{N\times N}_{\geq 0}\to\mathbbm{R}_{\geq 0}$$

is called an admissible policy if there exists $\beta\in(0,1)$ such that

i) $\varphi(m,D)>0$ holds for all $m\in\{1,\ldots,N\}$ and $D\in\mathbbm{R}^{N\times N}_{\geq 0}$ satisfying $\max_{n}D_{m,n}\neq 0$ , and

ii) $\varphi(m,D)\leq\beta\max_{n}D_{m,n}$ for all $m\in\{1,\ldots,N\}$ and $D\in\mathbbm{R}^{N\times N}_{\geq 0}$ .

We propose some particular policies $\varphi$ .

Example 4 (some admissible policies).

It is easy to check that both policies proposed below are indeed admissible for every $\beta\in(0,1)$ .

i) The function

$$\varphi_{\min}(m,D):=\beta\min_{\{n:D_{m,n}>0\}}D_{m,n}$$

ensures that the number which is written into the distance matrix $D^{k+1}$ in line 7 of Algorithm 3 is at least $\beta$ times the minimal previously recorded step-length from $C_{m}$ to another $C_{n}$ .

ii) The function

$$\varphi_{av}(m,D):=\frac{\beta}{\#\{n:D_{m,n}>0\}}\sum_{\{n:D_{m,n}>0\}}D_{m,n}$$

ensures that the number which is written into $D^{k+1}$ in line 7 of Algorithm 3 is at least $\beta$ times the average of the previously recorded step-lengths from $C_{m}$ to another $C_{n}$ .

Note that the value of the minimal nonzero entry in a row as well as the average of the nonzero entries in a row can be updated with negligible computational cost in every step.
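A small sketch of the two policies, written for a matrix stored as a list of rows with 0-based indices. The function names are hypothetical. Returning `0.0` for a row without positive entries is the value forced by condition ii) of Definition 3, since the row maximum is zero in that case.

```python
def phi_min(m, D, beta):
    """beta times the smallest positive entry in row m."""
    positive = [d for d in D[m] if d > 0]
    return beta * min(positive) if positive else 0.0

def phi_av(m, D, beta):
    """beta times the average of the positive entries in row m."""
    positive = [d for d in D[m] if d > 0]
    return beta * sum(positive) / len(positive) if positive else 0.0

def satisfies_definition_3(phi, D, beta):
    """Check conditions i) and ii) of Definition 3 on every row of one matrix D."""
    for m in range(len(D)):
        row_max = max(D[m])
        value = phi(m, D, beta)
        if row_max != 0 and not value > 0:
            return False  # condition i) violated
        if value > beta * row_max:
            return False  # condition ii) violated
    return True

D = [[0.0, 2.0, 4.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
beta = 0.5
values_min = [phi_min(m, D, beta) for m in range(3)]
values_av = [phi_av(m, D, beta) for m in range(3)]
```

Such a check on a single matrix is of course no proof of admissibility; it only makes the two conditions concrete.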

The proof of the following statement is elementary.

Lemma 5 (preservation of sparsity pattern).

Let both $D^{0}\in\mathbbm{R}^{N\times N}_{\geq 0}$ and $\varphi:\{1,\ldots,N\}\times\mathbbm{R}^{N\times N}_{\geq 0}\to\mathbbm{R}_{\geq 0}$ be admissible, and let $(D^{k})_{k\in\mathbbm{N}}\in(\mathbbm{R}^{N\times N}_{\geq 0})^{\mathbbm{N}}$ be the matrices generated by Algorithm 3 with arbitrary initial value $x\in\mathbbm{R}^{d}$ . Then for any $k\in\mathbbm{N}$ and $m,n\in\{1,\ldots,N\}$ , we have

$$D^{k}_{m,n}>0\ \text{if and only if}\ D^{0}_{m,n}>0,$$

and, in particular, the matrices $D^{k}$ are admissible for all $k\in\mathbbm{N}$ .

4 Convergence analysis

We first prove a general principle for projection algorithms in Section 4.1. Then we show in Section 4.2 that Algorithm 3 satisfies the assumptions of this statement.

4.1 Recurrence implies convergence

We restate a slightly modified version of Corollaries 1 and 2 from [7].

Lemma 6 (projections reduce error).

Let $C_{1},\ldots,C_{N}\subset\mathbbm{R}^{d}$ be closed convex sets and $z\in\cap_{j=1}^{N}C_{j}$ , and let the sequences $(j_{k})_{k\in\mathbbm{N}}\in\{1,\ldots,N\}^{\mathbbm{N}}$ and $(x_{k})_{k\in\mathbbm{N}}\in(\mathbbm{R}^{d})^{\mathbbm{N}}$ satisfy

$$x_{k+1}=\mathrm{proj}(x_{k},C_{j_{k}})\quad\forall k\in\mathbbm{N}.$$

Then we have

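$$\|x_{k+1}-z\|^{2}\leq\|x_{k}-z\|^{2}-\|x_{k+1}-x_{k}\|^{2}\quad\forall k\in\mathbbm{N},\tag{1}$$

$$\|x_{k+1}-z\|\leq\|x_{k}-z\|\leq\|x_{0}-z\|\quad\forall k\in\mathbbm{N}.\tag{2}$$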
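The first estimate of Lemma 6, $\|x_{k+1}-z\|^{2}\leq\|x_{k}-z\|^{2}-\|x_{k+1}-x_{k}\|^{2}$, can be observed numerically. The sketch below projects alternately, in random order, onto two axis-parallel boxes and records the slack in this inequality for a fixed common point $z$. The boxes, the starting point, and the helper names are assumptions chosen only for illustration.

```python
import random

def proj_box(x, lo, hi):
    """Project x onto the box [lo_1, hi_1] x ... x [lo_d, hi_d] by clipping."""
    return [min(max(xi, l), h) for xi, l, h in zip(x, lo, hi)]

def sq_dist(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

rng = random.Random(1)
boxes = [([-1.0, -1.0], [2.0, 1.0]), ([0.0, -2.0], [1.0, 3.0])]  # (lo, hi) pairs
z = [0.5, 0.0]   # a point contained in both boxes
x = [5.0, -7.0]  # starting point outside both boxes
gaps = []
for k in range(50):
    lo, hi = boxes[rng.randrange(len(boxes))]
    x_next = proj_box(x, lo, hi)
    # slack: ||x_k - z||^2 - ||x_{k+1} - x_k||^2 - ||x_{k+1} - z||^2
    gaps.append(sq_dist(x, z) - sq_dist(x_next, x) - sq_dist(x_next, z))
    x = x_next
```

Every recorded slack is nonnegative up to rounding, so the squared distance to $z$ drops by at least the squared step length in each projection.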

Now we show that every projection algorithm that projects to every set $C_{j}$ infinitely often generates a sequence that converges to a point in $C$ . Related results are known in the community working on firmly nonexpansive operators, see Theorem 4.1 from [9]. We include an explicit statement of this fact and an elementary proof to keep the paper self-contained.

Proposition 7 (recurrence implies convergence).

Let $C_{1},\ldots,C_{N}\subset\mathbbm{R}^{d}$ be closed convex sets with $\cap_{j=1}^{N}C_{j}\neq\emptyset$ , and let $(j_{k})_{k\in\mathbbm{N}}\in\{1,\ldots,N\}^{\mathbbm{N}}$ and $(x_{k})_{k\in\mathbbm{N}}\in(\mathbbm{R}^{d})^{\mathbbm{N}}$ be sequences which satisfy

$$x_{k+1}=\mathrm{proj}(x_{k},C_{j_{k}})\quad\forall k\in\mathbbm{N}$$

as well as the recurrence condition

$$\#\{k\in\mathbbm{N}:j_{k}=j\}=\infty\quad\forall j\in\{1,\ldots,N\}.\tag{3}$$

Then there exists $x^{*}\in\cap_{j=1}^{N}C_{j}$ such that $\lim_{k\to\infty}x_{k}=x^{*}$ .

Proof.

Because of statement (2) of Lemma 6, there exist a subsequence $(x_{k_{\ell}})_{\ell\in\mathbbm{N}}$ of $(x_{k})_{k\in\mathbbm{N}}$ and $x^{*}\in\mathbbm{R}^{d}$ such that

$$\lim_{\ell\to\infty}\|x_{k_{\ell}}-x^{*}\|=0.\tag{4}$$

Clearly, there exist $j^{*}\in\{1,\ldots,N\}$ and a subsequence $(k_{\ell_{m}})_{m\in\mathbbm{N}}$ of the sequence $(k_{\ell})_{\ell\in\mathbbm{N}}$ with

$$j_{k_{\ell_{m}}}=j^{*}\quad\forall m\in\mathbbm{N}.$$

Since $C_{j^{*}}$ is closed, we have $x^{*}\in C_{j^{*}}$ . We partition $\{1,\ldots,N\}$ into

$$J^{*}:=\big\{j\in\{1,\ldots,N\}:x^{*}\in C_{j}\big\},\quad J_{*}=\{1,\ldots,N\}\setminus J^{*}.$$

By the above, we have $J^{*}\neq\emptyset$ . Assume that $J_{*}\neq\emptyset$ . By induction, using statement (3), we can construct sequences $(k^{\prime}_{\ell})_{\ell\in\mathbbm{N}}\in\mathbbm{N}^{\mathbbm{N}}$ and $(k^{\prime\prime}_{\ell})_{\ell\in\mathbbm{N}}\in\mathbbm{N}^{\mathbbm{N}}$ given by $k^{\prime}_{0}:=k_{\ell_{0}}$ and the iteration

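$$k^{\prime\prime}_{\ell}:=\min\{k\in\mathbbm{N}:k>k^{\prime}_{\ell},\ j_{k}\in J_{*}\},$$

$$k^{\prime}_{\ell+1}:=\min\{k_{\ell_{m}}:m\in\mathbbm{N},\ k_{\ell_{m}}>k^{\prime\prime}_{\ell}\}$$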

for $\ell\in\mathbbm{N}$ . In particular, we have

$$k^{\prime}_{0}<k^{\prime\prime}_{0}<k^{\prime}_{1}<k^{\prime\prime}_{1}<\ldots$$

Since $J_{*}$ is finite, there exists $\varepsilon>0$ such that

$$\mathrm{dist}(x^{*},C_{j})\geq 2\varepsilon\quad\forall j\in J_{*}.$$

By construction of the sequence $(k_{\ell}^{\prime})_{\ell\in\mathbbm{N}}$ , there exists $\ell^{*}\in\mathbbm{N}$ such that

$$\|x_{k^{\prime}_{\ell}}-x^{*}\|\leq\varepsilon\quad\forall\ell\geq\ell^{*}.$$

Applying statement (2) of Lemma 6 with $z=x^{*}$ and the system of sets $\{C_{j}:j\in J^{*}\}$ , and using the construction of the sequence $(k_{\ell}^{\prime\prime})_{\ell\in\mathbbm{N}}$ , we obtain

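$$\|x_{k}-x^{*}\|\leq\|x_{k^{\prime}_{\ell}}-x^{*}\|\leq\varepsilon\quad\forall\ell\geq\ell^{*},\ \forall k\in[k^{\prime}_{\ell},k^{\prime\prime}_{\ell}).$$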

On the other hand, we have $\|x_{k_{\ell}^{\prime\prime}}-x^{*}\|\geq 2\varepsilon$ for all $\ell\in\mathbbm{N}$ , so

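$$\|x_{k^{\prime\prime}_{\ell}}-x_{k^{\prime\prime}_{\ell}-1}\|\geq\|x_{k^{\prime\prime}_{\ell}}-x^{*}\|-\|x^{*}-x_{k^{\prime\prime}_{\ell}-1}\|\geq\varepsilon\quad\forall\ell\geq\ell^{*}.\tag{5}$$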

Now let $z\in\cap_{j=1}^{N}C_{j}$ and use statement (5) and statement (1) from Lemma 6 multiple times to obtain

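$$\lim_{k\to\infty}\|x_{k}-z\|^{2}\leq\|x_{0}-z\|^{2}-\lim_{k\to\infty}\sum_{j=0}^{k-1}\|x_{j+1}-x_{j}\|^{2}=-\infty,$$

which is a contradiction. Hence $J_{*}=\emptyset$ , and $x^{*}\in\cap_{j=1}^{N}C_{j}$ . Now statement (4) and statement (2) of Lemma 6 with $z=x^{*}$ imply $\lim_{k\to\infty}x_{k}=x^{*}$ , as desired. ∎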

4.2 Convergence of PAM

We check that Algorithm 3 satisfies the assumptions of Proposition 7, whenever the matrix $D^{0}$ and the policy $\varphi$ are admissible.

Proposition 8 (PAM is recurrent).

Let $C_{1},\ldots,C_{N}\subset\mathbbm{R}^{d}$ be closed convex sets which satisfy $\cap_{j=1}^{N}C_{j}\neq\emptyset$ , and let the matrix $D^{0}\in\mathbbm{R}^{N\times N}_{\geq 0}$ and the policy $\varphi:\{1,\ldots,N\}\times\mathbbm{R}^{N\times N}_{\geq 0}\to\mathbbm{R}_{\geq 0}$ be admissible. Then for any initial point $x_{0}\in C_{1}$ , the sequences $(j_{k})_{k\in\mathbbm{N}}\in\{1,\ldots,N\}^{\mathbbm{N}}$ and $(D^{k})_{k\in\mathbbm{N}}\in(\mathbbm{R}_{\geq 0}^{N\times N})^{\mathbbm{N}}$ generated by Algorithm 3 satisfy

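$$\lim_{k\to\infty}\max_{m,n\in\{1,\ldots,N\}}D^{k}_{m,n}=0,\tag{6}$$

$$\#\{k\in\mathbbm{N}:j_{k}=m\}=\infty\quad\forall m\in\{1,\ldots,N\}.\tag{7}$$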

Proof.

Let $z\in\cap_{j=1}^{N}C_{j}$ . Applying inequality (1) from Lemma 6 multiple times yields

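$$0\leq\|x_{k}-z\|^{2}\leq\|x_{0}-z\|^{2}-\sum_{j=0}^{k-1}\|x_{j+1}-x_{j}\|^{2}\quad\forall k\in\mathbbm{N},$$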

which forces

$$\lim_{k\to\infty}\|x_{k+1}-x_{k}\|=0.\tag{8}$$

Let us denote

$$J_{\infty}:=\{m\in\{1,\ldots,N\}:\#\{k\in\mathbbm{N}:j_{k}=m\}=\infty\}.$$

Obviously, we have $J_{\infty}\neq\emptyset$ . Let $m\in J_{\infty}$ , and let $(k_{\ell})_{\ell\in\mathbbm{N}}\in\mathbbm{N}^{\mathbbm{N}}$ be the maximal strictly increasing sequence with $j_{k_{\ell}}=m$ for all $\ell\in\mathbbm{N}$ . Let $\varepsilon>0$ . By statement (8), there exists $k^{*}\in\mathbbm{N}$ such that

$$\|x_{k+1}-x_{k}\|\leq\varepsilon\quad\text{for all}\ k\geq k^{*}.$$

Because of line 6 of Algorithm 3 and since $\varphi$ is admissible with a decay rate $\beta\in(0,1)$ , we have

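$$\max_{n}D^{k^{\prime}}_{m,n}\leq\max\{\varepsilon,\max_{n}D^{k}_{m,n}\}\quad\text{whenever}\ k^{*}\leq k\leq k^{\prime}.\tag{9}$$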

Now let $\ell\in\mathbbm{N}$ be such that $k_{\ell}\geq k^{*}$ . We wish to show that

[TABLE]

To this end, we introduce the quantity

[TABLE]

and prove the statement

[TABLE]

by induction. Statement (11) is trivial for $p=0$ . Assume that statement (11) holds for some $p\in\{0,\ldots,N-1\}$ . If $\nu(p)=0$ , then statement (9) implies that $\nu(p+1)=0$ , and that the induction hypothesis (11) holds for $p+1$ . If $\nu(p)>0$ , then line 3 of Algorithm 3 selects an index

[TABLE]

that satisfies

[TABLE]

By line 6 of Algorithm 3 and by statement (9), we have

[TABLE]

By construction of the sequence $(k_{\ell})_{\ell\in\mathbbm{N}}$ , it follows that

[TABLE]

so $\nu(p+1)=\nu(p)-1$ , and statement (11) holds for $p+1$ . This completes the induction, and statement (10) is verified, because $\nu(N)=0$ . Since $\beta<1$ , statements (9) and (10) imply that there exists $k^{**}\in\mathbbm{N}$ such that $\max_{n}D^{k}_{m,n}\leq\varepsilon$ for all $k\geq k^{**}$ . Since $m\in J_{\infty}$ and $\varepsilon>0$ were arbitrary, we have shown that

[TABLE]

In view of Lemma 5, this implies

[TABLE]

Since $D^{0}$ satisfies part ii) of Definition 1, a simple recursion on statement (13) yields $J_{\infty}=\{1,\ldots,N\}$ , which is statement (7). Consequently, statement (12) implies (6). ∎

Now we summarize the above in the main theoretical result of this paper.

Theorem 9 (convergence of PAM).

Let $C_{1},\ldots,C_{N}\subset\mathbbm{R}^{d}$ be closed convex sets which satisfy $\cap_{j=1}^{N}C_{j}\neq\emptyset$ , and let the matrix $D^{0}\in\mathbbm{R}^{N\times N}_{\geq 0}$ and the policy $\varphi:\{1,\ldots,N\}\times\mathbbm{R}^{N\times N}_{\geq 0}\to\mathbbm{R}_{\geq 0}$ be admissible. Then there exists $x^{*}\in\cap_{j=1}^{N}C_{j}$ such that the sequence $(x_{k})_{k\in\mathbbm{N}}\in(\mathbbm{R}^{d})^{\mathbbm{N}}$ generated by Algorithm 3 satisfies

$$\lim_{k\to\infty}x_{k}=x^{*}.$$

Proof.

Proposition 8 verifies that Algorithm 3 satisfies the assumptions of Proposition 7, so PAM is indeed convergent. ∎

5 An instructive toy example

We explore the performance of PAM with different matrices $D^{0}$ and policies $\varphi$ in a very simple toy example, and compare its behavior with MCP and MRP. We are fully aware that this example has many unrealistic features, but it allows us to illustrate key features of our algorithm in a nice graphic way. To keep things simple, we measure the computational cost of all three algorithms in the number of iterations, which is the number of projection steps carried out.

Throughout this section, we consider the one-dimensional subspaces

[TABLE]

with $\cap_{j=1}^{N}C_{j}=\{0\}$ , and the initial point $x_{0}=(\cos(\frac{\pi}{N}),\sin(\frac{\pi}{N}),1)$ . For aesthetic reasons, we choose $N=9$ and $r=0.05$ in most illustrations.

Let us first compare MCP, MRP and PAM without going into too much technical detail.

Example 10 (benchmark versus PAM).

In Figure 1, we see at a glance how the strategies behind MCP, MRP and PAM impact their behavior and performance when applied to the toy model. The fixed order of projections in MCP can result in significant underperformance, while the random order of the projections in MRP guarantees that the average of the achievable progress is realized.

This motivates us to try to outperform the average by assigning, in the new method PAM, a high probability to transitions that performed better than average in previous iterations. The toy problem suggests that this is not a bad idea when the matrix $D^{0}$ and the policy $\varphi$ are chosen well for the problem at hand. For this showcase, we used $D^{0}$ and $\varphi$ as in Example 11a), and carried out 315 iterations with each method.

In Example 11, we examine how a good performance of PAM can be achieved by a proper scaling of the initial matrix $D^{0}$ . Please note that the intention of this example is not to discuss the size of numerical errors, but rather the qualitative behavior of PAM. The matrices and the iteration numbers are chosen in such a way that these characteristics become clearly visible.

Example 11 (scaling $D^{0}$ ).

i) In Figure 2, we apply PAM with $\varphi_{\min}$ , $\beta=0.01$ and initial matrix $D^{0}\in\mathbbm{R}^{N\times N}_{\geq 0}$ given by

[TABLE]

While the entries of the matrix $D^{k}$ are large compared to the actual step-sizes of the algorithm, we see a more or less uniform sampling of the transitions, similar to the behavior of the superior benchmark method MRP. Once the sizes of the entries of the matrix $D^{k}$ are similar to the sizes of the steps carried out, PAM has learned the geometry of the problem and focusses with high probability on profitable transitions, which allows it to outperform MRP.

ii) In Figure 3, we apply PAM with $\varphi_{\min}$ , $\beta=0.01$ and initial matrix $D^{0}\in\mathbbm{R}^{N\times N}_{\geq 0}$ given by

[TABLE]

so the entries of the matrices $D^{k}$ underestimate the actual step-sizes in the initial phase of the algorithm. This leads to unpredictable qualitative behaviour of PAM and incomplete exploration of the admissible transitions, and in many cases to an underperformance relative to MRP. Once the sizes of the entries of the matrix $D^{k}$ are similar to the sizes of the steps carried out, the qualitative behavior will be as in part i) above.

The effect of choosing a sparse $D^{0}$ is not surprising.

Example 12 (sparse $D^{0}$ ).

We apply PAM to the model problem with $\varphi_{\min}$ , $\beta=0.01$ and the matrix $D^{\rightarrow}$ from Example 2 with parameters $\omega=2,4,6$ , and obtain the results shown in Figure 4 after 432 iterations. Note that the matrix $D^{\rightarrow}$ overestimates the first step-lengths of the algorithm and therefore needs no scaling.

The algorithm behaves exactly as expected: After an initial learning phase, PAM focusses on the most profitable admissible transitions. A small bandwidth $\omega$ results in a shorter initial learning phase, but only a small gain in long-term performance as compared to MCP. On the other hand, a large $\omega$ results in a longer learning phase with a sizable long-term gain in performance.

Our toy model is not sophisticated enough to reveal a significant difference between the behavior induced by different policies $\varphi$ . We can, however, observe how the choice of the bandwidth of the matrix $D^{\rightarrow}$ from Example 2 impacts the performance of the method in this particular example.

Example 13 (first quantitative tests in toy example).

We apply MCP, MRP and PAM with initial matrix $D^{\rightarrow}$ from Example 2 and three different choices of the bandwidth $\omega$ to our toy problem. There are a few interesting features of the results displayed in Figure 5 we wish to summarize:

i) The initial learning phase in which PAM explores the geometry of the problem is clearly visible in the error plot.

ii) When $\omega=N$ , i.e. when every transition from set $C_{m}$ to set $C_{n}$ with $m\neq n$ is admissible, PAM never performed worse than MRP.

iii) The harder the problem is to solve for MCP and MRP (in this example this is the case when $r>0$ is small), the more clearly PAM (with large $\omega$ ) outperforms both methods.

6 Conclusion

This paper introduces the idea of learning to the realm of algorithms for feasibility problems. The focus is on establishing a first feasible algorithm and proving its convergence for a range of admissible learning strategies. Since it was a major effort and achievement to quantify the speed of convergence for MCP and MRP in the setting of affine subspaces, it seems impossible to achieve something similar for PAM, which is, in a sense, path-dependent. For this reason, we believe that we completed the theoretical analysis of PAM in the present paper.

First experiments with PAM applied to computerized and seismic tomography data reveal that the performance of PAM varies between different types of problems. The choice of the matrix $D^{0}$ and the strategy $\varphi$ really seems to matter in a real-world context, which calls for a detailed computational investigation of the performance of PAM. As the numerical handling of these problems is a challenge in itself, and unrelated to the key issue of the present paper, we postpone a detailed exploration of this issue to future work.

Acknowledgement

The author thanks Matthew Tam for an introduction to the world of projection methods and support during the preparation of this paper.

Bibliography

[1] F. Arroyo, E. Arroyo, X. Li, and J. Zhu. The convergence of the block cyclic projection with an overrelaxation parameter for compressed sensing based tomography. J. Comput. Appl. Math., 280:59–67, 2015.
[2] Z. Bai and W. Wu. On greedy randomized Kaczmarz method for solving large sparse linear systems. SIAM J. Sci. Comput., 40(1):A592–A606, 2018.
[3] H.H. Bauschke and J.M. Borwein. On projection algorithms for solving convex feasibility problems. SIAM Rev., 38(3):367–426, 1996.
[4] H.H. Bauschke and P.L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011.
[5] H.H. Bauschke, F. Deutsch, H. Hundal, and S.-H. Park. Accelerating the convergence of the method of alternating projections. Trans. Amer. Math. Soc., 355(9):3433–3461, 2003.
[6] J.M. Borwein, G. Li, and L. Yao. Analysis of the convergence rate for the cyclic projection algorithm applied to basic semialgebraic convex sets. SIAM J. Optim., 24(1):498–527, 2014.
[7] L.M. Brègman. Finding the common point of convex sets by the method of successive projection. Dokl. Akad. Nauk SSSR, 162:487–490, 1965.
[8] Y. Censor. Row-action methods for huge and sparse systems and their applications. SIAM Rev., 23(4):444–466, 1981.
