Residual Expansion Algorithm: Fast and Effective Optimization for   Nonconvex Least Squares Problems

Daiki Ikami; Toshihiko Yamasaki; Kiyoharu Aizawa

arXiv:1705.09549·cs.CV·May 29, 2017

Residual Expansion Algorithm: Fast and Effective Optimization for Nonconvex Least Squares Problems

Daiki Ikami, Toshihiko Yamasaki, Kiyoharu Aizawa

PDF

Open Access

TL;DR

The Residual Expansion (RE) algorithm offers a fast, effective, and easy-to-implement method for near-global optimization in nonconvex least squares problems, outperforming traditional stochastic approaches.

Contribution

It introduces the RE algorithm, a novel deterministic optimization technique that achieves fast, near-global solutions for nonconvex least squares problems, suitable for high-dimensional tasks.

Findings

01

Excellent empirical performance in k-means clustering

02

Effective in point-set registration and image deblurring

03

Outperforms stochastic methods in speed and accuracy

Abstract

We propose the residual expansion (RE) algorithm: a global (or near-global) optimization method for nonconvex least squares problems. Unlike most existing nonconvex optimization techniques, the RE algorithm is not based on either stochastic or multi-point searches; therefore, it can achieve fast global optimization. Moreover, the RE algorithm is easy to implement and successful in high-dimensional optimization. The RE algorithm exhibits excellent empirical performance in terms of k-means clustering, point-set registration, optimized product quantization, and blind image deblurring.

Tables11

Table 1. Table 1: Clustering results on synthetic data. Mean relative errors of the RE algorithm with different μ ( 0 ) superscript 𝜇 0 \mu^{(0)} and N 𝑁 N are reported.

	$μ^{(0)} = 0.5$	$μ^{(0)} = 0.2$	$μ^{(0)} = 0.1$	$μ^{(0)} = 0.01$
$T = 30$	0.905	0.894	0.902	0.921
$T = 100$	0.854	0.856	0.862	0.876
$T = 300$	0.843	0.844	0.846	0.854
$T = 1, 000$	0.837	0.839	0.840	0.843

Table 2. (a) Synthetic data A with k = 100 𝑘 100 k=100 .

	$μ^{(0)} = 0.5$	$μ^{(0)} = 0.2$	$μ^{(0)} = 0.1$	$μ^{(0)} = 0.01$
$T = 30$	0.905	0.894	0.902	0.921
$T = 100$	0.854	0.856	0.862	0.876
$T = 300$	0.843	0.844	0.846	0.854
$T = 1, 000$	0.837	0.839	0.840	0.843

Table 3. (b) Synthetic data B with k = 10 𝑘 10 k=10 .

	$μ^{(0)} = 0.5$	$μ^{(0)} = 0.2$	$μ^{(0)} = 0.1$	$μ^{(0)} = 0.01$
$T = 30$	2.699	1.758	1.493	0.998
$T = 100$	2.784	1.209	0.789	0.630
$T = 300$	2.722	1.036	0.708	0.552
$T = 1, 000$	2.749	0.994	0.630	0.552

Table 4. Table 2: Clustering results on synthetic data. The mean / min / max relative errors and the average elapsed time are reported.

		Relative error			Elapsed time [sec]
		Mean	Min	Max	Elapsed time [sec]
Random seeding		1.246	1.102	1.473	0.058
k-means++ [2]		1.000	0.944	1.081	0.244
Hartigan’s algorithm [10]		0.925	0.881	0.981	0.359
RE algorithm ( $μ^{(0)} = 0.1$ )	$T = 30$	0.902	0.875	0.942	0.258
	$T = 100$	0.862	0.846	0.873	0.780
	$T = 300$	0.846	0.836	0.856	2.29
	$T = 1, 000$	0.840	0.831	0.850	7.61

Table 5. (a) Synthetic data A with k = 100 𝑘 100 k=100 .

		Relative error			Elapsed time [sec]
		Mean	Min	Max	Elapsed time [sec]
Random seeding		1.246	1.102	1.473	0.058
k-means++ [2]		1.000	0.944	1.081	0.244
Hartigan’s algorithm [10]		0.925	0.881	0.981	0.359
RE algorithm ( $μ^{(0)} = 0.1$ )	$T = 30$	0.902	0.875	0.942	0.258
	$T = 100$	0.862	0.846	0.873	0.780
	$T = 300$	0.846	0.836	0.856	2.29
	$T = 1, 000$	0.840	0.831	0.850	7.61

Table 6. (b) Synthetic data B with k = 10 𝑘 10 k=10 .

		relative error			elapsed time [sec]
		mean	min	max	elapsed time [sec]
Random seeding		4.277	0.552	22.680	0.0194
k-means++ [2]		1.000	0.552	8.743	0.0271
Hartigan’s algorithm [10]		1.000	0.552	8.743	0.0429
RE algorithm ( $μ^{(0)} = 0.1$ )	$T = 30$	1.493	0.552	6.777	0.0699
	$T = 100$	0.789	0.552	2.500	0.176
	$T = 300$	0.708	0.552	2.500	0.473
	$T = 1, 000$	0.630	0.552	2.500	1.51

Table 7. Table 3: Clustering results on real data. The mean / min / max of the relative error and the average elapsed time are reported.

		Relative error			Elapsed time [sec]
		Mean	Min	Max	Elapsed time [sec]
Random seeding		1.255	1.003	1.438	0.0444
k-means++ [2]		1.000	0.920	1.097	0.0395
Hartigan’s algorithm [10]		0.994	0.920	1.093	0.0593
RE algorithm ( $μ^{(0)} = 0.1$ )	$T = 30$	0.980	0.920	1.031	0.0719
	$T = 100$	0.941	0.920	0.986	0.183
	$T = 300$	0.926	0.920	0.983	0.516
	$T = 1, 000$	0.920	0.920	0.921	1.63

Table 8. (a) Cloud dataset ( 𝐗 ∈ 𝐑 10 × 1024 𝐗 superscript 𝐑 10 1024 \boldsymbol{\mathbf{X}}\in\mathbf{R}^{10\times 1024} ) with k = 10 𝑘 10 k=10 .

		Relative error			Elapsed time [sec]
		Mean	Min	Max	Elapsed time [sec]
Random seeding		1.255	1.003	1.438	0.0444
k-means++ [2]		1.000	0.920	1.097	0.0395
Hartigan’s algorithm [10]		0.994	0.920	1.093	0.0593
RE algorithm ( $μ^{(0)} = 0.1$ )	$T = 30$	0.980	0.920	1.031	0.0719
	$T = 100$	0.941	0.920	0.986	0.183
	$T = 300$	0.926	0.920	0.983	0.516
	$T = 1, 000$	0.920	0.920	0.921	1.63

Table 9. (b) COIL20 dataset ( 𝐗 ∈ 𝐑 1024 × 1440 𝐗 superscript 𝐑 1024 1440 \boldsymbol{\mathbf{X}}\in\mathbf{R}^{1024\times 1440} ) with k = 20 𝑘 20 k=20 .

		Relative error			Elapsed time [sec]
		Mean	Min	Max	Elapsed time [sec]
Random seeding		0.999	0.953	1.076	2.22
k-means++ [2]		1.000	0.962	1.038	3.52
Hartigan’s algorithm [10]		0.990	0.960	1.021	8.07
RE algorithm ( $μ^{(0)} = 0.1$ )	$T = 30$	0.951	0.939	0.977	4.37
	$T = 100$	0.945	0.938	0.960	12.6
	$T = 300$	0.942	0.938	0.950	36.6
	$T = 1, 000$	0.941	0.938	0.956	119

Table 10. Table 4: Point set registration results. We reported the number of successes with each different rotation angle. We also reported the average elapsed time for all 150 point sets.

		Number of successes			Number of iterations
		$ϕ = π / 3$	$ϕ = 5 π / 12$	$ϕ = π / 2$	Number of iterations
ICP algorithm		26	4	1	46.7
RE algorithm ( $μ^{(0)} = 0.1$ )	$T = 30$	46	25	2	47.4
	$T = 100$	49	31	5	110.1
	$T = 300$	49	33	6	310.2
	$T = 1, 000$	49	36	6	1008

Table 11. Table 5: Deblurring results. We reported PSNR values [dB] for each method. The best and second-best results are highlighted in bold and italics, respectively.

Image	#1				#2				#3				#4				#5
Kernel	#1	#2	#3	#4	#1	#2	#3	#4	#1	#2	#3	#4	#1	#2	#3	#4	#1	#2	#3	#4
Pan et al. [23]	13.6	13.2	20.3	13.1	14.9	16.5	14.1	11.2	11.8	20.6	19.4	10.0	13.7	11.8	19.1	11.9	19.6	19.2	16.8	14.1
Alg. 1	20.2	20.2	21.3	16.4	17.0	16.9	15.8	7.4	20.9	18.8	20.4	6.5	23.6	20.4	19.7	12.5	21.3	18.2	17.5	9.4
Alg. 2	20.2	19.6	20.5	15.7	17.2	17.1	15.8	8.4	19.5	21.5	20.7	14.2	23.6	20.5	19.1	11.9	21.9	17.7	18.6	15.2

Equations52

C, Z min \frac{1}{2} ∥ X - C Z ∥_{F}^{2}

C, Z min \frac{1}{2} ∥ X - C Z ∥_{F}^{2}

\mbox s . t . z_{ij} = {0, 1}, ∥ z_{i} ∥_{1} = 1,

R, t, Z min \frac{1}{2} ∥ R X + t 1^{⊤} - Y Z ∥_{F}^{2}

R, t, Z min \frac{1}{2} ∥ R X + t 1^{⊤} - Y Z ∥_{F}^{2}

\mbox s . t . z_{ij} = {0, 1}, ∥ z_{i} ∥_{1} = 1, R^{⊤} R = I,

\displaystyle\min_{\boldsymbol{\mathbf{R}},\boldsymbol{\mathbf{C}},\boldsymbol{\mathbf{Z}}}\frac{1}{2}\sum_{i=1}^{N}\left\lVert\boldsymbol{\mathbf{x}}_{i}-\boldsymbol{\mathbf{R}}\left[\begin{array}[]{c}\boldsymbol{\mathbf{C}}^{(1)}\boldsymbol{\mathbf{z}}_{i}^{(1)}\\ \vdots\\ \boldsymbol{\mathbf{C}}^{(M)}\boldsymbol{\mathbf{z}}_{i}^{(M)}\end{array}\right]\right\rVert_{2}^{2}

\displaystyle\min_{\boldsymbol{\mathbf{R}},\boldsymbol{\mathbf{C}},\boldsymbol{\mathbf{Z}}}\frac{1}{2}\sum_{i=1}^{N}\left\lVert\boldsymbol{\mathbf{x}}_{i}-\boldsymbol{\mathbf{R}}\left[\begin{array}[]{c}\boldsymbol{\mathbf{C}}^{(1)}\boldsymbol{\mathbf{z}}_{i}^{(1)}\\ \vdots\\ \boldsymbol{\mathbf{C}}^{(M)}\boldsymbol{\mathbf{z}}_{i}^{(M)}\end{array}\right]\right\rVert_{2}^{2}

\mbox s . t . z_{ij}^{(m)} = {0, 1}, z_{i}^{(m)}_{1} = 1, R^{⊤} R = I,

I, k min \frac{1}{2} ∥ I \otimes k - B ∥_{F}^{2} + γ_{I} R_{I} (I) + γ_{k} R_{k} (k),

I, k min \frac{1}{2} ∥ I \otimes k - B ∥_{F}^{2} + γ_{I} R_{I} (I) + γ_{k} R_{k} (k),

E (\uptheta) = \frac{1}{2} ∥ y - f (\uptheta) ∥_{2}^{2} .

E (\uptheta) = \frac{1}{2} ∥ y - f (\uptheta) ∥_{2}^{2} .

E_{α} (\uptheta) = \frac{1}{2} ∥ \hat{y} - f (\uptheta) ∥_{2}^{2} .

E_{α} (\uptheta) = \frac{1}{2} ∥ \hat{y} - f (\uptheta) ∥_{2}^{2} .

\hat{y} = y + α (y - f (\uptheta^{*})),

\hat{y} = y + α (y - f (\uptheta^{*})),

\nabla E_{α} (\uptheta^{*}) = (1 + α) J^{⊤} (\uptheta^{*}) (y - f (\uptheta^{*})) = 0,

\nabla E_{α} (\uptheta^{*}) = (1 + α) J^{⊤} (\uptheta^{*}) (y - f (\uptheta^{*})) = 0,

\nabla^{2} E_{α} (\uptheta^{*}) = J^{⊤} (\uptheta^{*}) J (\uptheta^{*}) + (1 + α) S (\uptheta^{*}) .

S (\uptheta^{*}) = i \sum \nabla^{2} f_{i} (\uptheta^{*}) (y_{i} - f_{i} (\uptheta^{*})) .

S (\uptheta^{*}) = i \sum \nabla^{2} f_{i} (\uptheta^{*}) (y_{i} - f_{i} (\uptheta^{*})) .

E (θ) = \frac{1}{2} ((y_{1} - θ^{2})^{2} + (y_{2} - θ)^{2}) .

E (θ) = \frac{1}{2} ((y_{1} - θ^{2})^{2} + (y_{2} - θ)^{2}) .

E_{t} (\uptheta) = \frac{1}{2} ∥ \hat{y}^{(t)} - f (\uptheta) ∥_{2}^{2},

E_{t} (\uptheta) = \frac{1}{2} ∥ \hat{y}^{(t)} - f (\uptheta) ∥_{2}^{2},

\hat{y}^{(t)}

\hat{y}^{(t)}

r^{(t)}

r^{(t + 1)}_{2}^{2}

r^{(t + 1)}_{2}^{2}

z, \uptheta min \frac{1}{2} ∥ z ∥_{2}^{2}

z, \uptheta min \frac{1}{2} ∥ z ∥_{2}^{2}

z = y - f (\uptheta) .

L_{t} (\uptheta, z, \uplambda) = \frac{1}{2} ∥ z ∥_{2}^{2} + \uplambda^{⊤} (z - y + f (\uptheta)) + \frac{μ ^{(t)}}{2} ∥ z - y + f (\uptheta) ∥_{2}^{2} .

L_{t} (\uptheta, z, \uplambda) = \frac{1}{2} ∥ z ∥_{2}^{2} + \uplambda^{⊤} (z - y + f (\uptheta)) + \frac{μ ^{(t)}}{2} ∥ z - y + f (\uptheta) ∥_{2}^{2} .

\uptheta^{(t + 1)} = \uptheta arg min L_{t} (\uptheta, z^{(t)}, \uplambda^{(t)})

\uptheta^{(t + 1)} = \uptheta arg min L_{t} (\uptheta, z^{(t)}, \uplambda^{(t)})

z^{(t + 1)} = z arg min L_{t} (\uptheta^{(t + 1)}, z, \uplambda^{(t)})

λ^{(t + 1)} = λ^{(t)} + μ^{(t)} (z^{(t + 1)} - y + f (\uptheta)^{(t + 1)})

\uptheta^{(t + 1)} = \uptheta arg min \frac{μ ^{(t)}}{2} y + (\frac{1 - μ ^{(t)}}{μ ^{(t)}}) z^{(t)} - f (\uptheta)_{2}^{2},

\uptheta^{(t + 1)} = \uptheta arg min \frac{μ ^{(t)}}{2} y + (\frac{1 - μ ^{(t)}}{μ ^{(t)}}) z^{(t)} - f (\uptheta)_{2}^{2},

z^{(t + 1)} = (\frac{1}{1 + μ ^{(t)}}) z^{(t)} + (\frac{μ ^{(t)}}{1 + μ ^{(t)}}) (y - f (\uptheta^{(t + 1)})) .

α^{(t)} = (1 - μ^{(t)}) / μ^{(t)},

α^{(t)} = (1 - μ^{(t)}) / μ^{(t)},

p^{(t)} = μ^{(t)} / (1 + μ^{(t)}) .

E (\uptheta) = \frac{1}{2} ∥ y - f (\uptheta) ∥_{2}^{2} + γ R (\uptheta) .

E (\uptheta) = \frac{1}{2} ∥ y - f (\uptheta) ∥_{2}^{2} + γ R (\uptheta) .

E_{t} (\uptheta) = \frac{1}{2} ∥ \hat{y}^{(t)} - f (\uptheta) ∥_{2}^{2} + γ R (\uptheta) .

E_{t} (\uptheta) = \frac{1}{2} ∥ \hat{y}^{(t)} - f (\uptheta) ∥_{2}^{2} + γ R (\uptheta) .

E_{t} (\uptheta) = \frac{μ ^{(t)}}{2} ∥ \hat{y}^{(t)} - f (\uptheta) ∥_{2}^{2} + γ R (\uptheta) .

E_{t} (\uptheta) = \frac{μ ^{(t)}}{2} ∥ \hat{y}^{(t)} - f (\uptheta) ∥_{2}^{2} + γ R (\uptheta) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSparse and Compressive Sensing Techniques · Advanced Image and Video Retrieval Techniques · Remote-Sensing Image Classification

Full text

Residual Expansion Algorithm: Fast and Effective Optimization for

Nonconvex Least Squares Problems

Daiki Ikami Toshihiko Yamasaki Kiyoharu Aizawa

The University of Tokyo, Japan

{ikami, yamasaki, aizawa}@hal.t.u-tokyo.ac.jp

Abstract

We propose the residual expansion (RE) algorithm: a global (or near-global) optimization method for nonconvex least squares problems. Unlike most existing nonconvex optimization techniques, the RE algorithm is not based on either stochastic or multi-point searches; therefore, it can achieve fast global optimization. Moreover, the RE algorithm is easy to implement and successful in high-dimensional optimization. The RE algorithm exhibits excellent empirical performance in terms of k-means clustering, point-set registration, optimized product quantization, and blind image deblurring.

1 Introduction

Many problems in computer vision and machine learning can be formulated as optimization problems. If we can formulate a problem as a convex optimization, we can solve it by convex optimization techniques such as gradient-based methods. However, most optimization problems are nonconvex and often have many local minima. In these cases, convex optimization techniques can find only local minima.

Global optimization of nonconvex problems is an NP-hard problem in most cases. Therefore, heuristic methods are often used to find a global (or near-global) optimum. There are two major approaches: good initialization and stochastic optimization. The former is fast and effective if we can obtain a good initial guess [2]; however, many optimization problems do not provide this. The latter is random search or multiple-point search, which is represented by simulated annealing (SA) [13], particle swarm optimization (PSO) [12], and genetic algorithms (GA) [19]. Although these methods are effective with low-dimensional optimization problems, it is difficult to obtain good solutions with high-dimensional ones. Moreover, these approaches often require excessive computation time to obtain a good solution.

In this paper, we propose a fast and effective optimization method for nonconvex least squares (LS) problems such as k-means clustering and point-set registration. First, we propose a novel measure of convergence called RE convergence: this represents how far we can expand data points along their residual directions under convergence. Fig. 1 shows k-means results and expanded data points. Fig. 1(a) depicts convergence on expanded data while Fig. 1(b) shows a case that is not converged. We presume that RE convergence is associated with global convergence. In fact, we can prove that the solution that is stable on a large expansion is the global optimum in the case of a one-dimensional quartic minimization problem.

Additionally, we propose a heuristic algorithm to find a solution that is stable on the large expansion, which we term the residual expansion (RE) algorithm. This algorithm is based on neither multiple-point search nor random search, and thus fast computation can be achieved.

Our contribution is as follows:

We propose a novel concept of convergence, RE convergence. We show the relationship between RE convergence and the global optimum. 2. 2.

We propose the RE algorithm, which can be applied for any nonconvex LS problem. We show that the RE algorithm is fast, effective, and easy to implement. 3. 3.

We show the RE algorithm’s excellent performance for various nonconvex LS problems such as k-means clustering, point-set registration, optimized product quantization, and blind image deblurring.

2 Related works

2.1 Nonconvex least squares problems

We focus on nonconvex LS problems, of which many exist. In this paper, we study the following four important problems in computer vision and machine learning.

2.1.1 K-means clustering

K-means clustering is one of the most popular clustering methods with various applications such as quantization [11], feature learning [8], and segmentation [1]. K-means clustering assigns data vectors $\boldsymbol{\mathbf{x}}_{1},\ldots,\boldsymbol{\mathbf{x}}_{n}\in\mathbb{R}^{d}$ to the nearest representative clusters. The optimization problem can be formulated as

[TABLE]

where $\boldsymbol{\mathbf{X}}=[\boldsymbol{\mathbf{x}}_{1},\ldots,\boldsymbol{\mathbf{x}}_{n}]\in\mathbb{R}^{d\times n}$ is a data matrix, $\boldsymbol{\mathbf{C}}=[\boldsymbol{\mathbf{c}}_{1},\ldots,\boldsymbol{\mathbf{c}}_{k}]\in\mathbb{R}^{d\times k}$ is a matrix of cluster centroids, and $\boldsymbol{\mathbf{Z}}=[\boldsymbol{\mathbf{z}}_{1},\ldots,\boldsymbol{\mathbf{z}}_{n}]\in\mathbb{R}^{k\times n}$ is an assignment matrix.

The most popular optimization method is Lloyd’s algorithm [17], which has an update step (fix $\boldsymbol{\mathbf{Z}}$ and update $\boldsymbol{\mathbf{C}}$ ) and an assignment step (fix $\boldsymbol{\mathbf{C}}$ and update $\boldsymbol{\mathbf{Z}}$ ). Hartigan’s algorithm [10] achieves better clustering than Lloyd’s algorithm because the set of local minima of Hartigan’s algorithm is a subset of those of Lloyd’s algorithm [26, 25]. For good initialization, k-means++ [2] is often used because of its efficiency and effectiveness.

2.1.2 Point-set registration

Point-set registration is a fundamental problem in computer vision. Here we consider a rigid 3D-point-set registration problem: Given source point sets $\boldsymbol{\mathbf{X}}=[\boldsymbol{\mathbf{x}}_{1},\dots,\boldsymbol{\mathbf{x}}_{n}]\in\mathbb{R}^{3\times n}$ and target point sets $\boldsymbol{\mathbf{Y}}=[\boldsymbol{\mathbf{y}}_{1},\ldots,\boldsymbol{\mathbf{y}}_{m}]\in\mathbb{R}^{3\times m}$ , we estimate the best rigid transformation parameters.

In this paper, we consider the following optimization problem with a point-to-point cost function:

[TABLE]

where $\boldsymbol{\mathbf{R}}\in\mbox{SO}(3)$ is a rotation matrix, $\boldsymbol{\mathbf{t}}\in\mathbb{R}^{3}$ is a translation vector, and $\boldsymbol{\mathbf{Z}}=[\boldsymbol{\mathbf{z}}_{1},\ldots,\boldsymbol{\mathbf{z}}_{n}]\in\mathbb{R}^{k\times n}$ is an assignment matrix. $\boldsymbol{\mathbf{I}}$ is an identity matrix and $\boldsymbol{\mathbf{1}}$ is a vector of all ones.

The iterative closest point (ICP) algorithm [3] is a well-known alternating optimization method: it fixes $\boldsymbol{\mathbf{Z}}$ and updates $\boldsymbol{\mathbf{R}}$ , $\boldsymbol{\mathbf{t}}$ , and then fixes $\boldsymbol{\mathbf{R}},\boldsymbol{\mathbf{t}}$ and updates $\boldsymbol{\mathbf{Z}}$ . To obtain a global minimum, some studies adopt stochastic optimization, such as GA [24], PSO [27], and SA [18]. Recently, Yang et al. proposed Go-ICP [29], which guarantees global optimality by using the branch-and-bound algorithm. However, it requires significant computation time.

2.1.3 Optimized product quantization

Optimized product quantization (OPQ) [9, 22], which is an extension of product quantization (PQ), is an efficient fast approximate nearest neighbor search method. The optimization problem in OPQ is described by

[TABLE]

where $\boldsymbol{\mathbf{X}},\boldsymbol{\mathbf{C}},\boldsymbol{\mathbf{Z}}$ have the same meaning as in Section 2.1.1 and $\boldsymbol{\mathbf{R}}$ is a rotation matrix.

The optimization problem of Eq. (6) can be solved by alternating optimization of $\boldsymbol{\mathbf{R}}$ , $\boldsymbol{\mathbf{C}}$ , and $\boldsymbol{\mathbf{Z}}$ [9, 22]. Ge et al. also proposed a parametric optimization method that assumes the data follows a parametric Gaussian distribution [9].

2.1.4 Blind image deblurring

Blind image deblurring has long been a challenging problem in computer vision. From a blurred image $\boldsymbol{\mathbf{B}}\in\mathbb{R}^{h\times w}$ , we estimate an original image $\boldsymbol{\mathbf{I}}\in\mathbb{R}^{h\times w}$ and blur kernel $\boldsymbol{\mathbf{k}}\in\mathbb{R}^{k\times k}$ by minimizing the following cost function:

[TABLE]

where $R_{I}(\boldsymbol{\mathbf{I}})$ and $R_{k}(\boldsymbol{\mathbf{k}})$ are the regularization terms, and $\otimes$ denotes the convolution operator. For $R_{I}(\boldsymbol{\mathbf{I}})$ , L0-norm (or approximately L0-norm) [28, 23], or L1/L2 functions [15] are proposed to enforce the sharp edges of the original image. For $R_{k}(\boldsymbol{\mathbf{k}})$ , L2-norm [28, 23] or L1-norm [15] are often used. We refer to the paper [16] for a recent comparative study of blind image deblurring.

We can minimize Eq. (7) by alternating optimization of $\boldsymbol{\mathbf{I}}$ and $\boldsymbol{\mathbf{k}}$ . For fast and effective optimization, a coarse-to-fine strategy [7, 15, 28, 23] is generally employed.

2.2 Nonconvex optimization techniques

Most nonconvex optimization techniques are based on stochastic optimization, including GA [19], PSO [12], and SA [13]. These methods generally do not work well or require significant computation time for high-dimensional optimization problems. Several studies [4, 14, 18, 27, 24] have employed these nonconvex optimization techniques to our target problems described in Section 2.1; however, these methods are not often used in practice.

Our approach is related to graduated nonconvexity (GNC) [5], which first solves a simplified optimization problem and then gradually transforms the problem into the original nonconvex problem. The basic concept of graduated optimization methods is extinguishing local minima by using a convexified objective function, and then gradually changing the objective function to the original function. We refer readers to [20] for a recent survey of graduated optimization. In contrast to GNC, our approach is explicitly to escape from poor local minima by using a largely expanded objective function and then gradually transforming it into the original function, as described in Section 4.

3 Residual expansion convergence

First, we describe RE convergence, which indicates how we can expand data along their residual directions. RE convergence is a measure of the depth of convergence, and our proposed algorithm is based on this concept. We show a relationship between the global optimum and RE convergence.

We discuss a nonconvex least squares (LS) optimization problem whose objective function is given by

[TABLE]

Our definitions are as follows.

Definition 3.1 (Residual Expansion).

Let $\boldsymbol{\mathbf{\uptheta}}^{*}$ be a local minimum point of Eq. (8). We define the $\alpha$ -expanded objective function $E_{\alpha}(\boldsymbol{\mathbf{\uptheta}})$ :

[TABLE]

where $\hat{\boldsymbol{\mathbf{y}}}$ is constructed by expanding $\boldsymbol{\mathbf{y}}$ in the residual direction with a magnitude of $\alpha$ as

[TABLE]

We call the operation that constructs the $\alpha$ -expanded objective function residual expansion (with $\alpha$ ).

Definition 3.2 ( $\alpha$ RE convergence).

$\boldsymbol{\mathbf{\uptheta}}^{*}$ * is called $\alpha$ RE convergence if there exists a constant $\alpha\geq 0$ such that $\boldsymbol{\mathbf{\uptheta}}$ is still a local optimum of $E_{\alpha}(\boldsymbol{\mathbf{\uptheta}})$ . In particular, the maximum (or supremum) constant is called the RE constant111If $\boldsymbol{\mathbf{\uptheta}}^{*}$ is always a local minimum of $E_{\alpha}(\boldsymbol{\mathbf{\uptheta}}^{*})$ under all $\alpha\geq 0$ , we define the RE constant as $\infty$ . .*

Our hypothesis is that a solution with a larger $\alpha$ -RE constant is closer to the global optimum solution. This hypothesis holds in the case of quartic minimization, as presented in section 3.1.1.

3.1 Unconstrained and differentiable problems

We consider one of the simplest cases: unconstrained and differentiable LS problems. Given a local optimum $\boldsymbol{\mathbf{\uptheta}}^{*}$ , we can obtain first- and second-order derivatives of the $\alpha$ -expanded objective function $E_{\alpha}\left(\boldsymbol{\mathbf{\uptheta}}\right)$ at $\boldsymbol{\mathbf{\uptheta^{*}}}$ as

[TABLE]

where $J$ is a Jacobian matrix and $\boldsymbol{\mathbf{S}}\left(\boldsymbol{\mathbf{\uptheta}}^{*}\right)$ is

[TABLE]

Eq. (11) means that $\boldsymbol{\mathbf{\uptheta}}^{*}$ is always a stationary point of $E_{\alpha}\left(\boldsymbol{\mathbf{\uptheta}}\right)$ . Therefore, $\boldsymbol{\mathbf{\uptheta}}^{*}$ is a local minimum of $E_{\alpha}\left(\boldsymbol{\mathbf{\uptheta}}\right)$ if and only if $\nabla^{2}E_{\alpha}\left(\boldsymbol{\mathbf{\uptheta}}\right)$ is a positive semi-definite (PSD) matrix. If $\boldsymbol{\mathbf{S}}$ is not a PSD matrix, there is a $\alpha\geq 0$ which satisfies the fact that $\nabla^{2}E_{\alpha}\left(\boldsymbol{\mathbf{\uptheta}}\right)$ is not a PSD matrix.

Fig. 2 shows examples of $\alpha$ -expanded objective functions. Residual expansion elevates the objective function around $\uptheta^{*}$ , and if $\alpha$ is sufficiently large then it ceases to be a local minimum.

One-dimensional quartic minimization: Here we consider a quartic minimization problem—in particular, one that can be formulated as an LS problem:

[TABLE]

We consider the case where Eq. (14) has two local minima $\theta_{1}$ and $\theta_{2}$ . The following theorem then holds:

Theorem 1.

Let $\theta_{1}$ and $\theta_{2}$ be local minima points of Eq. (14) with RE constants of $\alpha_{1}$ and $\alpha_{2}$ , respectively. $\theta_{1}$ is the global minimum point if $\alpha_{1}>\alpha_{2}$ and $\theta_{2}$ is the global minimum point otherwise.

Proof.

Please refer to the supplementary materials. ∎

3.2 General relationship between the $\alpha$ RE convergence and the global optimum

It is not obvious when our hypothesis, i.e., that a solution with a larger RE constant is closer to the global optimum, is valid. Unfortunately, we can easily find a counterexample in k-means clustering, as shown in Fig. 3. However, our algorithm, which aims to find a solution with a large RE constant, works well from an empirical perspective in many nonconvex LS problems.

4 Residual expansion algorithm

In this section, we propose the RE algorithm, which aims to find a solution with a large RE constant. Since it is difficult to find the solution with the largest RE constant exactly, we employ a heuristic strategy.

The RE algorithm has two steps: parameter updating and residual expansion. We show an intuitive illustration of the algorithm in Fig. 4. For the residual expansion step, we expand data along their residual direction. This results in elevating the objective function around the current solution as in Fig. 2. For the parameter-updating step, instead of minimizing the original function Eq. (8), we minimize the following expanded objective function in each iteration:

[TABLE]

where $\hat{\boldsymbol{\mathbf{y}}}^{(t)}$ is an expanded data vector:

[TABLE]

where $\alpha$ and $0<p\leq 1$ are expansion and momentum parameters, respectively. Note that, if $p=1$ , Eq. (15) is an exactly $\alpha^{(t)}$ -expanded objective function on $\boldsymbol{\mathbf{\uptheta}}^{(t)}$ . The momentum parameter is important for achieving good performance and ensuring that the RE algorithm does not to diverge, as described later.

The RE algorithm iterates through parameter updating by minimizing Eq. (15) and residual expansions by Eq. (16) and Eq. (17). We use a large $\alpha^{(0)}$ initially to find a solution with a large RE constant. Then we decrease $\alpha^{(t)}$ gradually to achieve convergence, analogous to a temperature parameter in SA. We summarize the RE algorithm in Alg. 1.

4.1 Parameter setting for convergence

The RE algorithm has two parameters, $\alpha$ and $p$ , for each iteration. We decrease $\alpha^{(t)}$ to 0 for convergence. when $\alpha=0$ , there is no residual expansion and RE algorithm is guaranteed to converge if the original LS problem has a convergence-guaranteed algorithm. However, inadequate parameters of $\alpha$ and $p$ cause unstable optimization. We consider the norm of $\boldsymbol{\mathbf{r}}^{(t+1)}$ . We can obtain

[TABLE]

We use $\boldsymbol{\mathbf{f}}\left(\boldsymbol{\mathbf{\uptheta}}^{(t+1)}\right)\sim\hat{\boldsymbol{\mathbf{y}}}^{(t)}$ for the last approximation of Eq. (18). Eq. (18) suggests $\left(1-p-\alpha p\right)^{2}\leq 1$ to make the RE algorithm stable. A good way to determine these values of $\alpha$ and $p$ is described in Section 5.

4.2 Advantages of the RE algorithm

Our algorithm consists of two steps of parameter updating and residual expansion. Parameter updating is based simply on a typical LS problem approach. Therefore, if there is a source code which minimizes Eq. (8), for example, by alternative optimization or gradient methods, then we can implement our algorithm by adding a residual expansion step to the existing code, which can be done in a few lines of code.

Moreover, the computational complexity of residual expansion is generally less than that of parameter updating. Therefore, we can achieve faster optimization than most nonconvex optimization techniques based on multi-point search or random search, such as SA and GA.

As described in Section 2.2, GNC is a similar approach to ours; however GNC often does not apply for LS problems. Our algorithm can be applied for any nonconvex LS problem provided that there is a method for finding a local optimum, such as Lloyd’s algorithm for k-means clustering and ICP algorithms for point-set registration.

5 Alternate direction method of multipliers for least squares problems

In this section, we apply the alternate direction method of multipliers (ADMM) [6] to solve Eq. (8). We show that ADMM is a special case of the RE algorithm for LS problems. Moreover, ADMM suggests a modified RE algorithm for regularized LS problems.

We introduce an auxiliary variable $\boldsymbol{\mathbf{z}}=\boldsymbol{\mathbf{y}}-\boldsymbol{\mathbf{f}}\left(\boldsymbol{\mathbf{\uptheta}}\right)$ and rewrite Eq. (8) as a constrained optimization problem:

[TABLE]

We can construct the augmented Lagrangian function of Eq. (19) as

[TABLE]

We take the alternating direction approach for solving Eq. (20) and then obtain update rules as

[TABLE]

5.1 Relation to the RE algorithm

We can simplify Eq. (21), Eq. (22) and Eq. (23) as

[TABLE]

Details of the derivation are described in the supplementary material. This is a special case of the RE algorithm of Eq. (16) and Eq. (17) with

[TABLE]

There are two main advantages to using ADMM. First, we can choose only $\mu$ instead of parameters $\alpha$ and $p$ in the general RE algorithm. Eq. (26) and Eq. (27) always satisfy $(1-p-\alpha p)^{2}<1$ , which is a condition necessary for avoiding divergence to infinity, as described in Section 4.1, and this update achieves good performance in experiments. Second, ADMM can treat regularized LS optimization problems, such as blind image deblurring (Eq. (7)). We will describe this in the next section.

5.2 Regularized least squares problems

We consider a regularized LS problem as follows:

[TABLE]

When we apply the RE algorithm in a straightforward manner, we can obtain the following objective function in each iteration:

[TABLE]

In the case of ADMM, from Eq. (24), the objective function is as follows:

[TABLE]

We can find that the difference between Eq. (29) and Eq. (30) is simply the coefficient of the squared term. We find that minimizing Eq. (30) achieves better performance than minimizing Eq. (29). We summarize the RE algorithm based on ADMM in Alg. 2.

6 Implementation details

We used the RE algorithm based on ADMM (Alg. 2) unless otherwise stated. In the update of $\boldsymbol{\mathbf{\uptheta}}$ (line 2 in Alg. 2), we perform alternating optimization with a single iteration; for example, with k-means clustering, the cluster centers and assignments are updated only once. The four problems we treat in this paper can be minimized by alternating optimization.

For the parameter $\mu$ , we adapt $\mu^{(t+1)}=\min(\rho\mu^{(t)},1)$ , where $\rho=\exp(-\log(\mu^{(0)})/T)$ to satisfy $\mu^{(T)}=1$ . Therefore, we only need to determine the two parameters $\mu^{(0)}$ and $T$ in our method.

7 Experimental results

We evaluate the performance of the RE algorithm on four nonconvex LS problems: k-means clustering, 3D point set registration, OPQ, and single blind image deblurring. All experiments were executed on an Intel Core i5-4200U CPU (1.60 GHz) with 8GB of RAM, and were implemented in MATLAB222Our codes will be available if the paper is accepted. except for Go-ICP [29]. For Go-ICP and its comparison experiment, we used the publicly available code333http://iitlab.bit.edu.cn/mcislab/~yangjiaolong/go-icp/ implemented in C++.

7.1 K-means clustering

We compared our method with k-means++ [2], which is a good initialization method, and Hartigan’s algorithm [10]. For Hartigan’s algorithm, we first used Lloyd’s algorithm [17] with k-means++ initialization for fast computation. We reported the total time of Lloyd’s algorithm and Hartigan’s algorithm. For the other method, we used Lloyd’s algorithm for optimization. We used random initialization for the RE algorithm. For error measurement, we used the objective function value of Eq. (1) and reported relative error from the value of k-means++ (therefore, the relative error of k-means++ is always 1).

7.1.1 Synthetic data experiments

We start with two synthetic datasets as shown in Fig. 5. We repeated each method 50 times from different initializations and report the average relative errors. Table 1 shows the results of our method with different $\mu^{(0)}$ and $T$ . We found that larger $T$ achieved better performance. We also found that smaller $\mu^{(0)}$ achieved better performance in dataset B; however, larger $\mu^{(0)}$ achieved better performance in dataset A. This indicates that the best setting $\mu^{(0)}$ is different for different data distributions. Intuitively, dataset B requires a larger residual expansion (in other words, small $\mu^{(0)}$ ) to escape from a poor local minimum, while dataset A requires a smaller residual expansion.

We show comparison results in Table 2. We repeated each method 50 times from different initializations. K-means++ worked well with dataset B. On the other hand, Hartigan’s algorithm can improve the results of k-means++ in dataset A; however, this does not work in dataset B. The RE algorithm worked best in both cases, even though it was initialized by random seeding. Moreover, the RE algorithm with $T=30$ achieved comparable speed to k-means++ with better performance for dataset A.

7.1.2 Real-world data experiments

We used two real-world datasets for comparison: the cloud dataset444https://archive.ics.uci.edu/ml/datasets/Cloud and the COIL20 dataset [21]. We performed experiments in the same manner as in Section 7.1.1.

Table 3 shows comparative results. In the cloud dataset, k-means++ achieves faster and better clustering than random seeding. The RE algorithm with $T=30$ achieved better clustering than k-means++ with about 1.8 times the computational cost. The RE algorithm with $T=1000$ performed best, and found the near-global optimum in every case. For the COIL20 dataset, although k-means++ and Hartigan’s algorithm did not work well, the RE algorithm significantly outperformed the other methods.

7.2 Point set registration

We compared the RE algorithm with the ICP algorithm and Go-ICP [29]. Go-ICP is known as a method that can achieve global optimization. We used the bunny model from the Stanford3D dataset555http://graphics.stanford.edu/data/3Dscanrep/, as in Fig. 6. For the target model, we used a partial point set as in Fig. 6(b). In the experiments, point sets were normalized within a cube of $[-1,1]^{3}$ .

We made a rotation matrix $\boldsymbol{\mathbf{R}}_{gt}$ from a random rotation axis and the rotation angle $\phi$ . The target point set was constructed by this rotation matrix, and we added Gaussian noise with a standard deviation of $\sigma=0.03$ . We performed 50 tests with different random rotation axes at each rotation angle $\phi=\pi/3,5\pi/12,\pi/2$ . For measurement of the error, we used the objective value Eq. (2) and regarded the results as successful if the objective error was less than 1 (this value is approximately twice of the average objective value using Go-ICP, as in Fig. 7).

We first show the comparison results between the RE algorithm and the ICP algorithm as in Table 4. The RE algorithm with $T=30$ achieved a better success rate with almost the same number of iterations as the ICP algorithm. Using a large $T$ can improve the results to a small extent.

We also compare our method to Go-ICP [29]. Go-ICP has two steps: the ICP algorithm and the branch-and-bound algorithm. We compared the original Go-ICP and RE + Go-ICP, which has the two steps of ICP with the RE and branch-and-bound algorithms. Fig. 7 plots all 50 results in $\phi=5\pi/12$ . Note that this comparison was implemented entirely in C++. Go-ICP always found the global optimum solution; however, it required significant computation. RE + Go-ICP reduced computational cost while achieving global optimization.

7.3 Optimized product quantization

We show that the RE algorithm is successful in OPQ optimization problems. We compare our method with the alternating optimization method [9, 22]. For a dataset, we used SIFT 1M [11], which contains 100,000 128-dimensional SIFT descriptors for training. We set the subspace number $M=8$ and cluster number $k=256$ , which are often used in the field of approximate nearest neighbor search. For error measurement, we used the objective function value of Eq. (6). For our method, we set $\mu^{(0)}=0.5$ .

We plot the objective function value versus iteration number in Fig. 8. We performed five repetitions using different initializations and report the average objective values obtained. Our method improved the objective function value; moreover, it achieved rapid convergence in the cases of $T=30$ and $T=100$ . The RE algorithm elevates the objective function around the current solution; in other words, it transforms the gradient for the current solution into a steeper gradient, potentially causing rapid convergence.

7.4 Blind image deblurring

We evaluated our method with single blind image deblurring. There are many formulations for blind image deblurring. In this paper, we followed Pan et al.’s formulation [23], which can be minimized by alternating optimization. We compared three methods as follows: a coarse-to-fine strategy [23] and RE algorithms based on Alg. 1 and Alg. 2. We used the uniform blurred text images from the dataset provided by Lai et al. [16], which contains five latent images and four blurring kernels for a total of 20 blurred images. For all methods, we used the same objective function parameters, such as the regularization coefficients. For our method, we set $\mu^{(0)}=0.2$ and $T=100$ .

We show the results in Table 5. Our method significantly outperforms Pan’s method [23] and is successful for a significantly blurred image, as in Fig. 9. We found that Alg. 2 is superior to Alg. 1 in the cases of {image #3, kernel #4} and {image #5, kernel #4}. Note that these results are obtained by minimizing the same objective function, however using different optimization methods. Therefore our method likely improves upon other methods which use different objective functions [15, 28].

8 Conclusion

We proposed the RE algorithm, which is a novel global optimization algorithm for nonconvex LS problems. This method is based on a novel measurement of global convergence called RE convergence. We presented theoretical analysis of RE convergence and empirical results showing excellent performance of the RE algorithm for various optimization problems.

There remain many open questions in both theoretical and empirical aspects. We can guarantee that the solution with the largest RE constant is the global optimum in limited cases. To which problems this applies remains unknown. We plan to investigate the applicability of the RE algorithm to other nonconvex optimization problems, including non-LS problems.

Acknowledgement

This research is partially supportted by CREST (JPMJCR1686) and KAKENHI（15K12025）

Bibliography29

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. TPAMI , 34(11):2274–2282, 2012.
2[2] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In SODA , pages 1027–1035, 2007.
3[3] P. J. Besl and H. D. Mc Kay. A method for registration of 3-D shapes. TPAMI , 14(2):239–256, 1992.
4[4] G. Blais and M. D. Levine. Registering multiview range data to create 3D computer objects. TPAMI , 17(8):820–824, 1995.
5[5] A. Blake and A. Zisserman. Visual reconstruction . MIT Press, 1987.
6[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning , 3(1):1–122, 2011.
7[7] S. Cho and S. Lee. Fast motion deblurring. ACM To G , 28(5):145:1–145:8, 2009.
8[8] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In In Workshop on Statistical Learning in Computer Vision, ECCV , pages 1–22, 2004.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Residual Expansion Algorithm: Fast and Effective Optimization for

Abstract

1 Introduction

2 Related works

2.1 Nonconvex least squares problems

2.1.1 K-means clustering

2.1.2 Point-set registration

2.1.3 Optimized product quantization

2.1.4 Blind image deblurring

2.2 Nonconvex optimization techniques

3 Residual expansion convergence

Definition 3.1** (Residual Expansion).**

Definition 3.2** (α\alphaα RE convergence).**

3.1 Unconstrained and differentiable problems

Theorem 1**.**

Proof.

3.2 General relationship between the α\alphaα RE convergence and the global optimum

4 Residual expansion algorithm

4.1 Parameter setting for convergence

4.2 Advantages of the RE algorithm

5 Alternate direction method of multipliers for least squares problems

5.1 Relation to the RE algorithm

5.2 Regularized least squares problems

6 Implementation details

7 Experimental results

7.1 K-means clustering

7.1.1 Synthetic data experiments

7.1.2 Real-world data experiments

7.2 Point set registration

7.3 Optimized product quantization

7.4 Blind image deblurring

8 Conclusion

Acknowledgement

Definition 3.1 (Residual Expansion).

Definition 3.2 ( $\alpha$ RE convergence).

Theorem 1.

3.2 General relationship between the $\alpha$ RE convergence and the global optimum