Feature Graph Learning for 3D Point Cloud Denoising

Wei Hu; Xiang Gao; Gene Cheung; Zongming Guo

arXiv:1907.09138·cs.CV·June 24, 2020

Feature Graph Learning for 3D Point Cloud Denoising

Wei Hu, Xiang Gao, Gene Cheung, Zongming Guo

PDF

TL;DR

This paper introduces a novel feature graph learning method that optimizes a Mahalanobis distance-based graph kernel using a single observation, significantly improving 3D point cloud denoising performance.

Contribution

It proposes an efficient optimization algorithm for learning feature graphs from minimal data, with applications to 3D point cloud denoising, achieving state-of-the-art results.

Findings

01

Achieves superior denoising performance compared to existing methods.

02

Develops a fast, eigen-decomposition-free optimization algorithm.

03

Demonstrates effectiveness with high-dimensional feature vectors.

Abstract

Identifying an appropriate underlying graph kernel that reflects pairwise similarities is critical in many recent graph spectral signal restoration schemes, including image denoising, dequantization, and contrast enhancement. Existing graph learning algorithms compute the most likely entries of a properly defined graph Laplacian matrix $L$ , but require a large number of signal observations $z$ 's for a stable estimate. In this work, we assume instead the availability of a relevant feature vector $f_{i}$ per node $i$ , from which we compute an optimal feature graph via optimization of a feature metric. Specifically, we alternately optimize the diagonal and off-diagonal entries of a Mahalanobis distance matrix $M$ by minimizing the graph Laplacian regularizer (GLR) $z^{⊤} Lz$ , where edge weight is $w_{i,j} =…

Tables5

Table 1. TABLE I: List of abbreviations and their descriptions.

Abbreviation	Description
GLR	Graph Laplacian Regularizer
PD	Positive Definite
PSD	Positive Semi-Definite
PG	Proximal Gradient
IGMRF	Intrinsic Gaussian Markov Random Fields
MAP	Maximum a Posteriori
MLS	Moving Least Squares
RIMLS	Robust Implicit MLS
APSS	Algebraic Point Set Surfaces
LOP	Locally Optimal Projection
WLOP	Weighted LOP
AWLOP	Anisotropic WLOP
NLD	Non-Local Denoising
LR	Low Rank

Table 2. TABLE II: Comparison results under Gaussian noise σ = 0.05 𝜎 0.05 \sigma=0.05 and sampling rate 20 % percent 20 20\% in Benchmark .

	Anchor	Daratech	DC	Gargoyle	Quasimoto
MSE
Ours	0.240	0.301	0.223	0.256	0.197
Diagonal	0.251	0.339	0.241	0.275	0.205
SNR
Ours	48.17	43.78	47.04	46.93	47.87
Diagonal	47.72	42.59	46.23	46.20	47.50

Table 3. TABLE III: MSE comparison for different models in Benchmark with Gaussian noise.

$σ = 0.02$
Model	Noisy	APSS	RIMLS	AWLOP	NLD	MRPCA	LR	GLR	Diagonal	Baseline1	Baseline2	Ours
Anchor	0.259	0.208	0.212	0.237	0.231	0.202	0.228	0.189	0.199	0.197	0.198	0.194
Daratech	0.245	0.203	0.209	0.228	0.222	0.225	0.213	0.197	0.198	0.196	0.195	0.192
DC	0.237	0.186	0.198	0.211	0.206	0.189	0.206	0.177	0.180	0.180	0.180	0.177
Gargoyle	0.257	0.208	0.217	0.230	0.230	0.215	0.240	0.202	0.205	0.204	0.204	0.200
Quasimoto	0.224	0.171	0.183	0.196	0.190	0.171	0.180	0.162	0.162	0.163	0.162	0.161
Average	0.244	0.195	0.203	0.220	0.215	0.200	0.213	0.185	0.189	0.188	0.188	0.184
$σ = 0.03$
Anchor	0.322	0.239	0.244	0.259	0.265	0.230	0.246	0.217	0.225	0.227	0.225	0.221
Daratech	0.303	0.242	0.258	0.298	0.258	0.259	0.252	0.238	0.243	0.244	0.244	0.236
DC	0.293	0.210	0.226	0.257	0.235	0.211	0.221	0.203	0.205	0.207	0.204	0.200
Gargoyle	0.318	0.239	0.252	0.294	0.262	0.241	0.257	0.233	0.233	0.237	0.233	0.228
Quasimoto	0.274	0.188	0.203	0.226	0.217	0.187	0.193	0.176	0.179	0.181	0.179	0.175
Average	0.302	0.223	0.236	0.266	0.247	0.225	0.233	0.213	0.217	0.219	0.217	0.212
$σ = 0.04$
Anchor	0.372	0.254	0.263	0.306	0.297	0.242	0.259	0.228	0.236	0.243	0.236	0.232
Daratech	0.348	0.282	0.308	0.286	0.295	0.288	0.283	0.276	0.280	0.286	0.284	0.274
DC	0.338	0.227	0.254	0.270	0.269	0.223	0.234	0.228	0.223	0.225	0.219	0.215
Gargoyle	0.368	0.262	0.277	0.297	0.294	0.257	0.269	0.257	0.253	0.259	0.253	0.245
Quasimoto	0.318	0.201	0.219	0.218	0.252	0.199	0.204	0.187	0.195	0.195	0.191	0.182
Average	0.348	0.245	0.264	0.275	0.281	0.241	0.249	0.235	0.237	0.242	0.237	0.229
$σ = 0.05$
Anchor	0.417	0.267	0.281	0.315	0.331	0.253	0.270	0.244	0.246	0.248	0.246	0.240
Daratech	0.387	0.350	0.373	0.359	0.330	0.325	0.347	0.308	0.319	0.326	0.322	0.301
DC	0.381	0.251	0.265	0.324	0.306	0.239	0.247	0.241	0.231	0.247	0.245	0.222
Gargoyle	0.412	0.292	0.305	0.365	0.334	0.277	0.281	0.273	0.266	0.274	0.268	0.256
Quasimoto	0.362	0.229	0.242	0.267	0.291	0.207	0.218	0.209	0.195	0.203	0.196	0.193
Average	0.392	0.278	0.293	0.326	0.318	0.260	0.273	0.255	0.251	0.260	0.255	0.242
$σ = 0.10$
Anchor	0.631	0.389	0.402	0.536	0.571	0.382	0.398	0.407	0.352	0.360	0.350	0.333
Daratech	0.533	0.504	0.542	0.466	0.508	0.445	0.431	0.446	0.404	0.404	0.402	0.398
DC	0.575	0.403	0.464	0.502	0.529	0.405	0.402	0.389	0.378	0.390	0.385	0.368
Gargoyle	0.619	0.444	0.475	0.535	0.564	0.423	0.428	0.438	0.419	0.428	0.426	0.416
Quasimoto	0.561	0.402	0.414	0.473	0.523	0.388	0.387	0.356	0.293	0.312	0.304	0.286
Average	0.584	0.428	0.459	0.502	0.539	0.409	0.409	0.407	0.369	0.379	0.373	0.360

Table 4. TABLE IV: SNR (dB) comparison for different models in Benchmark with Gaussian noise.

$σ = 0.02$
Model	Noisy	APSS	RIMLS	AWLOP	NLD	MRPCA	LR	GLR	Diagonal	Baseline1	Baseline2	Ours
Anchor	47.41	49.61	49.41	48.31	48.53	49.88	48.69	50.55	50.03	50.14	50.11	50.30
Daratech	45.85	47.71	47.44	46.56	46.82	46.72	47.27	48.02	47.96	48.08	48.12	48.29
DC	46.42	48.83	48.23	47.59	47.82	48.68	47.83	49.34	49.19	49.16	49.16	49.33
Gargoyle	46.91	49.01	48.57	48.01	48.01	48.66	47.57	49.30	49.18	49.20	49.19	49.40
Quasimoto	46.61	49.27	48.60	47.92	48.22	49.27	48.78	49.81	49.85	49.77	49.82	49.90
Average	46.67	48.88	48.44	47.67	47.87	48.64	48.02	49.40	49.24	49.27	49.28	49.44
$σ = 0.03$
Anchor	45.25	48.24	48.00	46.69	47.16	48.60	47.91	49.20	48.82	48.72	48.83	48.99
Daratech	43.70	46.00	45.46	45.12	45.34	45.18	45.59	46.13	45.92	45.89	45.91	46.22
DC	44.32	47.64	46.94	46.04	46.49	47.62	47.10	47.94	47.84	47.76	47.90	48.11
Gargoyle	44.75	47.63	47.12	46.39	46.68	47.52	46.88	47.87	47.89	47.70	47.86	48.09
Quasimoto	44.58	48.34	47.57	46.53	46.89	48.40	48.09	49.00	48.83	48.72	48.82	49.06
Average	44.52	47.57	47.01	46.15	46.51	47.46	47.11	48.02	47.86	47.76	47.86	48.09
$σ = 0.04$
Anchor	43.78	47.60	47.27	45.74	46.02	48.09	47.41	48.67	48.34	48.04	48.35	48.51
Daratech	42.34	44.46	43.58	44.32	43.98	44.25	44.41	44.64	44.53	44.30	44.36	44.73
DC	42.86	46.84	45.71	45.11	45.15	47.00	46.54	46.80	47.03	46.93	47.21	47.38
Gargoyle	43.31	46.69	46.14	45.44	45.53	46.88	46.44	46.89	47.05	46.81	47.06	47.37
Quasimoto	43.09	47.68	46.80	46.85	45.40	47.80	47.52	48.40	48.00	47.97	48.20	48.67
Average	43.07	46.65	45.90	45.49	45.21	46.80	46.46	47.08	46.99	46.81	47.04	47.33
$σ = 0.05$
Anchor	42.65	47.09	46.59	45.44	44.94	47.64	46.99	48.00	47.90	47.80	47.92	48.17
Daratech	41.28	42.29	41.64	42.02	42.87	43.03	42.37	43.56	43.21	42.99	43.11	43.80
DC	41.68	45.86	45.30	43.27	43.85	46.33	46.01	46.24	46.68	45.98	46.06	47.07
Gargoyle	42.17	45.61	45.18	43.37	44.28	46.12	45.99	46.28	46.56	46.25	46.45	46.93
Quasimoto	41.79	46.36	45.83	44.83	43.99	47.39	46.85	47.28	47.96	47.55	47.94	48.08
Average	41.91	45.44	44.91	43.79	43.99	46.10	45.64	46.27	46.46	46.11	46.30	46.81
$σ = 0.10$
Anchor	38.52	43.35	43.04	40.13	39.51	43.52	43.12	42.89	44.33	44.10	44.38	44.87
Daratech	38.10	38.66	37.92	39.43	38.57	39.88	40.24	39.86	40.84	40.82	40.88	41.00
DC	37.57	41.14	39.72	38.91	38.38	41.07	41.13	41.45	41.71	41.39	41.54	42.00
Gargoyle	38.12	41.44	40.78	39.57	39.03	41.91	41.79	41.56	41.97	41.76	41.82	42.06
Quasimoto	37.43	40.78	40.49	39.13	38.13	41.11	41.13	41.96	43.88	43.26	43.53	44.14
Average	37.95	41.07	40.39	39.43	38.72	41.50	41.48	41.54	42.55	42.27	42.43	42.81

Table 5. TABLE V: MSE Comparison for different optimization algorithms under Gaussian noise σ = 0.05 𝜎 0.05 \sigma=0.05 in Benchmark .

	Anchor	Daratech	DC	Gargoyle	Quasimoto
Vanilla PG	0.240	0.301	0.222	0.256	0.191
Ours	0.240	0.301	0.222	0.256	0.193

Equations104

z^{⊤} Lz = i = 1 \sum N j = 1 \sum N w_{i, j} (z_{i} - z_{j})^{2} < ϵ,

z^{⊤} Lz = i = 1 \sum N j = 1 \sum N w_{i, j} (z_{i} - z_{j})^{2} < ϵ,

z^{⊤} Lz = k \sum λ_{k} α_{k}^{2}

z^{⊤} Lz = k \sum λ_{k} α_{k}^{2}

z^{⊤} L (z) z = i \sim j \sum w_{i, j} (z_{i}, z_{j}) \cdot (z_{j} - z_{i})^{2},

z^{⊤} L (z) z = i \sim j \sum w_{i, j} (z_{i}, z_{j}) \cdot (z_{j} - z_{i})^{2},

δ_{i, j} = \frac{( x _{i} - x _{j} ) ^{2}}{σ _{x}^{2}} + \frac{∥ l _{i} - l _{j} ∥ _{2}^{2}}{σ _{l}^{2}},

δ_{i, j} = \frac{( x _{i} - x _{j} ) ^{2}}{σ _{x}^{2}} + \frac{∥ l _{i} - l _{j} ∥ _{2}^{2}}{σ _{l}^{2}},

\displaystyle\delta_{i,j}=\left(\mathbf{f}_{i}-\mathbf{f}_{j}\right)^{\top}\left[\begin{array}[]{cc}1/\sigma_{x}^{2}&0\\ 0&1/\sigma_{l}^{2}\end{array}\right]\left(\mathbf{f}_{i}-\mathbf{f}_{j}\right).

\displaystyle\delta_{i,j}=\left(\mathbf{f}_{i}-\mathbf{f}_{j}\right)^{\top}\left[\begin{array}[]{cc}1/\sigma_{x}^{2}&0\\ 0&1/\sigma_{l}^{2}\end{array}\right]\left(\mathbf{f}_{i}-\mathbf{f}_{j}\right).

δ_{i, j} = (f_{i} - f_{j})^{⊤} M (f_{i} - f_{j}),

δ_{i, j} = (f_{i} - f_{j})^{⊤} M (f_{i} - f_{j}),

M = Q Λ Q^{⊤}

M = Q Λ Q^{⊤}

= = (\nabla f_{i, j})^{⊤} M (\nabla f_{i, j}) [\nabla f_{i, j}^{(1)} \nabla f_{i, j}^{(2)}] Q [1 - 1 11] Λ [1000] Q^{⊤} [11 - 1 1] [\nabla f_{i, j}^{(1)} \nabla f_{i, j}^{(2)}] (\nabla f_{i, j}^{(1)} - \nabla f_{i, j}^{(2)})^{2}

= = (\nabla f_{i, j})^{⊤} M (\nabla f_{i, j}) [\nabla f_{i, j}^{(1)} \nabla f_{i, j}^{(2)}] Q [1 - 1 11] Λ [1000] Q^{⊤} [11 - 1 1] [\nabla f_{i, j}^{(1)} \nabla f_{i, j}^{(2)}] (\nabla f_{i, j}^{(1)} - \nabla f_{i, j}^{(2)})^{2}

(v_{i} - v_{j})^{2}

(v_{i} - v_{j})^{2}

= (\nabla f_{i, j}^{(1)} - \nabla f_{i, j}^{(2)})^{2}

M min {i, j} \sum exp {- (f_{i} - f_{j})^{⊤} M (f_{i} - f_{j})} d_{i, j} s.t. M ≻ 0.

M min {i, j} \sum exp {- (f_{i} - f_{j})^{⊤} M (f_{i} - f_{j})} d_{i, j} s.t. M ≻ 0.

M min {i, j} \sum exp {- (f_{i} - f_{j})^{⊤} M (f_{i} - f_{j})} d_{i, j} s.t. M ≻ 0; tr (M) \leq C .

M min {i, j} \sum exp {- (f_{i} - f_{j})^{⊤} M (f_{i} - f_{j})} d_{i, j} s.t. M ≻ 0; tr (M) \leq C .

{m_{i, i}} min {i, j} \sum exp {- g_{i, j}^{⊤} (M^{'} + diag (M)) g_{i, j}} d_{i, j} s.t. M ≻ 0; i \sum m_{i, i} \leq C .

{m_{i, i}} min {i, j} \sum exp {- g_{i, j}^{⊤} (M^{'} + diag (M)) g_{i, j}} d_{i, j} s.t. M ≻ 0; i \sum m_{i, i} \leq C .

{m_{i, i}} min {i, j} \sum exp {- k \sum m_{k, k} g_{i, j} (k)^{2} - g_{i, j}^{⊤} M^{'} g_{i, j}} d_{i, j} s.t. m_{i, i} - j \neq = i \sum ∣ m_{i, j} ∣ > 0, \forall i; i \sum m_{i, i} \leq C .

{m_{i, i}} min {i, j} \sum exp {- k \sum m_{k, k} g_{i, j} (k)^{2} - g_{i, j}^{⊤} M^{'} g_{i, j}} d_{i, j} s.t. m_{i, i} - j \neq = i \sum ∣ m_{i, j} ∣ > 0, \forall i; i \sum m_{i, i} \leq C .

{m_{i, i}} min {i, j} \sum exp {- k \sum m_{k, k} g_{i, j} (k)^{2}} \tilde{d}_{i, j} s.t. m_{i, i} - j \neq = i \sum ∣ m_{i, j} ∣ > 0, \forall i; i \sum m_{i, i} \leq C .

{m_{i, i}} min {i, j} \sum exp {- k \sum m_{k, k} g_{i, j} (k)^{2}} \tilde{d}_{i, j} s.t. m_{i, i} - j \neq = i \sum ∣ m_{i, j} ∣ > 0, \forall i; i \sum m_{i, i} \leq C .

\mathcal{S}=\left\{\mathbf{m}~{}\bigg{|}~{}m_{i,i}>\sum_{j\neq i}|m_{i,j}|,\sum_{i=1}^{K}m_{i,i}\leq C,\forall i\right\}.

\mathcal{S}=\left\{\mathbf{m}~{}\bigg{|}~{}m_{i,i}>\sum_{j\neq i}|m_{i,j}|,\sum_{i=1}^{K}m_{i,i}\leq C,\forall i\right\}.

I_{\mathcal{S}}(\mathbf{m})=\left\{\begin{array}[]{lr}0,&\mathbf{m}\in\mathcal{S}\\ \infty,&\text{otherwise}\end{array}\right.

I_{\mathcal{S}}(\mathbf{m})=\left\{\begin{array}[]{lr}0,&\mathbf{m}\in\mathcal{S}\\ \infty,&\text{otherwise}\end{array}\right.

m min {i, j} \sum exp {- k \sum m_{k, k} g_{i, j} (k)^{2}} \tilde{d}_{i, j} + I_{S} (m) .

m min {i, j} \sum exp {- k \sum m_{k, k} g_{i, j} (k)^{2}} \tilde{d}_{i, j} + I_{S} (m) .

\begin{split}&\bigtriangledown F(\mathbf{m})=\\ &\left[\begin{array}[]{c}-\sum\limits_{\{i,j\}}\exp\left\{-\sum\limits_{k}m_{k,k}\mathbf{g}_{i,j}(k)^{2}\right\}\mathbf{g}_{i,j}(1)^{2}\tilde{d}_{i,j}\\ \vdots\\ -\sum\limits_{\{i,j\}}\exp\left\{-\sum\limits_{k}m_{k,k}\mathbf{g}_{i,j}(k)^{2}\right\}\mathbf{g}_{i,j}(K)^{2}\tilde{d}_{i,j}\end{array}\right]\end{split}

\begin{split}&\bigtriangledown F(\mathbf{m})=\\ &\left[\begin{array}[]{c}-\sum\limits_{\{i,j\}}\exp\left\{-\sum\limits_{k}m_{k,k}\mathbf{g}_{i,j}(k)^{2}\right\}\mathbf{g}_{i,j}(1)^{2}\tilde{d}_{i,j}\\ \vdots\\ -\sum\limits_{\{i,j\}}\exp\left\{-\sum\limits_{k}m_{k,k}\mathbf{g}_{i,j}(k)^{2}\right\}\mathbf{g}_{i,j}(K)^{2}\tilde{d}_{i,j}\end{array}\right]\end{split}

\Pi_{I_{\mathcal{S}}}(\mathbf{v})=\left\{\begin{array}[]{lr}P_{\mathcal{T}}(\mathbf{v}),&\mathbf{1}^{\top}P_{\mathcal{T}}(\mathbf{v})\leq C,\\ P_{\mathcal{T}}(\mathbf{v}-\alpha\cdot\mathbf{1}),&\text{otherwise,}\end{array}\right.

\Pi_{I_{\mathcal{S}}}(\mathbf{v})=\left\{\begin{array}[]{lr}P_{\mathcal{T}}(\mathbf{v}),&\mathbf{1}^{\top}P_{\mathcal{T}}(\mathbf{v})\leq C,\\ P_{\mathcal{T}}(\mathbf{v}-\alpha\cdot\mathbf{1}),&\text{otherwise,}\end{array}\right.

m^{l + 1} = Π_{I_{S}} (m^{l} - t ▽ F (m^{l})),

m^{l + 1} = Π_{I_{S}} (m^{l} - t ▽ F (m^{l})),

M = [m_{1, 1} M_{2, 1} M_{1, 2} M_{2, 2}],

M = [m_{1, 1} M_{2, 1} M_{1, 2} M_{2, 2}],

m_{1, 1} - M_{2, 1}^{⊤} M_{2, 2}^{- 1} M_{2, 1} > 0.

m_{1, 1} - M_{2, 1}^{⊤} M_{2, 2}^{- 1} M_{2, 1} > 0.

M_{2, 1} min {i, j} \sum exp {- (f_{i} - f_{j})^{⊤} M (f_{i} - f_{j})} d_{i, j} s.t. m_{1, 1} - M_{2, 1}^{⊤} M_{2, 2}^{- 1} M_{2, 1} > 0, m_{1, 1} \leq C - tr (M_{2, 2}) .

M_{2, 1} min {i, j} \sum exp {- (f_{i} - f_{j})^{⊤} M (f_{i} - f_{j})} d_{i, j} s.t. m_{1, 1} - M_{2, 1}^{⊤} M_{2, 2}^{- 1} M_{2, 1} > 0, m_{1, 1} \leq C - tr (M_{2, 2}) .

{i, j} \sum exp {- 2 g_{i, j} (1) M_{2, 1}^{⊤} g_{i, j} (2))} \tilde{d}_{i, j}

{i, j} \sum exp {- 2 g_{i, j} (1) M_{2, 1}^{⊤} g_{i, j} (2))} \tilde{d}_{i, j}

M_{2, 1}^{⊤} M_{2, 2}^{- 1} M_{2, 1} = M_{2, 1}^{⊤} U Λ U^{⊤} M_{2, 1} = (U^{⊤} M_{2, 1})^{⊤} Λ (U^{⊤} M_{2, 1}),

M_{2, 1}^{⊤} M_{2, 2}^{- 1} M_{2, 1} = M_{2, 1}^{⊤} U Λ U^{⊤} M_{2, 1} = (U^{⊤} M_{2, 1})^{⊤} Λ (U^{⊤} M_{2, 1}),

m_{1, 1} > λ_{max} (U^{⊤} M_{2, 1})^{⊤} (U^{⊤} M_{2, 1}) \Rightarrow m_{1, 1} > λ_{max} M_{2, 1}^{⊤} U U^{⊤} M_{2, 1} \Rightarrow m_{1, 1} > λ_{max} M_{2, 1}^{⊤} M_{2, 1},

m_{1, 1} > λ_{max} (U^{⊤} M_{2, 1})^{⊤} (U^{⊤} M_{2, 1}) \Rightarrow m_{1, 1} > λ_{max} M_{2, 1}^{⊤} U U^{⊤} M_{2, 1} \Rightarrow m_{1, 1} > λ_{max} M_{2, 1}^{⊤} M_{2, 1},

M_{2, 1} min {i, j} \sum exp {- 2 g_{i, j} (1) M_{2, 1}^{⊤} g_{i, j} (2)} \tilde{d}_{i, j} s.t. m_{1, 1} > \frac{1}{θ _{m i n}} M_{2, 1}^{⊤} M_{2, 1} .

M_{2, 1} min {i, j} \sum exp {- 2 g_{i, j} (1) M_{2, 1}^{⊤} g_{i, j} (2)} \tilde{d}_{i, j} s.t. m_{1, 1} > \frac{1}{θ _{m i n}} M_{2, 1}^{⊤} M_{2, 1} .

S = {∥ M_{2, 1} ∥_{2}^{2} < θ_{min} m_{1, 1}},

S = {∥ M_{2, 1} ∥_{2}^{2} < θ_{min} m_{1, 1}},

I_{\mathcal{S}^{\prime}}(\mathbf{M}_{2,1})=\left\{\begin{array}[]{lr}0,&\mathbf{M}_{2,1}\in\mathcal{S}^{\prime}\\ \infty,&\text{otherwise}\end{array}\right.

I_{\mathcal{S}^{\prime}}(\mathbf{M}_{2,1})=\left\{\begin{array}[]{lr}0,&\mathbf{M}_{2,1}\in\mathcal{S}^{\prime}\\ \infty,&\text{otherwise}\end{array}\right.

M_{2, 1} min {i, j} \sum exp {- 2 g_{i, j} (1) M_{2, 1}^{⊤} g_{i, j} (2)} \tilde{d}_{i, j} + I_{S^{'}} (M_{2, 1}) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings

Full text

Feature Graph Learning for 3D Point Cloud Denoising

Wei Hu, Xiang Gao, Gene Cheung, and Zongming Guo W. Hu (e-mail: [email protected]), X. Gao (e-mail: [email protected]), and Z. Guo (e-mail: [email protected]) are with Wangxuan Institute of Computer Technology, Peking University, No. 128 Zhongguancun North Street, Beijing, China. G. Cheung (e-mail: [email protected]) is with York University, 4700 Keele Street Toronto, Ontario, Canada.

Abstract

Identifying an appropriate underlying graph kernel that reflects pairwise similarities is critical in many recent graph spectral signal restoration schemes, including image denoising, dequantization, and contrast enhancement. Existing graph learning algorithms compute the most likely entries of a properly defined graph Laplacian matrix $\mathbf{L}$ , but require a large number of signal observations $\mathbf{z}$ ’s for a stable estimate. In this work, we assume instead the availability of a relevant feature vector $\mathbf{f}_{i}$ per node $i$ , from which we compute an optimal feature graph via optimization of a feature metric. Specifically, we alternately optimize the diagonal and off-diagonal entries of a Mahalanobis distance matrix $\mathbf{M}$ by minimizing the graph Laplacian regularizer (GLR) $\mathbf{z}^{\top}\mathbf{L}\mathbf{z}$ , where edge weight is $w_{i,j}=\exp\{-(\mathbf{f}_{i}-\mathbf{f}_{j})^{\top}\mathbf{M}(\mathbf{f}_{i}-\mathbf{f}_{j})\}$ , given a single observation $\mathbf{z}$ . We optimize diagonal entries via proximal gradient (PG), where we constrain $\mathbf{M}$ to be positive definite (PD) via linear inequalities derived from the Gershgorin circle theorem. To optimize off-diagonal entries, we design a block descent algorithm that iteratively optimizes one row and column of $\mathbf{M}$ . To keep $\mathbf{M}$ PD, we constrain the Schur complement of sub-matrix $\mathbf{M}_{2,2}$ of $\mathbf{M}$ to be PD when optimizing via PG. Our algorithm mitigates full eigen-decomposition of $\mathbf{M}$ , thus ensuring fast computation speed even when feature vector $\mathbf{f}_{i}$ has high dimension. To validate its usefulness, we apply our feature graph learning algorithm to the problem of 3D point cloud denoising, resulting in state-of-the-art performance compared to competing schemes in extensive experiments.

Index Terms:

Graph learning, Mahalanobis distance, graph Laplacian regularizer, 3D Point cloud denoising

I Introduction

Graphs are flexible mathematical structures modeling pairwise relations between data entities, such as brain networks, social networks, computer networks and transportation networks. Nodes in a graph represent data collecting entities (e.g., users in a social group or sensors in a wireless network) while edges connecting nodes describe pairwise affinities. A scalar (weight) is often assigned to each edge, which reflects the degree of pairwise similarity between two nodes. In settings where the graph is not readily available, it is critical to first identify an appropriate underlying graph kernel—a process commonly called graph learning [1, 2, 3]—before it is used for many recent graph spectral signal restoration schemes, including image denoising, dequantization, deblurring, and contrast enhancement [4, 5, 6, 7].

Existing graph learning methods111We focus on learning of undirected graphs when one or fewer signal observation is available in this paper, while learning of directed graphs [3] is left for future work. can be roughly classified into two main categories: statistical methods and graph spectral methods. The common assumption behind statistical methods is that multiple data observations generated from the same probability model are available to determine the model parameters, which describe an underlying graph. Graph learning is then essentially the problem of estimating the inverse covariance or precision matrix given sufficient empirical data, with the addition of some prior topological information (e.g., sparsity) [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. However, the requirement of multiple signal observations is not applicable to scenarios where only a single observed signal is available.

On the other hand, graph spectral learning methods offer an alternative (or additional) signal representation perspective, where observed signals are assumed to lie in a low-dimensional subspace spanned by the low frequency components of the underlying graph topologies [18, 19, 20, 21, 22, 23, 24, 25]. Specifically, the frequency components are eigenvectors of a chosen variational operator on graphs like the adjacency matrix or the graph Laplacian matrix [26]. This “low-pass” spectral assumption translates to additional constraints during graph learning, potentially leading to more accurate graph estimates when there are few signal observations.

Extending on these previous works [21, 24, 27], in this paper we study spectral graph learning when the number of signal observations is extremely small—just one observation or even fewer (i.e., partial observation of one signal). This is typically the case for image restoration applications with non-stationary statistics [4, 5, 6, 7], where the underlying graph for a target image patch needs to be estimated for graph spectral processing given just one noisy and/or partial patch observation. To ease the ill-posedness of the problem, we assume the availability of a relevant feature vector $\mathbf{f}_{i}$ per node $i$ , $\mathbf{f}_{i}\in\mathbb{R}^{K}$ (e.g., the color pixel intensities), and that an edge weight is an inverse function of the feature distance (i.e., larger the inter-node distance, smaller the edge weight). Many previous graph constructions including bilateral filter [28, 29, 30, 4, 27] implicitly assume some notion of feature distance when assigning edge weights; our work is a more formal study of feature metric learning in a rigorous mathematical setting.

Specifically, we assume an edge weight $w_{i,j}=\exp\{-(\mathbf{f}_{i}-\mathbf{f}_{j})^{\top}\mathbf{M}(\mathbf{f}_{i}-\mathbf{f}_{j})\}$ , where $\mathbf{M}\in\mathbb{R}^{K\times K}$ is the Mahalanobis distance metric matrix [31] given feature difference $\mathbf{f}_{i}-\mathbf{f}_{j}$ for the two connected nodes $i$ and $j$ . Given a single observation $\mathbf{z}\in\mathbb{R}^{N}$ , we seek to minimize the Graph Laplacian Regularizer (GLR) [4] $\mathbf{z}^{\top}\mathbf{L}(\mathbf{M})\mathbf{z}$ using $\mathbf{M}$ , which measures the smoothness of the signal $\mathbf{z}$ with respect to the graph Laplacian matrix $\mathbf{L}$ . Note that because the feature dimension $K$ is often much smaller than signal dimension $N$ , variable $\mathbf{M}$ has $K^{2}$ entries only, while in general Laplacian matrix $\mathbf{L}$ has $N^{2}$ entries, resulting in more stable parameter estimation given the same observed data.

To find the optimal $\mathbf{M}$ , we alternately optimize the diagonal and off-diagonal entries. When optimizing diagonal entries, we derive linear inequalities based on the Gershgorin circle theorem (GCT) [32] to keep $\mathbf{M}$ positive definite (PD) and then employ proximal gradient (PG) descent [33] to acquire a solution. When optimizing off-diagonal entries, we design a block descent algorithm that iteratively optimizes one row and column of $\mathbf{M}$ until convergence. To keep $\mathbf{M}$ PD, we constrain the Schur complement of $\mathbf{M}$ to be PD according to the Haynsworth inertia additivity [34], which is enforced using PG descent. Further, we relax the PD constraint via vector-norm bounding to avoid matrix inverse computation during the optimization. Our algorithm also mitigates full eigen-decomposition of $\mathbf{M}$ , thus ensuring fast computation speed even when the feature dimension $K$ is large.

To validate the usefulness of our proposed feature graph learning algorithm, we apply it to the problem of 3D point cloud denoising. Point clouds provide efficient representation for arbitrarily-shaped objects, which consist of a set of irregularly-spaced points. The maturity of depth sensing and 3D laser scanning techniques222Commercial products include Microsoft Kinect, LiDAR, Intel RealSense, etc. enables convenient acquisition of 3D point clouds, which have a variety of applications such as 3D immersive tele-presence, navigation for autonomous vehicles, free-viewpoint rendering, and heritage reconstruction [35]. However, point clouds are often perturbed by noise, which comes from hardware, software or other causes.

To denoise point clouds, we first assume that local patches are self-similar [36, 37] and connect corresponding points into a graph. Assuming a first-order intrinsic Gaussian Markov random field (IGMRF) model [38], we pose a Maximum a Posteriori (MAP) estimation problem with GLR as signal prior. Interpreting the precision matrix in the IGMRF model as a graph Laplacian, we employ the feature graph learning to optimize edge weights, where for each point we employ 3D coordinates and surface normals as relevant features. Finally, we optimize the point cloud and the underlying graph alternately until convergence. Extensive experiments show that we achieve state-of-the-art performance compared to competing methods [39, 40, 41].

To summarize, the main contributions of our works are:

To identify an appropriate underlying graph given a single signal observation $\mathbf{z}$ , we formulate a feature graph learning problem by minimizing the GLR $\mathbf{z}^{\top}\mathbf{L}(\mathbf{M})\mathbf{z}$ using the Mahalanobis distance metric matrix $\mathbf{M}$ as variable, assuming feature vector per node is available. 2. 2.

We develop a fast block descent algorithm to optimize the feature metric matrix $\mathbf{M}$ , while keeping $\mathbf{M}$ positive definite and mitigating full matrix eigen-decomposition and large matrix inverse; 3. 3.

We employ feature graph learning to 3D point cloud denoising, where the graph for each set of self-similar patches is computed from 3D coordinates and surface normals as features, resulting in superior denoising performance.

The paper is organized as follows. We first review previous works on graph learning and point cloud denoising in Section II. Then we introduce basic concepts in graph spectral processing in Section III. In Section IV, we describe the proposed problem formulation and algorithm development for feature graph learning. We then apply to the problem of point cloud denoising in Section V. Finally, experimental results and conclusions are presented in Section VI and VII, respectively.

II Related Work

We overview previous works on graph learning and point cloud denoising in order.

II-A Graph Learning

Previous graph learning methods can be divided into two main categories: statistical methods and graph spectral methods.

Statistical methods: In graphical models including Markov random fields [38] and Bayesian networks [42], edges in the graph encode conditional dependencies among random variables represented as nodes. Learning the graph structure amounts to learning the inverse covariance or precision matrix for such models. Dempster [11] proposed to introduce zero entries in inverse covariance matrices for simplified covariance estimation. The estimation of a sparse inverse covariance matrix was then studied in several works [12, 13, 8]. Friedman et al. formulated sparse inverse covariance estimation with a regularization framework and developed the Graphical Lasso algorithm to address the regularized optimization problem [10]. Some algorithmic extensions of the Graphical Lasso are presented in [9, 14], and a few computationally efficient variations are discussed in [15, 16, 17]. However, inverse covariance estimation methods assume many observations of a graphical model, which is not applicable for many imaging applications.

Graph spectral methods: The key idea is to enforce low frequency representation of observed signals as well as constraints for a valid graph Laplacian matrix. Tenenbaum et al. [18] proposed to learn combinatorial graph Laplacians using a proposed sparse model. A regression framework was presented in [43] to learn a graph Laplacian matrix based on a fitness metric between the signals and the graph, which essentially evaluates the smoothness of the signals on the graph. Dong et al. [21] and Kalofolias [19] proposed to learn Laplacian matrices from the smoothness prior of the graph signal. Egilmez et al. [44] proposed graph learning under pre-defined graph structural and graph Laplacian constraints. Yang et al. [27] computed optimal feature weights in a similarity graph given a restored binary classifier signal. This is an earlier version of our feature metric learning, but restricts the search space only to diagonal matrices, which limits its effectiveness.

Orthogonally, some studies focus on inferring graph topologies from signals that are diffused on a graph over time. A fitness metric similar to the regression framework was employed in [45] to learn a valid graph topology. In particular, Segarra et al. [22] and Pasdeloup et al. [23] focused on learning graph shift/diffusion operators (such as adjacency matrices) from a set of diffused graph signals. Sardellitti et al. [20] proposed to learn the graph topology from data under the assumption of band-limited signals, which corresponded to signals with clustering properties. Nonetheless, this class of methods also assume a large number of signal observations for a stable estimate.

II-B Point Cloud Denoising

Point cloud denoising methods mainly include Moving Least Squares (MLS) based methods, Locally Optimal Projection (LOP) based methods, sparsity based methods, non-local based methods and graph-based methods.

MLS-based methods: MLS-based methods approximate the point cloud with a smooth surface and then project the points of the point cloud onto the fitted surface. [46] used the MLS projection operator to calculate the optimal MLS surface of the point cloud, and moved the points around the surface to the MLS surface. [47] proposed a MLS-based spherical fitting denoising method (APSS). Compared with the aforementioned MLS projection-based algorithm, this method improved the stability at low sampling rate and high curvature. [48] proposed an algorithm based on improved MLS and local kernel regression to smooth the point cloud surface (RIMLS). However, these MLS-based methods are often sensitive to outliers.

LOP-based methods: The widely known LOP [49] aimed to produce a set of points to represent the underlying surface while enforcing a uniform distribution over the point cloud. Weighted LOP (WLOP) [50] provided a more uniformly distributed output than LOP by adapting a repulse term to the local density. Further, anisotropic WLOP (AWLOP) [51] modified WLOP with an anisotropic weighting function in order to preserve sharp features better. Nevertheless, LOP-based methods often suffer from over-smoothing.

Sparsity based methods: These methods are based on sparse representation theory [52], and generally involve two phases. In the first phase, the sparse reconstruction of the cloud normals is obtained by solving the global minimization problem of sparse regularization. In [53] $l_{1}$ regularization was adopted, while [54] used $l_{0}$ regularization to seek more characteristic sparsity. In the second phase, each point position is updated by solving global $l_{1}$ (or $l_{0}$ ) minimization problem based on the reconstructed normals and local planarity hypothesis. The recently proposed method called Moving Robust Principal Components Analysis (MRPCA) [39] used weighted $l_{1}$ minimization of the point deviations from the local reference plane to preserve sharp features. However, when the noise level is high, over-smoothing or over-sharpening tends to occur [54].

Non-local based methods: These approaches exploit the non-local similarities among patches in a point cloud. In [55], an extended non-local denoising (NLD) algorithm was introduced to process point clouds, where the neighborhood of each point was described by the polynomial coefficients of the local MLS surface to compute point similarity. [56] and [57] applied a scale space scheme and non-local means denoising algorithm. [37] extended the BM3D [58] algorithm to point cloud denoising, searched similar patches globally via Iterative Closest Point (ICP) [59] algorithm, and then combined them into a collaborative group for denoising. [40] utilized patch self-similarity and optimized for a low rank (LR) dictionary representation of the extracted patches to smooth 3D patches. However, the computational complexity of such methods is often high due to the global search.

Graph-based methods: This class of methods interpret a point cloud as a signal on a graph, and perform denoising via chosen graph filters. In [60], the input point cloud was represented as a signal on a $k$ -nearest-neighbor graph and then denoised via a convex optimization problem regularized by the gradient of the point cloud on the graph. In [61], a reweighted graph Laplacian regularizer for surface normals was designed, with a general $l_{p}$ -norm fidelity term that modeled two types of additive noise. Moreover, they established a linear relationship between normals and 3D point coordinates via bipartite graph approximation for ease of optimization. [41] proposed graph Laplacian regularization (GLR) of a low dimensional manifold model (LDMM), and sought self-similar patches to denoise them simultaneously. Instead of directly smoothing the coordinates or normals of 3D points, [62] estimated a local tangent plane at each 3D point based on a graph and then reconstructed each 3D point by weighted averaging of its projections on multiple tangent planes.

Our proposed approach belongs to the family of graph-based methods. The key difference is that edge weights in our graph are not pre-defined with hand-crafted parameters, but optimized rigorously via feature metric learning given available signal(s) assumed to be smooth with respect to the graph.

III Background on Spectral Graph Theory

We first review basic concepts in spectral graph theory [26] that are essential in our feature graph learning and point cloud denoising algorithms.

III-A Graph and Graph Laplacian

We consider an undirected graph $\mathcal{G}=\{\mathcal{V},\mathcal{E},\mathbf{W}\}$ composed of a node set $\mathcal{V}$ of cardinality $|\mathcal{V}|=N$ , an edge set $\mathcal{E}$ connecting nodes, and a weighted adjacency matrix $\mathbf{W}$ . Each edge $(i,j)\in\mathcal{E}$ is associated with a non-negative weight $w_{i,j}$ which reflects the degree of similarity between nodes $i$ and $j$ .

Among different variants of Laplacian matrices, in this paper we employ the combinatorial graph Laplacian [63, 64, 30] defined as $\mathbf{L}:=\mathbf{D}-\mathbf{W}$ , where $\mathbf{D}$ is the degree matrix—a diagonal matrix where $d_{i,i}=\sum_{j=1}^{N}w_{i,j}$ .

III-B Graph Laplacian Regularizer

Graph signal refers to data that resides on the nodes of a graph, such as functionality of regions on a neural network and temperatures on a sensor network.

A graph signal $\mathbf{z}\in\mathbb{R}^{N}$ defined on a graph $\mathcal{G}$ is smooth with respect to $\mathcal{G}$ [65] if

[TABLE]

where $\epsilon$ is a small positive scalar. To satisfy (1), connected node pair $z_{i}$ and $z_{j}$ must be similar for a large edge weight $w_{i,j}$ ; for a small $w_{i,j}$ , $z_{i}$ and $z_{j}$ can differ significantly. Hence, (1) forces $\mathbf{z}$ to adapt to the topology of $\mathcal{G}$ , and is commonly called the graph Laplacian Regularizer (GLR) [29, 4]. This prior also has a frequency interpretation:

[TABLE]

where $\lambda_{k}$ is the $k$ -th eigenvalue of $\mathbf{L}$ and is commonly interpreted as the $k$ -th graph frequency, and $\alpha_{k}=\mathbf{v}_{k}^{\top}\mathbf{z}$ is the inner-product between the corresponding $k$ -th eigenvector and signal $\mathbf{z}$ . In other words, $\alpha_{k}^{2}$ is the energy in the $k$ -th graph frequency for signal $\mathbf{z}$ . Thus, a small $\mathbf{z}^{\top}\mathbf{L}\mathbf{z}$ means that most signal energies are in the low graph frequencies, or $\mathbf{z}$ is roughly low-pass.

III-C Signal-Dependent Graph Laplacian Regularizer

In the aforementioned GLR, the graph Laplacian $\mathbf{L}$ is fixed, which does not promote reconstruction of the target signal with discontinuities if the corresponding edge weights are not very small. It is thus extended to signal-dependent GLR in [5, 4, 6] by considering $\mathbf{L}(\mathbf{z})$ as a function of the graph signal $\mathbf{z}$ . Specifically, an edge weight $w_{i,j}(z_{i},z_{j})$ is an inverse function of the inter-node pixel intensity difference, e.g., $w_{i,j}(z_{i},z_{j})=\exp\{-(z_{i}-z_{j})^{2}/\sigma^{2}\}$ . The reweighted prior is defined as

[TABLE]

where $w_{i,j}(z_{i},z_{j})$ is the $(i,j)$ -th element of the corresponding adjacency matrix $\mathbf{W}$ .

It has been shown in [6, 4, 5] that minimizing the signal-dependent GLR iteratively can promote piecewise smoothness (PWS) in the reconstructed graph signal $\mathbf{z}$ , assuming that edge weights are appropriately initialized. In our more general setting, given a feature vector per node—which may include the signal intensity as one feature—our work can be considered a general case that includes signal-dependent GLR as a special case, where we compute the best feature graph via an optimization of the Mahalanobis distance metric.

IV Feature Metric Learning

IV-A Problem Formulation

Conceptually, an edge weight $w_{i,j}$ reflects the similarity between samples at nodes $i$ and $j$ ; specifically, using the commonly used Gaussian kernel [29], edge weight $w_{i,j}=\exp\left\{-\delta_{i,j}\right\}$ , where $\delta_{i,j}$ denotes the estimated feature distance between samples $i$ and $j$ . One advantage of the Gaussian kernel is that edge weight $w_{i,j}$ is in range $[0,1]$ , ensuring the resulting combinatorial graph Laplacian matrix $\mathbf{L}$ to be positive semi-definite (PSD) [66].

The feature distance between two samples essentially measures the inter-sample similarity. As one well-known example of feature distance, consider the bilateral filter in image denoising [28] that employs pixel intensities $x_{i}$ and pixel locations $\mathbf{l}_{i}$ as relevant features to compute $\delta_{i,j}$ , namely,

[TABLE]

where $\sigma_{x}$ and $\sigma_{l}$ are parameters. Defining $\mathbf{f}_{i}=[x_{i}~{}~{}\mathbf{l}_{i}]^{\top}$ , we can rewrite (4) in matrix form as:

[TABLE]

[28] shows that with appropriate parameters $\sigma_{x}$ and $\sigma_{l}$ , the bilateral filter can achieve very good edge-preserving image denoising performance. How to best determine $\sigma_{x}$ and $\sigma_{l}$ , however, was left unanswered.

More generally, associated with each sample $i$ is a length- $K$ vector of relevant features, and our goal is to compute an optimal Mahalanobis distance for the given features:

[TABLE]

where $\mathbf{M}\in\mathbb{R}^{K\times K}$ is a positive definite (PD) matrix333PD is desirable, so that if $\mathbf{f}_{i}-\mathbf{f}_{j}\neq\mathbf{0}$ , then $(\mathbf{f}_{i}-\mathbf{f}_{j})^{\top}\mathbf{M}(\mathbf{f}_{i}-\mathbf{f}_{j})>0$ .. As a special case, when $\mathbf{M}$ is a diagonal matrix with strictly positive diagonal entries, the definition in (8) defaults to that in [27]. Diagonal $\mathbf{M}$ can capture the relative importance of individual features when computing $\delta_{i,j}$ , but fails to capture possible cross-correlation among features, and thus is sub-optimal in the general case.

IV-A1 Importance of Off-diagonal Terms in $\mathbf{M}$

For completeness, we illustrate the importance of off-diagonal terms via the following analysis and example. Fundamentally, a real symmetric matrix ${\mathbf{M}}$ is normal and thus diagonalizable, i.e., it can be eigen-decomposed into the following form:

[TABLE]

where ${\bm{\Lambda}}$ is a diagonal matrix with eigenvalues $\lambda_{i}$ along its diagonal, and ${\mathbf{Q}}$ contains the corresponding eigenvectors as columns. Although the spectral theorem requires ${\mathbf{Q}}$ to be a unitary matrix, more generally, we can interpret (9) to mean that symmetric real matrix ${\mathbf{M}}$ generalizes any diagonal matrix ${\bm{\Lambda}}$ by pre- and post-multiplying it by any chosen square matrix ${\mathbf{Q}}$ and its transpose. To demonstrate the importance of this generalization, consider the following simple example. Define first the difference vector as the difference between feature vectors ${\mathbf{f}}_{i}$ and ${\mathbf{f}}_{j}$ , i.e., $\nabla{\mathbf{f}}_{i,j}={\mathbf{f}}_{i}-{\mathbf{f}}_{j}$ . Thus, given metric ${\mathbf{M}}$ , the Mahalanobis distance between nodes $i$ and $j$ is computed as $({\mathbf{f}}_{i}-{\mathbf{f}}_{j})^{\top}{\mathbf{M}}({\mathbf{f}}_{i}-{\mathbf{f}}_{j})=\left(\nabla{\mathbf{f}}_{i,j}\right)^{\top}{\mathbf{M}}\left(\nabla{\mathbf{f}}_{i,j}\right)$ . Suppose now that there are only two available features ${\mathbf{f}}_{i}=[{\mathbf{f}}_{i}^{(1)}~{}{\mathbf{f}}_{i}^{(2)}]^{\top}$ for every node $i$ . Suppose also that the optimal metric ${\mathbf{M}}$ in this case computes the difference of the two components in the difference vector $\nabla{\mathbf{f}}_{i,j}$ , i.e.,

[TABLE]

In this case, two nodes $i$ and $j$ with difference vector $\nabla{\mathbf{f}}_{i,j}=[\epsilon~{}~{}\epsilon]^{\top}$ having the same component $\nabla{\mathbf{f}}_{i,j}^{(1)}=\nabla{\mathbf{f}}_{i,j}^{(2)}=\epsilon$ will result in a Mahalanobis distance of $(\nabla{\mathbf{f}}_{i,j})^{\top}{\mathbf{M}}(\nabla{\mathbf{f}}_{i,j})=0$ , no matter how large $\epsilon$ is. On the other hand, any non-zero PSD diagonal matrix ${\bm{\Lambda}}^{\prime}=\mathrm{diag}(\lambda_{1}^{\prime},\lambda_{2}^{\prime})$ , where $\lambda_{1}^{\prime}>0$ or $\lambda_{2}^{\prime}>0$ , will lead to a distance $(\nabla{\mathbf{f}}_{i,j})^{\top}{\bm{\Lambda}}^{\prime}(\nabla{\mathbf{f}}_{i,j})=\lambda_{1}^{\prime}\epsilon^{2}+\lambda_{2}^{\prime}\epsilon^{2}=\infty$ as $\epsilon\rightarrow\infty$ . Thus, we can conclude that a diagonal-only metric ${\bm{\Lambda}}^{\prime}$ can be arbitrarily worse than the optimal metric ${\mathbf{M}}$ with off-diagonal terms, and off-diagonal terms for metric ${\mathbf{M}}$ are essential in computing feature distances.

To demonstrate that the above 2-feature example is not contrived, consider the following concrete application. Suppose the first and second features, ${\mathbf{f}}_{i}^{(1)}$ and ${\mathbf{f}}_{i}^{(2)}$ , measure the $x$ -location of a train on a line track at time [math] and time $t>0$ , respectively. Suppose the optimal metric ${\mathbf{M}}$ considers only the difference in velocity, ${\mathbf{v}}_{i}-{\mathbf{v}}_{j}$ , of the two trains $i$ and $j$ , i.e.,

[TABLE]

which is the same as (10). Clearly, if the two trains $i$ and $j$ have the same velocity, the Mahalanobis distance between them should be zero, regardless of their difference in start / end locations. Using any diagonal-only metric, however, would compute a Mahalanobis distance that is a function of the difference between their start / end locations, which is incorrect.

IV-A2 Formulation

We can now pose an optimization problem for $\mathbf{M}$ with GLR (3) as objective: we seek the optimal metric $\mathbf{M}$ that yields the smallest GLR term given feature vector $\mathbf{f}_{i}$ for each node $i$ , and one (or more) signal observation(s) $\mathbf{z}$ . Specifically, denote by $d_{i,j}=(z_{i}-z_{j})^{2}$ the inter-node sample difference square of observation $\mathbf{z}$ , we have

[TABLE]

If there are more than one signal observations $\mathbf{z}^{1},\ldots,\mathbf{z}^{S}$ available, then $d_{i,j}$ in (13) can be easily generalized to be the sum of inter-node sample difference squares of all observations, i.e. $d_{i,j}=\sum_{s=1}^{S}(z_{i}^{s}-z_{j}^{s})^{2}$ .

Minimizing (13) directly would lead to one pathological solution, i.e., $m_{i,i}=\infty,\forall i$ , resulting in edge weights $w_{i,j}=0$ . Topologically, this means nodes in the graph are all isolated, defeating the goal of finding a similarity graph. To avoid this solution, we constrain the trace of $\mathbf{M}$ to be smaller than a constant parameter $C$ , resulting in

[TABLE]

One naïve approach to the optimization problem in (14) using proximal gradient descent [33] is as follows. The objective $q(\mathbf{M})$ in (14) itself is convex and differentiable with respect to $\mathbf{M}$ , and thus a gradient descent step $\nabla q(\mathbf{M})$ can be computed. The constraints in (14) describe a feasible solution space that is a convex cone of all PD matrices with trace upper-bounded by $C$ . One can thus rewrite the constraints as a second objective term $h(\mathbf{M})$ that evaluates to 0 if $\mathbf{M}$ is in the convex set and $\infty$ otherwise. This convex but non-differentiable objective term $h(\mathbf{M})$ has the following proximal mapping $\mathrm{prox}_{h}(\mathbf{M})$ : orthogonally project the eigenvalues $\lambda_{k}$ ’s of $\mathbf{M}$ into the convex set: $\lambda_{k}>0,\forall k$ , $\sum_{k}\lambda_{k}\leq C$ . This results in a proximal gradient step: $\mathbf{M}_{t+1}=\mathrm{prox}_{h}\left(\mathbf{M}_{t}-\gamma\nabla q(\mathbf{M}_{t})\right)$ , where $\gamma$ is a chosen step size.

However, this naïve realization of proximal gradient requires eigen-decomposition of $\mathbf{M}_{t}$ per iteration $t$ with complexity $O(n^{3})$ , which is computation-expensive for large $\mathbf{M}_{t}$ . To completely circumvent eigen-decomposition, we rewrite the PD cone constraint as a set of linear constraints, which form another convex set (a polytope) that is much easier to solve. In particular, our strategy is to optimize $\mathbf{M}$ ’s diagonal and off-diagonal entries alternately until convergence. We discuss the two optimizations in order next.

IV-B Optimization of Diagonal Entries

When $\mathbf{M}$ ’s diagonal entries are optimized, (14) can be simplified as follows. Let $\mathbf{g}_{i,j}=\mathbf{f}_{i}-{\mathbf{f}}_{j}$ , $\mathbf{g}_{i,j}\in\mathbb{R}^{K}$ . Further, let matrix $\mathbf{M}^{\prime}$ be $\mathbf{M}$ with only the off-diagonal entries, i.e., $\mathbf{M}^{\prime}=\mathbf{M}-\mathrm{diag}(\mathbf{M})$ . Then,

[TABLE]

IV-B1 Gershgorin-based reformulation

To enforce the positive definiteness constraint $\mathbf{M}\succ 0$ using simple linear constraints, we leverage on the Gershgorin circle theorem (GCT) [32] and constrain each Gershgorin disc $\Psi_{i}$ corresponding to each row $i$ of $\mathbf{M}$ to reside in strictly positive territory. Specifically, disc $\Psi_{i}$ has center $c_{i}=m_{i,i}$ and radius $r_{i}=\sum_{j\neq i}|m_{i,j}|$ . To keep $\Psi_{i}$ in positive territory, we need the left-end $c_{i}-r_{i}$ to be positive. (15) thus becomes:

[TABLE]

We can further simplify the objective by defining $\tilde{d}_{i,j}=\exp\{-\mathbf{g}_{i,j}^{\top}\mathbf{M}^{\prime}\mathbf{g}_{i,j}\}d_{i,j}$ , resulting in

[TABLE]

IV-B2 Proximal Gradient algorithm

To solve (17) efficiently, we employ a proximal gradient (PG) approach [33]. Let $\mathbf{m}=[m_{1,1},...,m_{K,K}]^{\top}$ . The linear constraints for $\mathbf{m}$ form a convex set:

[TABLE]

Then, we define the indicator function $I_{\mathcal{S}}(\mathbf{m})$ :

[TABLE]

We now rewrite the optimization for $\mathbf{m}$ as an unconstrained problem by exchanging the convex set constraint with indicator function $I_{\mathcal{S}}(\mathbf{m})$ in the objective:

[TABLE]

The first term is convex with respect to $\mathbf{m}$ and differentiable, while the second term $I_{\mathcal{S}}(\mathbf{m})$ is convex but non-differentiable. we can thus employ PG to solve (20) as follows.

We first compute the gradient of the first term with respect to $\mathbf{m}$ :

[TABLE]

We next define a proximal mapping $\Pi_{I_{\mathcal{S}}}(\mathbf{v})$ for the second term—indicator function $I_{\mathcal{S}}(\mathbf{v})$ —which is a projection onto the convex set $\mathcal{S}$ , i.e.,

[TABLE]

where $P_{\mathcal{T}}(\mathbf{v})=\max\left\{v_{i},\sum_{j\neq i}|m_{i,j}|\right\}_{i=1}^{K}$ , and $\alpha$ is any positive root of $\mathbf{1}^{\top}P_{\mathcal{T}}(\mathbf{v}-\alpha\cdot\mathbf{1})=C$ [67].

Each iteration of the PG algorithm can be now written as:

[TABLE]

where $t$ is the step size. As discussed in [33], the algorithm will converge with rate $O(1/l)$ for a fixed step size $t^{l}=t\in(0,2/L]$ , where $L$ is a Lipschitz constant that requires computation of the Hessian of $F$ . In our experiment, we choose a small step size $t$ empirically, which is small enough to satisfy the Lipschitz smoothness of the objective function. In the first iteration, we initialize $\mathbf{M}$ to be a diagonal matrix with each diagonal entry $m_{i,i}=C/K$ , thus ensuring $\mathbf{M}$ is PD and $\mathrm{tr}(\mathbf{M})\leq C$ .

Further, we may reduce the complexity of the proposed algorithm via accelerated proximal gradient (APG) [68, 69]. APG is able to accelerate the convergence, by first extrapolating a point from the current point and the previous point and then performing a proximal gradient step. We leave this as our future work.

IV-C Optimization of Off-diagonal Entries

For off-diagonal entries of $\mathbf{M}$ , we develop a block coordinate descent algorithm, which optimizes one row / column at a time.

IV-C1 Block Coordinate Iteration

First, we divide $\mathbf{M}$ into four sub-matrices:

[TABLE]

where $m_{1,1}\in\mathbb{R}$ , $\mathbf{M}_{1,2}\in\mathbb{R}^{1\times(K-1)}$ , $\mathbf{M}_{2,1}\in\mathbb{R}^{(K-1)\times 1}$ and $\mathbf{M}_{2,2}\in\mathbb{R}^{(K-1)\times(K-1)}$ . The assumption that $\mathbf{M}$ is symmetric means $\mathbf{M}_{1,2}=\mathbf{M}_{2,1}^{\top}$ . Our strategy is to optimize one row and column of off-diagonal entries represented by $\mathbf{M}_{2,1}$ in one iteration, given objective and constraints in (14). In the next iteration, a different row and column is selected, and with appropriate rows and columns reordering, the optimization variable $\mathbf{M}_{2,1}$ can still reside in the first row and column as shown in (24).

By the Haynsworth inertia additivity [34], a symmetric real matrix $\mathbf{M}$ is PD if and only if both its sub-matrix $\mathbf{M}_{2,2}$ and its corresponding Schur complement $m_{1,1}-\mathbf{M}_{2,1}^{\top}\mathbf{M}_{2,2}^{-1}\mathbf{M}_{2,1}$ are PD. Hence, we can ensure $\mathbf{M}$ is PD by constraining the Schur complement $m_{1,1}-\mathbf{M}_{2,1}^{\top}\mathbf{M}_{2,2}^{-1}\mathbf{M}_{2,1}$ to be positive, given that matrix $\mathbf{M}$ , and therefore sub-matrix $\mathbf{M}_{2,2}$ , are both PD from the previous iteration.

In the first iteration, we initialize $\mathbf{M}$ to be a diagonal matrix with diagonal entries as optimized in Sec. IV-B. In each subsequent iteration, we impose a positivity constraint on the Schur complement of a submatrix $\mathbf{M}_{2,2}$ as follows:

[TABLE]

Optimization problem (14) thus becomes:

[TABLE]

Given $\mathbf{M}_{2,2}$ is fixed in (26), we can simplify the objective as follows. Writing $\mathbf{M}$ in terms of its four sub-matrices in (24), we simplify the objective via matrix multiplication as

[TABLE]

where the reused notation $\tilde{d}_{i,j}=\exp\{-\mathbf{g}_{i,j}(1)^{2}m_{1,1}-\mathbf{g}_{i,j}(2)^{\top}\mathbf{M}_{2,2}\mathbf{g}_{i,j}(2)\}d_{i,j}$ is a constant as $m_{1,1}$ and $\mathbf{M}_{2,2}$ are fixed in the iteration. $\mathbf{g}_{i,j}(1)$ denotes the first entry in vector $\mathbf{g}_{i,j}$ , and $\mathbf{g}_{i,j}(2)\in\mathbb{R}^{K-1}$ denotes the remaining entries.

IV-C2 $\lambda_{\max}$ -bounded Reformulation

Computation of a large matrix inverse $\mathbf{M}_{2,2}^{-1}$ in (26) per iteration is costly. To avoid computing $\mathbf{M}_{2,2}^{-1}$ , we derive a bound based on the largest eigenvalue $\lambda_{\max}$ of $\mathbf{M}_{2,2}^{-1}$ to ensure the positivity constraint on the Schur complement $m_{1,1}-\mathbf{M}_{2,1}^{\top}\mathbf{M}_{2,2}^{-1}\mathbf{M}_{2,1}$ in (26) is satisfied.

First, since $\mathbf{M}_{2,2}^{-1}$ is a real and symmetric PD matrix, it admits diagonalization with eigen-matrix $\mathbf{U}$ (eigenvectors as columns) and diagonal matrix $\bm{\Lambda}$ with real eigenvalues $0<\lambda_{1}\leq\ldots\leq\lambda_{K}$ along its diagonal. Hence,

[TABLE]

which is essentially scaling the $l_{2}$ -norm of $\mathbf{U}^{\top}\mathbf{M}_{2,1}$ by eigenvalues in $\bm{\Lambda}$ . Hence, a sufficient condition to the first constraint in (26) is to bound with the maximum eigenvalue $\lambda_{\text{max}}$ of $\mathbf{M}_{2,2}^{-1}$ :

[TABLE]

$\lambda_{\text{max}}$ is the reciprocal of the minimum eigenvalue $\theta_{\text{min}}$ of $\mathbf{M}_{2,2}$ , i.e., $\lambda_{\text{max}}=\frac{1}{\theta_{\text{min}}}$ . We employ Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [70] to calculate $\theta_{\text{min}}$ , which is efficient to compute the extreme eigen-pairs of a large sparse matrix with linear convergence.

Having computed $\theta_{\min}$ from $\mathbf{M}_{2,2}$ without eigen-decomposition, we can reformulate (26) as

[TABLE]

Next, we design an efficient algorithm to address (30).

IV-C3 PG Algorithm

In each iteration of the block descent algorithm, we hold the diagonal entries fixed and optimize $\mathbf{M}_{2,1}$ in (30).

The objective is convex while the lone constraint for $\mathbf{M}_{2,1}$ forms a convex set. In particular, the constraint in (30) reduces to

[TABLE]

which is a $(K-1)$ -dimensional norm ball with radius $\sqrt{\theta_{\text{min}}m_{1,1}}$ . We now define an indicator function $I_{\mathcal{S}^{\prime}}(\mathbf{M}_{2,1})$ :

[TABLE]

We can rewrite the optimization for $\mathbf{M}_{2,1}$ as an unconstrained problem by exchanging the convex set constraint with the indicator function $I_{\mathcal{S}^{\prime}}(\mathbf{M}_{2,1})$ in the objective:

[TABLE]

The first term is convex with respect to $\mathbf{M}_{2,1}$ and differentiable, while the second term $I_{\mathcal{S}^{\prime}}(\mathbf{M}_{2,1})$ is convex but non-differentiable. Hence, we employ again the PG algorithm to solve (33).

Specifically, we first compute the gradient of the first term $F$ in the objective with respect to $\mathbf{M}_{2,1}$ as

[TABLE]

which will be adopted in the step of gradient descent. We define a proximal mapping $\Pi_{I_{\mathcal{S}^{\prime}}}(\mathbf{v})$ for $\mathbf{v}\in\mathbb{R}^{(K-1)\times 1}$ , which is a projection onto the norm ball with radius $\sqrt{\theta_{\text{min}}m_{1,1}}$ :

[TABLE]

Then each iteration in the PG algorithm can be written as

[TABLE]

where $t$ is the step size as discussed in Sec. IV-B. We may also deploy APG to reduce the complexity further.

Finally, we analyze the convergence of our algorithm. The proposed alternating optimization algorithm optimizes diagonal and off-diagonal terms in $\mathbf{M}$ in turn. When computing a solution for diagonal or off-diagonal terms, we adopt the newly computed solution only if the objective strictly decreases. Further, using an exponential kernel to compute edge weights means that any metric $\mathbf{M}$ would always result in non-negative edge weights. This results in a PSD graph Laplacian $\mathbf{L}$ [66], and our GLR objective (14) is lower-bounded by [math]. Hence, our algorithm strictly decrements an objective iteratively that is lower-bounded by [math], and our algorithm converges to a locally optimal solution.

V Feature Metric Learning for Point Cloud Denoising

Having described our feature graph learning scheme, we now employ it for 3D point cloud denoising. We first propose a patch-based model and designate graph connectivities over similar patches in a neighborhood. Then we formulate an inverse problem for point cloud denoising with GLR for regularization, which can be interpreted alternatively assuming an IGMRF model. Finally, we develop an alternating algorithm to efficiently solve the formulated problem.

V-A Patch-based Model

We assume an additive noise model for a point cloud, namely

[TABLE]

where $\mathbf{P}\in\mathbb{R}^{N\times 3}$ denotes the observed noise-corrupted 3D coordinates of the target point cloud with $N$ points, $\mathbf{Y}\in\mathbb{R}^{N\times 3}$ is the ground truth 3D coordinates of the point cloud, and $\mathbf{E}\in\mathbb{R}^{N\times 3}$ is an additive noise.

Modeling of $\mathbf{E}$ depends on the actual point cloud acquisition mechanism. There exist a wide range of point cloud acquisition systems at different price points—from consumer-level depth sensors like Intel RealSense444https://www.intelrealsense.com/ costing 150 USD to high-end outdoor scanners like Teledyne Optech555https://www.teledyneoptech.com/en/products/static-3d-survey/ that cost up to 250,000 USD—and defining accurate noise models for all of them is difficult. We thus select the most common Gaussian noise model, which has been shown to be reasonably accurate for popular depth cameras like Microsoft Kinect [71, 72]. Hence, $\mathbf{E}$ in (37) represents zero-mean additive white Gaussian noise (AWGN) with standard deviation $\sigma$ , i.e.,

[TABLE]

In image denoising, a simple and effective assumption is self-similarity (also known as nonlocal means (NLM)) [36, 58, 37]: similar pixel patches exist throughout the same image, which can be searched and gathered for joint denoising. To exploit self-similarity also in point clouds, we also divide $\mathbf{P}$ into patches. However, defining inter-patch similarity in a point cloud is not straightforward, because a point cloud is a collection of irregularly sampled points in 3D space666[41] formally defined patch similarity based on projections on parallel planes, but it is computation-intensive.. Instead, our work circumvents explicit search for similar patches.

Mimicking a fixed-size image patch with a center pixel, we first define a patch $\mathbf{V}_{i}$ in a point cloud $\mathbf{P}$ as done in [41]:

Definition 1. A patch $\mathbf{V}_{i}$ in a point cloud is a local point set of $k+1$ points, consisting of a center point $\mathbf{c}_{i}\in\mathbb{R}^{3}$ and its $k$ -nearest-neighbors in terms of Euclidean distance.

We divide the input point cloud $\mathbf{P}$ into a set of $M$ overlapping patches, each with a center point $\mathbf{c}_{i}\in\mathbf{P}$ . The patch centers are selected from a subset of points in $\mathbf{P}$ . Intuitively, uniformly distributed center points are preferred, since they efficiently cover the entire point cloud. Hence, we employ the uniform sampling method in [73]. $M$ is empirically set, where $M\leq N$ and $(k+1)M\geq N$ .

Next, for each center point $\mathbf{c}_{i}$ we construct a patch by identifying $\mathbf{c}_{i}$ ’s $k$ nearest neighbors. To exploit the assumed similarity among patches for denoising, we align patches via translation so each patch has its center at the origin. This results in patch set $\mathbf{V}\in\mathbb{R}^{(k+1)M\times 3}$ :

[TABLE]

where $\mathcal{T}\in\{0,1\}^{(k+1)M\times N}$ is a selection matrix to choose points from $\mathbf{Y}$ to form $M$ patches, each with $k+1$ points. Specifically, each row in $\mathcal{T}$ contains only 0s except one 1 to choose one point in $\mathbf{Y}$ . $\mathbf{C}=\{\mathbf{c}_{i}\}_{i=1}^{M}\in\mathbb{R}^{(k+1)M\times 3}$ denotes the coordinates of patch centers.

V-B Proposed Graph Connectivity

We connect two adjacent patches as follows. Two patches are considered adjacent if their centers are $k$ -nearest neighbors. Specifically, we use the $k$ -nearest-neighbor ( $k$ NN) algorithm to search the nearest $\varepsilon$ patches of each patch as the neighbors based on the Euclidean distance between patch centers. Then we build a graph over each pair of adjacent patches. Overall, this leads to a $\varepsilon$ -nearest-patch graph on the entire point cloud.

Each point in one patch is then connected to its corresponding point in the other patch. For simplicity, we treat a pair of points in adjacent patches as corresponding points if their coordinates relative to their respective centers are closest to each other. Namely, for each point $\mathbf{p}_{i}\in\mathbf{V}_{s}$ , we search the corresponding nearest point $\mathbf{p}_{j}\in\mathbf{V}_{t}$ , which has the smallest Euclidean distance to $\mathbf{p}_{i}$ :

[TABLE]

We do not explicitly connect points within the same patch, though two points in a patch may nonetheless be connected because they are also corresponding points in two different patches due to patch overlaps, as shown in Fig. 1.

We note that this is one possible graph connectivity among many. For example, one can in addition enable intra-patch filtering by drawing connections among points in the same patch [61, 62], resulting in a denser graph. For simplicity, we employ the most basic graph connectivity given inter-patch similarities. More sophisticated graph constructions considering also intra-patch filtering is left for future work.

Next, we formulate the problem of point cloud denoising via MAP estimation based on the chosen graph connectivities.

V-C MAP Formulation for Point Cloud Denoising

We pose a MAP estimation problem for the underlying patches $\mathbf{V}$ : given the observed patches $\hat{\mathbf{V}}$ , find the most probable signal $\mathbf{V}$ ,

[TABLE]

where $f(\hat{\mathbf{V}}\mid\mathbf{V})$ is the likelihood function, and $g(\mathbf{V})$ is the prior probability distribution of $\mathbf{V}$ .

V-C1 Likelihood function

Since patches are extracted from the observed point cloud, $f(\hat{\mathbf{V}}\mid\mathbf{V})$ is equivalent to $f(\mathbf{P}\mid\mathbf{Y})$ . We thus define the likelihood function according to the additive Gaussian noise model in (37) and (38):

[TABLE]

where $\alpha=1/(2\sigma^{2})$ is a parameter.

V-C2 Prior probability distribution

We employ signal-dependent GLR as the prior, namely,

[TABLE]

where $\beta$ is a parameter, and $\mathbf{M}$ is the Mahalanobis distance matrix satisfying constraints in (14). This prior essentially enforces the smoothness of patch signals $\mathbf{V}$ with respect to the underlying graph $\mathbf{L}$ .

We provide an alternative interpretation of GLR next. Specifically, we model the self-similarity among chosen patches in point clouds via first-order IGMRFs on irregular lattices. The definition of IGMRF is as follows [38]:

Definition 2. Let $\mathbf{Q}$ be an $n\times n$ symmetric positive semi-definite matrix with rank $n-r>0$ . Then $\mathbf{x}=(x_{1},...,x_{n})^{\top}$ is an intrinsic GMRF of order $r\geq 0$ with parameters $(\mathbf{\mu},\mathbf{Q})$ , if its density is

[TABLE]

where $|\cdot|^{*}$ denotes the generalized determinant (the product of non-zero eigenvalues). An intrinsic GMRF of order $r\geq 0$ is also known as an improper GMRF of rank $n-r$ .

Further, $\mathbf{x}$ is an intrinsic GMRF with respect to a graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ , where

[TABLE]

In our context, we only consider the case where the graph $\mathcal{G}$ is loopless and connected, leading to first-order IGMRF modelling of point clouds (i.e., $r=1$ ) [26].

Specifically, for patches translated to the origin $\mathbf{V}=\mathcal{T}\mathbf{Y}-\mathbf{C}$ , we model the difference between corresponding points $\mathbf{p}_{i}$ and $\mathbf{p}_{j}$ in adjacent patches. As the coordinates along each axis are independent, we consider each component separately. Taking the $x$ -axis coordinate $x_{i}$ and $x_{j}$ for instance, we define a normal increment

[TABLE]

where $\kappa$ is a precision parameter, and $w_{i,j}$ is a positive and symmetric weight we incorporate for each pair of neighboring nodes $i$ and $j$ . The joint density then becomes

[TABLE]

As derived in [38], the corresponding precision matrix has the following form:

[TABLE]

Hence, under the first-order IGMRF model on irregular lattices where we assume normal increments between corresponding points, the prior distribution of $\mathbf{V}$ is

[TABLE]

where $\beta=(2\pi)^{-\frac{n-1}{2}}(|\mathbf{Q}|^{*})^{\frac{1}{2}}$ . Here $\mathbf{Q}$ has the specific form as in (48).

Further, comparing the specific form of $\mathbf{Q}$ in (48) and the definition of the combinatorial graph Laplacian $\mathbf{L}$ in Sec. III-A, we have

[TABLE]

Hence, we replace $\mathbf{Q}$ in the quadratic term in (49) with $\mathbf{L}$ and consider all the three components, which leads to the GLR prior in (43) and thus an alternative perspective of GLR under first-order IGMRF.

V-C3 Final formulation for point cloud denoising

Combining (39), (41), (42) and (43), we have

[TABLE]

We rewrite the objective by taking the logarithm of (51) and multiplying by $-1$ . Also, with the constraints of $\mathbf{M}$ considered, we have the final problem formulation for point cloud denoising

[TABLE]

where $\gamma=\beta/\alpha$ .

Next, we develop an alternating algorithm to solve (52).

V-D Proposed Algorithm for Point Cloud Denoising

We propose to address (52) by alternately optimizing the point cloud $\mathbf{Y}$ and the Mahalanobis distance matrix $\mathbf{M}$ . The iterations terminate when the difference in the objective between two consecutive iterations stops decreasing. Parameter settings are presented in Section VI.

V-D1 Optimizing the point cloud $\mathbf{Y}$

In the first iteration, we initialize $\mathbf{M}$ with an identity matrix and thus fix $\mathbf{L}$ in (52). Next, taking the derivative of (52) with respect to the three components of $\mathbf{Y}$ , $\{\mathbf{Y}_{x},\mathbf{Y}_{y},\mathbf{Y}_{z}\}$ , we have

[TABLE]

where $\mathbf{I}$ is an identity matrix. (53) can be treated as three linear equation sets and thus can be efficiently solved using conjugate gradient (CG) methods, such as the LSQR algorithm [74]. The acquired solution of $\mathbf{Y}$ is then used to update $\mathbf{M}$ in the subsequent iteration.

V-D2 Optimizing the Mahalanobis distance matrix $\mathbf{M}$

When $\mathbf{Y}$ is fixed, the optimization of $\mathbf{M}$ is feature metric learning in Sec. IV. $\mathbf{L}$ is then computed from optimized $\mathbf{M}$ by definition and our edge weight kernel.

Specifically, we consider two features: Cartesian coordinates and normals. As done in [61, 62], surface normals are adopted to promote smoothness of the underlying surface on which the point clouds are discrete samples. Along with coordinates, we thus form a $6$ -dimensional feature vector at each point $i$ , i.e., $\mathbf{f}_{i}=[x_{i},y_{i},z_{i},n_{x}^{i},n_{y}^{i},n_{z}^{i}]^{\top}$ , where $[x_{i},y_{i},z_{i}]$ denotes the coordinates of point $i$ , and $[n_{x}^{i},n_{y}^{i},n_{z}^{i}]$ denotes its normal vector. Together with one observation $\mathbf{P}$ with coordinates $\{\mathbf{p}_{i}\}_{i=1}^{N}$ , these per-node feature vectors $\mathbf{f}_{i}$ are used for feature metric learning of matrix $\mathbf{M}$ as described in Section IV.

The inter-node sample difference square $d_{i,j}$ in (14) now denotes the squared Euclidean distance between the coordinates of $\mathbf{p}_{i}$ and $\mathbf{p}_{j}$ , namely,

[TABLE]

where $i\sim j$ denotes $\mathbf{p}_{i}$ and $\mathbf{p}_{j}$ are corresponding points that are connected.

A flowchart of the proposed feature graph learning for point cloud denoising is demonstrated in Fig. 2, and an algorithmic summary is presented in Algorithm 1.

VI Experimental Results

VI-A Experimental Setup

We evaluate feature graph learning for point cloud denoising by testing on several point cloud datasets:

Face model [50] and Iron Vise model [51], which are raw scans from laser scanners exhibiting real-world noise but without ground truth;
the surface reconstruction benchmark models including five point clouds [75] and Fandisk model [37], which are clean point clouds as ground truth. We add Additive White Gaussian Noise (AWGN) to benchmark with a range of standard deviation $\sigma=\{0.02,0.03,0.04,0.05,0.10\}$ for extensive objective comparison. AWGN with $\sigma=0.005$ is added to Fandisk for subjective comparison. For the choice of $C$ in (14), we first select a median value of $3$ , and search around this value at a step size of 0.1 to obtain good results for each dataset. We observe that the denoising performance of our algorithm is relatively insensitive to $C$ experimentally.

We compare the proposed approach with seven competing point cloud denoising algorithms, including two MLS-based methods APSS [47] and RIMLS [48], one LOP-based method AWLOP [51], one sparsity based method MRPCA [39], two non-local based methods NLD [55] and LR [40], and one graph-based method GLR [41]. The implementation details are as follows. We employ the toolbox of the MeshLab software [76] to run APSS and RIMLS. We try each filter scale in the range $[3,10]$ and choose one with the best result. For AWLOP, we use its function in the EAR software [51], and choose the repulsion force in $\{0.3,0.4,0.5\}$ and the filter iteration in the range $[2,5]$ to acquire one best solution. The source codes of MRPCA and GLR are provided by the authors. We try data fitting iterations in the range $[2,6]$ for MRPCA, and follow the parameter settings in [41] for GLR. We implement NLD and LR in MATLAB, and follow the default settings in their papers. The Diagonal approach is implemented by ourselves in MATLAB.

Further, to validate the necessity of optimizing edge weights, we compare against two Baseline schemes of our method: 1) Baseline1, where the edge weights are randomly set in range $[0,1]$ instead of optimizing via feature graph learning; 2) Baseline2, where the edge weights are calculated from feature vectors using an exponential kernel.

VI-B Experimental Results

VI-B1 Demonstration of Iterations

To show the fast convergence speed of our algorithm, we demonstrate the denoising results of Iron Vise model in every two iterations. We set the weighting parameter of GLR $\gamma=0.2^{i}\cdot(e-1)^{1-i}$ in (52), where $e$ is the natural logarithmic base and $i$ is the iteration index starting from $1$ . $\gamma$ decreases with iterations so as to prevent over-smoothing. As presented in Fig. 3, as the number of iterations increases, the point cloud gradually becomes smoother, until it almost converges at iteration 7.

VI-B2 Objective Comparison

We measure the quality of denoised results for Benchmark models by the Mean Squared Error (MSE) and Signal-to-Noise Ratio (SNR) between each denoised point cloud and the ground truth as in [62]. Numerical results are listed in Table III and Table IV, respectively.

Both tables show that our method outperforms all the other competing approaches at various noise levels in general, especially at high noise level $\sigma=0.04$ . Also, we outperform the Baseline scheme with random weights, which validates optimizing edge weights is essential.

Further, to demonstrate the case of learning from partial observation of one signal, we randomly sample a subset of points ( $20\%$ ) in each point cloud for the learning of the feature metric $\mathbf{M}$ at noise level $\sigma=0.05$ . On one hand, we achieve comparable results with learning from the entire single observation as listed in Table III and Table IV. On the other hand, we compare with the diagonal-only method in [27], where only a diagonal feature metric is learned from $20\%$ of points. Results show that we still outperform [27] in the circumstance of learning feature metric from partial observation. This validates the effectiveness of our method even when extending to the case of partial observation of one signal.

VI-B3 Comparison with Vanilla Proximal Gradient

Moreover, we compare with the naïve realization of proximal gradient Vanilla PG as discussed in Section IV-A to solve (14). As presented in Table V, while our proposed algorithm approximates the original search space in Vanilla PG by rewriting the PD constraint, our point cloud denoising results are very close to the performance by Vanilla PG. This validates the effectiveness of our optimization approximation.

VI-B4 Subjective Comparison

Fig. 4 shows visual results of Quasimoto model in Benchmark without surface reconstruction for details. We compare with other denoising approaches at noise level $\sigma=0.04$ . It can be seen that our results preserve structural details well, even for tiny components such as the cigarette in Fig. 4. In comparison, the cigarette is distorted in all the other reconstruction results. Also, points in our results are more uniformly distributed than the others, even for noise with large variance.

In Fig. 5, we add AWGN to Fandisk model with standard deviation $\sigma=0.005$ . We see that our method preserves sharp features, while the other approaches result in smoothed edges to various extent. Meanwhile, our method reconstructs smooth surfaces well, such as the bottom surface as presented in the first row of Fig. 5.

Finally, we evaluate on real-world noisy point clouds Iron Vise and Face acquired from laser scanners. Typical imperfections associated with digital scans, such as noise, non-uniform point distribution, or missing data, are ubiquitous in these datasets. As shown in Fig. 3 and Fig. 6, our method is able to keep local details and sharp edges while attenuating noise significantly.

VI-C Limitation

The limitation of the proposed algorithm includes two aspects.

•

The optimal choice of the parameter $C$ in (14)—the upper bound of the trace of the feature metric $\mathbf{M}$ —is not obvious. We choose it empirically in the experiments, and found that our algorithm’s denoising performance is relatively insensitive to $C$ experimentally.

•

The requirement of having a complete feature vector per graph node may not be practical for some applications.

VII Conclusion

We study feature graph learning to identify an appropriate underlying graph given a single signal observation. Assuming the availability of relevant features per node, we formulate the problem as minimization of Graph Laplacian Regularizer using the Mahalanobis distance matrix $\mathbf{M}$ as variable. We develop a fast algorithm to alternately optimize diagonal and off-diagonal entries of $\mathbf{M}$ , while keeping $\mathbf{M}$ positive definite. We mitigate full matrix eigen-decomposition and large matrix inverse for fast computation. To validate the effectiveness of the proposed feature graph learning, we employ it for 3D point cloud denoising with 3D coordinates and surface normals as features, leading to state-of-the-art performance.

Bibliography76

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] G. B. Giannakis, Y. Shen, and G. V. Karanikolas, “Topology identification and learning over graphs: Accounting for nonlinearities and dynamics,” Proceedings of the IEEE , vol. 106, no. 5, pp. 787–807, May 2018.
2[2] X. Dong, D. Thanou, M. Rabbat, and P. Frossard, “Learning graphs from data: A signal representation perspective,” IEEE Signal Processing Magazine , vol. 36, no. 3, pp. 44–63, 2019.
3[3] G. Mateos, S. Segarra, A. G. Marques, and A. Ribeiro, “Connecting the dots: Identifying network structure via graph signal processing,” IEEE Signal Processing Magazine , vol. 36, no. 3, pp. 16–43, May 2019.
4[4] J. Pang and G. Cheung, “Graph laplacian regularization for image denoising: Analysis in the continuous domain,” IEEE Transactions on Image Processing (TIP) , vol. 26, no. 4, pp. 1770–1785, 2017.
5[5] X. Liu, G. Cheung, X. Wu, and D. Zhao, “Random walk graph laplacian-based smoothness prior for soft decoding of jpeg images,” IEEE Transactions on Image Processing (TIP) , vol. 26, no. 2, pp. 509–524, 2016.
6[6] Y. Bai, G. Cheung, X. Liu, and W. Gao, “Graph-based blind image deblurring from a single photograph,” IEEE Transactions on Image Processing (TIP) , vol. 28, no. 3, pp. 1404–1418, 2019.
7[7] X. Liu, G. Cheung, X. Ji, D. Zhao, and W. Gao, “Graph-based joint dequantization and contrast enhancement of poorly lit jpeg images,” IEEE Transactions on Image Processing (TIP) , vol. 28, no. 3, pp. 1205–1219, 2019.
8[8] N. Meinshausen and P. Bühlmann, “High-dimensional graphs and variable selection with the lasso,” The Annals of Statistics , vol. 34, no. 3, pp. 1436–1462, 2006.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Feature Graph Learning for 3D Point Cloud Denoising

Abstract

Index Terms:

I Introduction

II Related Work

II-A Graph Learning

II-B Point Cloud Denoising

III Background on Spectral Graph Theory

III-A Graph and Graph Laplacian

III-B Graph Laplacian Regularizer

III-C Signal-Dependent Graph Laplacian Regularizer

IV Feature Metric Learning

IV-A Problem Formulation

IV-A1 Importance of Off-diagonal Terms in M\mathbf{M}M

IV-A2 Formulation

IV-B Optimization of Diagonal Entries

IV-B1 Gershgorin-based reformulation

IV-B2 Proximal Gradient algorithm

IV-C Optimization of Off-diagonal Entries

IV-C1 Block Coordinate Iteration

IV-C2 λmax⁡\lambda_{\max}λmax​-bounded Reformulation

IV-C3 PG Algorithm

V Feature Metric Learning for Point Cloud Denoising

V-A Patch-based Model

V-B Proposed Graph Connectivity

V-C MAP Formulation for Point Cloud Denoising

V-C1 Likelihood function

V-C2 Prior probability distribution

V-C3 Final formulation for point cloud denoising

V-D Proposed Algorithm for Point Cloud Denoising

V-D1 Optimizing the point cloud Y\mathbf{Y}Y

V-D2 Optimizing the Mahalanobis distance matrix M\mathbf{M}M

VI Experimental Results

VI-A Experimental Setup

VI-B Experimental Results

VI-B1 Demonstration of Iterations

VI-B2 Objective Comparison

VI-B3 Comparison with Vanilla Proximal Gradient

VI-B4 Subjective Comparison

VI-C Limitation

VII Conclusion

IV-A1 Importance of Off-diagonal Terms in $\mathbf{M}$

IV-C2 $\lambda_{\max}$ -bounded Reformulation

V-D1 Optimizing the point cloud $\mathbf{Y}$

V-D2 Optimizing the Mahalanobis distance matrix $\mathbf{M}$