TL;DR
This paper introduces a non-convex regularization approach for denoising vector-valued graph signals, demonstrating improved recovery performance and providing theoretical error bounds and an efficient algorithm.
Contribution
It extends graph trend filtering to vector-valued signals with non-convex penalties, offering better denoising and recovery capabilities along with convergence guarantees.
Findings
Non-convex regularizers outperform convex ones in denoising tasks.
The proposed ADMM algorithm converges reliably.
Numerical experiments validate improved performance on real-world data.
Abstract
This work studies the denoising of piecewise smooth graph signals that exhibit inhomogeneous levels of smoothness over a graph, where the value at each node can be vector-valued. We extend the graph trend filtering framework to denoising vector-valued graph signals with a family of non-convex regularizers, which exhibit superior recovery performance over existing convex regularizers. Using an oracle inequality, we establish the statistical error rates of first-order stationary points of the proposed non-convex method for generic graphs. Furthermore, we present an ADMM-based algorithm to solve the proposed method and establish its convergence. Numerical experiments are conducted on both synthetic and real-world data for denoising, support recovery, event detection, and semi-supervised classification.
| Symbol | Description | Dimension |
|---|---|---|
| oriented incidence matrix | ||
| th order graph difference operator | ||
| scalar-valued graph signal | ||
| vector-valued graph signal | ||
| noisy observation of | ||
| noisy observation of | ||
| -th row of | ||
| -th column of | ||
| spectral norm of |
| eigen decomposition | ||
|---|---|---|
| initialization | ||
| initialization | ||
| update | ||
| calculation | ||
| update | ||
| Total after iterations |
| Average | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | |
|---|---|---|---|---|---|---|---|---|---|
| Input SNR (dB) | 8.7 | -14 | 0 | 0 | 3.5 | 5.8 | 12 | 29 | 34 |
| Vector-GTF + | 29 | 10 | 20 | 23 | 26 | 36 | 37 | 39 | 38 |
| Scalar-GTF + | 21 | 0 | 11 | 13 | 16 | 18 | 26 | 41 | 45 |
| Vector-GTF + SCAD | 32 | 10 | 20 | 22 | 25 | 36 | 35 | 49 | 61 |
| Scalar-GTF + SCAD | 29 | 0 | 15 | 17 | 25 | 35 | 34 | 47 | 60 |
| Vector-GTF + MCP | 32 | 10 | 20 | 22 | 25 | 36 | 35 | 49 | 61 |
| Scalar-GTF + MCP | 29 | 0 | 15 | 22 | 24 | 30 | 33 | 49 | 60 |
| Heart | Wine quality | Wine | Iris | Breast | Car | ||
| # of samples () | 303 | 1599 | 178 | 150 | 569 | 1728 | |
| # of classes () | 2 | 6 | 3 | 3 | 2 | 4 | |
| 0.148 | 0.346 | 0.038 | 0.036 | 0.042 | 0.172 | ||
| SCAD p-value | 0.148 | 0.353 | 0.038 | 0.033 | 0.042 | 0.149 | |
| 1. | 0.06 | 1. | 0.27 | 1. | 0.06 | ||
| MCP p-value | 0.144 | 0.351 | 0.037 | 0.035 | 0.040 | 0.148 | |
| 0.23 | 0.18 | 0.34 | 0.34 | 0.35 | 0.05 | ||
| 0.143 | 0.351 | 0.034 | 0.039 | 0.035 | 0.104 | ||
| SCAD p-value | 0.144 | 0.350 | 0.034 | 0.039 | 0.035 | 0.104 | |
| 0.30 | 0.43 | 0.34 | 1. | 0.71 | 0.66 | ||
| MCP p-value | 0.146 | 0.350 | 0.034 | 0.039 | 0.034 | 0.103 | |
| 0.05 | 0.44 | 0.34 | 1. | 0.02 | 0.23 | ||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Vector-Valued Graph Trend Filtering with Non-Convex Penalties
Rohan Varma, , Harlin Lee, , Jelena Kovačević, , and
Yuejie Chi R. Varma, H. Lee and Y. Chi are with the Dept. of Electrical and Computer Engineering, Carnegie Mellon University. Emails:{rohanv, harlinl, yuejiec}@andrew.cmu.edu.J. Kovačević is with the Tandon School of Engineering, New York University. Email: [email protected]. This work is supported in part by NSF under grants CCF-1563918, CCF-1826519, CCF-1806154 and ECCS-1818571, by ONR under grant N00014-18-1-2142, by ARO under grant W911NF-18-1-0303, and by NIH under grant R01EB025018. A preliminary version of partial results in this paper was presented at the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing [1].The first two authors contributed equally.
Abstract
This work studies the denoising of piecewise smooth graph signals that exhibit inhomogeneous levels of smoothness over a graph, where the value at each node can be vector-valued. We extend the graph trend filtering framework to denoising vector-valued graph signals with a family of non-convex regularizers, which exhibit superior recovery performance over existing convex regularizers. Using an oracle inequality, we establish the statistical error rates of first-order stationary points of the proposed non-convex method for generic graphs. Furthermore, we present an ADMM-based algorithm to solve the proposed method and establish its convergence. Numerical experiments are conducted on both synthetic and real-world data for denoising, support recovery, event detection, and semi-supervised classification.
Index Terms:
graph signal processing, graph trend filtering, semi-supervised classification, non-convex optimization
I Introduction
Signal estimation from noisy observations is a classic problem in signal processing and has applications in signal inpainting, collaborative filtering, recommendation systems and other large-scale data completion problems. Since noise can have deleterious, cascading effects in many downstream tasks, being able to efficiently and accurately filter and reconstruct a signal is of significant importance.
With the explosive growth of information and communication, signals are generated at an unprecedented rate from various sources, including social networks, citation networks, biological networks, and physical infrastructure [2]. Unlike time series or images, these signals lie on complex, irregular structures, and require novel processing techniques, leading to the emerging field of graph signal processing [3]-[5]. This framework models the structure by a graph and generalizes concepts and tools from classical discrete signal processing to graph signal processing. The associated graph-structured data are referred to as graph signals.
In graph signal processing, a common assumption is that the graph signal is smooth with respect to the graph, that is, the signal coefficients do not vary much over local neighborhoods of the graph. However, this characterization is insufficient for many real-world signals that exhibit spatially inhomogeneous levels of smoothness over the graph. In social networks for example, within a given community or social circle, users’ profiles tend to be homogeneous, while within a different social circle they will be of different, yet still have homogeneous values. Consequently, the signal is often characterized by large variations between regions and small variations within regions such that there are localized discontinuities and patterns in the signal. As a result, it is necessary to develop representations and algorithms to process and analyze such piecewise smooth graph signals.
In this work, we study the denoising of the class of piecewise smooth graph signals (including but not limited to piecewise constant graph signals), which is complementary to the class of smooth graph signals that exhibit homogeneous levels of smoothness over the graph. The reconstruction of smooth graph signals has been well studied in previous work both within graph signal processing [4]-[10] as well as in the context of Laplacian regularization [11, 12].
The Graph Trend Filtering (GTF) framework [13], which applies total variation denoising to graph signals [14], is a particularly flexible and attractive approach that regularizes discrete graph differences using the norm. Although the norm based regularization has many attractive properties [15], the resulting estimates are biased toward zero for large coefficients. To alleviate this bias effect, non-convex penalties such as the Smoothly Clipped Absolute Deviation (SCAD) penalty [16] and the Minimax Concave Penalty (MCP) [17] have been proposed as alternatives. These penalties behave similarly to the norm when the signal coefficients are small, but tend to a constant when the signal coefficients are large. Notably, they possess the so-called oracle property: in the asymptotics of large dimension, they perform as well as the case where we know in advance the support of the sparse vectors [18]-[22].
In this work, we strengthen the GTF framework in [13] by considering a large family of possibly non-convex regularizers, including SCAD and MCP that exhibit superior reconstruction performance over minimization for the denoising of piecewise smooth graph signals. Furthermore, we extend the GTF framework to allow vector-valued signals, e.g. time series [23], on each node of the graph, which greatly broadens the applicability of GTF to applications in social networks [24], gene networks, and semi-supervised classification [25]-[27]. Through theoretical analyses and empirical performance, we demonstrate that the use of non-convex penalties improves the performance of GTF in terms of both reduced reconstruction error and improved support recovery, i.e. how accurately we can localize the discontinuities of the piecewise smooth signals. Our contributions can be summarized as follows:
- •
Theoretically, we derive the statistical error rates of the signal estimates, defined as first-order stationary points of the proposed GTF estimator. We derive the rates in terms of the noise level and the alignment of the ground truth signal with respect to the underlying graph, without making assumptions on the piecewise smoothness of the ground truth signal. The better the alignment, the more accurate the estimates. Importantly, the estimators do not need to be the global minima of the proposed non-convex problem, which are much milder requirements and important for the success of optimization. For denoising vector-valued signals, the GTF estimate is more accurate when each dimension of the signal shares similar patterns across the graph.
- •
Algorithmically, we propose an ADMM-based algorithm that is guaranteed to converge to a critical point of the proposed GTF estimator.
- •
Empirically, we demonstrate the performance improvements of the proposed GTF estimators with non-convex penalties on both synthetic and real data for signal estimation, support recovery, event detection, and semi-supervised classification.
The rest of this paper is organized as follows. Section II reviews related works and their relationships to the current paper. In Section III, we provide some background and definitions on graph signal processing and GTF. Section IV presents the proposed GTF framework with non-convex penalties and vector-valued graph signals. Section V develops its performance guarantees, and Section VI presents an efficient algorithm based on ADMM. Numerical performance of the proposed approach is examined on both synthetic and real-world data for denoising and semi-supervised classification in Section VII. Finally, we conclude in Section VIII and briefly discuss future work.
Throughout this paper, we use boldface letters and to represent vectors and matrices respectively. The transpose of is denoted as . The -th row of a matrix is denoted as , and the -th column of a matrix is denoted as . The cardinality of a set is denoted as . For any set and , we denote such that if and only if for . Similarly, we define a submatrix of that corresponds to pulling out the rows of indexed by . The norm of a vector is defined as , and the spectral norm of a matrix is defined as . The pseudo-inverse of a matrix is defined as . For a function , we write to denote the gradient or subdifferential of , if they exist, evaluated at . When the intention is clear, this may be written concisely as . We also follow the standard asymptotic notations. If for some constants , for all , then ; if , then . Finally, Table I summarizes some key notations used in this paper for convenience.
II Related Work and Connections
Estimators that adapt to spatial inhomogeneities have been well studied in the literature via regularized regression, total variation and splines [28]-[30]. Most of these methods involve locating change points or knots that denote a distinct change in the behavior of the function or the signal.
Our work is most related to the spatially adaptive GTF estimator introduced in [13] that smoothens or filters noisy signals to promote piecewise smooth behavior with respect to the underlying graph structure; see also [31]. In the same spirit as [28], the fused LASSO and univariate trend filtering framework developed in [14, 32, 33] use discrete difference operators to fit a time series signal using piecewise polynomials. The GTF framework generalizes univariate trend filtering by generalizing a path graph to arbitrarily complex graphs. Specifically, by appropriately defining the discrete difference operator, we can enforce piecewise constant, piecewise linear, and more generally piecewise polynomial behaviors over the graph structure. In comparison to previous work [13], in this paper, we have significantly expanded its scope by allowing vector-valued data over the graph nodes and a broader family of possibly non-convex penalties.
We note that while a significant portion of the relevant literature on GTF or the fused LASSO has focused on the sparsistency or support recovery conditions under which we can ensure the recovery of the location of the discontinuities or knots [34, 35], in this work, we study the asymptotic error rates of our estimator with respect to the mean squared error. Our analysis of error rates leverages techniques in [36, 37] that result in sharp error rates of total variation denoising via oracle inequalities, which we have carefully adapted to allow non-convex regularizers. The obtained error rates can be translated into bounds on support recovery or how well we can localize the boundary by leveraging techniques in [38].
Employing a graph-based regularizer that promotes similarities between the signal values at connected nodes has been investigated by many communities, such as graph signal processing, machine learning, applied mathematics, and network science. The Network LASSO proposed in [24], which is similar to the GTF framework with multi-dimensional or vector-valued data, focused on the development of efficient algorithms without any theoretical guarantees. The recent works by Jung et al. [26, 27, 39] have analyzed the performance of Network LASSO for semi-supervised learning when the graph signal is assumed to be clustered according to the labels using the network null space property and the network compatibility condition inspired by related concepts in compressed sensing [40]. In contrast, our analysis does not make assumptions on the graph signal, and the error rate is adaptive to the alignment of the signal and the graph structure used in denoising.
A well-studied generalization of the sparse linear inverse problem is when there are multiple measurement vectors (MMV), and the solutions are assumed to have a common sparsity pattern [41]-[43]. Sharing information across measurements, and thereby exploiting the conformity of the sparsity pattern, has been shown to significantly improve the performance of sparse recovery in compressive sensing and sparse coding [44]-[48]. Motivated by these works, we consider vector-valued graph signals that are regarded as multiple measurements of scalar-valued graph signals sharing discontinuity patterns.
There are a few variants of non-convex penalties that promote sparsity such as SCAD, MCP, weakly convex penalties, and () minimization [18, 49]-[52]. In this paper, we consider and develop theory for a family of non-convex penalties parametrized similarly to that in [18, 50] with SCAD and MCP as our prime examples, although it is valid for other non-convex penalties.
III Graph Signal Processing, Piecewise Smooth Signals, and Graph Trend Filtering
We consider an undirected graph , where is the set of nodes, is the set of edges, and is the unweighted adjacency matrix – also known as the graph shift operator [3]. The edge set represents the connections of the undirected graph , and the positive edge weight measures the underlying relation between the th and the th node, such as a similarity, a dependency, or a communication pattern. Let a scalar-valued graph signal be defined as
[TABLE]
where denotes the signal coefficient at the th node.
Let be the oriented incidence matrix of , where each row corresponds to an edge. That is, if the edge connects the th node to the th node (), the entries in the th row of is then given as
[TABLE]
The entries of the signal specifies the unweighted pairwise differences of the graph signal over each edge. As a result, can be interpreted as a graph difference operator. In graph signal processing, a signal is called smooth over a graph if is small.
III-A Piecewise Smooth Graph Signals
In practice, the graph signal may not be necessarily smooth over the entire graph, but only locally within different pieces of the graph. To model inhomogeneous levels of smoothness over a graph, we say that a graph signal is piecewise constant over a graph if many of the differences are zero for . Consequently, the difference signal is sparse and is small.
We can characterize piecewise th order polynomial signals on a graph, where the piecewise constant case corresponds to , by generalizing the notion of graph difference operators. Specifically, we use the following recursive definition of the th order graph difference operator [13]. Let for . For , let
[TABLE]
The signal is said to be a piecewise th order polynomial graph signal if is small. To further illustrate, let us consider the piecewise linear graph signal, corresponding to , as a signal whose value at a node can be linearly interpolated from the weighted average of the values at neighboring nodes. It is easy to see that this is the same as requiring the second-order differences to be sparse. Similarly, we say that a signal has a piecewise quadratic structure over a graph if the differences between the second-order differences defined for piecewise linear signals are mostly zero, that is, if is sparse. Fig. 1 illustrates various orders of piecewise graph smooth signals over the Minnesota road network graph.
III-B Denoising Piecewise Smooth Graph Signals via GTF
Assume we observe a noisy signal over the graph under i.i.d Gaussian noise:
[TABLE]
and seek to reconstruct from by leveraging the graph structure. When is a smooth graph signal, Laplacian smoothing [11, 12, 53]-[55] can be used, which solves the following problem:
[TABLE]
where . However, it cannot localize abrupt changes in the graph signal when the signal is piecewise smooth.
Graph trend filtering (GTF) [13] is a flexible framework for estimation on graphs that is adaptive to inhomogeneity in the level of smoothness of an observed signal across nodes. The th order GTF estimate is defined as:
[TABLE]
which can be regarded as applying total variation or fused LASSO with the graph difference operator [14, 56]. The sparsity-promoting properties of the norm have been well-studied [57]. Consequently, applying the penalty in GTF sets many of the (higher-order) graph differences to zero while keeping a small fraction of non-zero values. GTF is then adaptive over the graph; its estimate at a node adapts to the smoothness in its localized neighborhood.
IV Vector-Valued GTF with Non-convex Penalties
In this section, we first extend GTF to allow a broader family of non-convex penalties, and then extend it to handle vector-valued signals over the graph.
IV-A (Non-)convex Penalties
The norm penalty considered in (4) is well-known to produce biased estimates [58], which motivates us to extend the GTF framework to a broader class of sparsity-promoting regularizers that are not necessarily convex. We wish to minimize the following generalized th order GTF loss function:
[TABLE]
where
[TABLE]
is a regularizer defined as the sum of the penalty function applied element-wise to . Here, for even and for odd to account for different dimensions of ; see (1). We will refer to the GTF estimator that minimizes as scalar-GTF.
Similarly to [18, 20, 50], we consider a family of penalty functions that satisfies the following assumptions.
Assumption 1**.**
Assume satisfies the following:
- (a)
* satisfies , is symmetric around [math], and is non-decreasing on the real non-negative line.* 2. (b)
For , the function is non-increasing in . Also, is differentiable for all and sub-differentiable at , with . This upper bounds . 3. (c)
There exists such that is convex.
Many penalty functions satisfy these assumptions. Besides the penalty, the non-convex SCAD [16] penalty
[TABLE]
and the MCP [17]
[TABLE]
also satisfy them. We note that Assumption 1 (c) is satisfied for SCAD with and for MCP with . Fig. 2 illustrates the , SCAD and MCP penalties for comparison. While the non-convexity means that in general, we may not always find the global optimum of , it often affords us many other advantages. SCAD and MCP both taper off to a constant value, and hence apply less shrinkage for higher values. As a result, they mitigate the bias effect while promoting sparsity. Further, they are smooth and differentiable for and are both upper bounded by the penalty for all .
IV-B Vector-Valued GTF
In many applications, the signals on each node are in fact multi-dimensional or vector-valued, e.g. time series in social networks, multi-class labels in semi-supervised learning, feature vectors of different objects in feature selection. Therefore, it is natural to consider an extension to the graph signal denoising problem, where the graph signal on each node is a -dimensional vector instead of a scalar. In this scenario, we define a vector-valued graph signal to be piecewise smooth if it is piecewise smooth in each of its dimensions, and assume their discontinuities to coincide over the same small set of edges or nodes. Further, we denote the vector-valued signal of interest as , such that the th row of the matrix corresponds to the th node of the graph. The noise model for the observation matrix is defined as
[TABLE]
where each element of is drawn i.i.d from . A naïve approach is to estimate each column of separately via scalar-GTF:
[TABLE]
However, this formulation does not take full advantage of the multi-dimensionality of the graph signal. Instead, when the columns of are correlated, coupling them can be beneficial such that we encourage the sharing of information across dimensions or features. For example, if one column exhibits strong piecewise smoothness over the graph, and therefore has compelling evidence about the relationship between nodes, sharing that information to a related column can improve the overall denoising and filtering performance. As a result, we formulate a vector-GTF problem as follows:
[TABLE]
where the new penalty function is the sum of applied to the norm of each row of :
[TABLE]
By enforcing sparsity on , we are coupling to be of similar sparsity patterns across . Note the difference from (9), where elements of can be set to zero or non-zero independently.
V Theoretical Guarantees
In this section, we present the error rates and support recovery guarantees of the generalized GTF estimators, namely scalar-GTF (5) and vector-GTF (10), under the AWGN noise model. Before continuing, we first define a few useful quantities. Let be the number of connected components in the graph , or equivalently, the dimension of the null space of . Further, let be the number of rows of , and be the maximum norm of the columns of .
V-A Error Rates of First-order Stationary Points
Due to non-convexity, global minima of the proposed GTF estimators may not be attainable. Therefore, it is more desirable to understand the statistical performance of any first-order stationary points of the GTF estimators. We call a stationary point of , if it satisfies
[TABLE]
We further introduce the compatibility factor, which generalizes the notion used in [36] to allow vector-valued signals.
Definition 1** (Compatibility factor).**
Let be fixed. The compatibility factor of a set is defined as , and for nonempty set ,
[TABLE]
To further build intuition, consider . This is precisely the definition of , an induced norm of the submatrix of . If we consider signals with fixed power , will depend on how much the edges are connected to each other. Together with , can be related to the restricted eigenvalue condition, which is often used to bound the performance of LASSO [40]. With slight abuse of notation, we write .
We have the following oracle inequality that is applicable to the stationary points of the GTF estimators. The proof in Appendix A follows a construction that is similar to Theorem 2 in [36]. The oracle inequality holds for any that satisfies the first-order optimality condition, allowing the use of non-convex penalties. This mild condition on is a key difference from [13, Theorem 3] and [1, Theorem 1] that are applicable to the global minimizer, which is difficult to guarantee when using non-convex penalties. We also stress that although GTF was motivated by piecewise smooth graph signals, Theorem 1 holds for any graph and graph signal .
Theorem 1** (Oracle inequality of GTF stationary points).**
Assume . Fix . For scalar-GTF (5), let be a stationary point. Set , then
[TABLE]
with probability at least for any . Similarly, for vector-GTF (10), let be a stationary point. Set , then
[TABLE]
with probability at least for any .
Remark 1**.**
Recall that is defined in Assumption 1 (c), which characterizes how “non-convex” the regularizer is, and dictates the inflection point in Fig. 2. The assumption in Theorem 1 therefore implicitly constrains the level of non-convexity of the regularizer. Take MCP in (7) for example: since , we can guarantee the existence of a valid such that as long as we set .
Theorem 1 allows one to select and to optimize the error bounds on the right hand side of (12) and (13). For example, pick in (12) (hence an “oracle”) to have
[TABLE]
where .
- •
By setting as an empty set, we have
[TABLE]
which suggest that the reconstruction accuracy improves when the ground truth is better aligned with the graph structure, and consequently the value of is small.
- •
On the other hand, by setting as the support of , we achieve
[TABLE]
which grows linearly as we increase the sparsity level .
Similar discussions can be conducted for vector-GTF by choosing in (13). More importantly, we can directly compare the performance of vector-GTF with scalar-GTF, which was formulated for vector-valued graph signals in (9). The error bound of vector-GTF pays a small price in the order of , but is tighter than scalar-GTF if . This suggests that vector-GTF is much more advantageous when the support sets of for overlap, i.e. when the local discontinuities and patterns in are shared.
V-B Comparison with Scalar-GTF using Regularization
We compare our error bound for scalar-GTF that is on with [13, Theorem 3], which is obtained for GTF with the penalty, reproduced below for convenience.
Theorem 2** (Basic error bound of GTF minimizer).**
If , then , the minimizer of (4), satisfies
[TABLE]
The above bound is comparable to our bound in the special case of setting to an empty set, i.e. (15). The first term of the bound in (15) is upper bounded by that of Theorem 2. The non-convex regularization yields especially tighter bounds when contains large coefficients, so that . On the other hand, the second term of (15) contains in the denominator, which makes it an upper bound of the second term in Theorem 2. This gap can be brought down by choosing a larger , which allows one to pick a smaller , as mentioned in Remark 1. However, as , non-convex SCAD and MCP also tends to , which erases the improvement from using non-convex regularizers in the first term of the bound. This indicates a trade-off in the overall error bound based on , or the “non-convexity” of the regularizers chosen for scalar-GTF.
To sum up, despite being non-convex, we can guarantee that any stationary point of the proposed GTF estimator possesses strong statistical guarantees.
V-C Error Rates for Erdős-Rényi Graphs
We next specialize Theorem 1 to the Erdős-Rényi random graphs using spectral graph theory [59]. Let and respectively be the maximum and expected degree of the graph. It is known that for any graph it holds [13]
[TABLE]
where is the smallest non-zero eigenvalue of the graph Laplacian matrix . Moreover, we have , and [60]. Next, we present a simple lower bound on , which is proved in Appendix -B.
Proposition 1** (Bound on ).**
* is bounded for any and as*
[TABLE]
For an Erdős-Rényi random graph, if , we have almost surely [61, Corollary 8.2] and . Furthermore, [13, 59, 62], and for odd and for even . Therefore, with probability at least , we have
[TABLE]
where by plugging in .
These results are also applicable to -regular Ramanujan graphs [62].
V-D Support Recovery
An alternative yet important metric for gauging the success of the proposed GTF estimators is support recovery, which aims to localize the discontinuities in the piecewise smooth graph signals, i.e. the support set of , that is
[TABLE]
In particular, for odd , the discontinuities correspond to graph nodes; and for even , they correspond to the edges. Let be the GTF estimate of the graph signal. The quality of the support recovery can be measured using the graph screening distance [38]. For any and , let denote the length of the shortest path between them. The distance of from is then defined as
[TABLE]
Interestingly, Lin et.al. [38] showed recently that under mild assumptions, one can translate the error bound into a support recovery guarantee. Specifically, letting be the RHS of (12) that bounds the error in Theorem 1, we have
[TABLE]
where quantifies the minimum level of discontinuity, defined as the minimum absolute value of the non-zero values of , i.e.
[TABLE]
Consequently, this leads to support recovery guarantees of the proposed GTF estimators. Numerical experiment in Section VII-A verifies the superior performance of the non-convex regularizers over the regularizer for support recovery.
VI ADMM Algorithm and its Convergence
There are many algorithmic approaches to optimize the vector-GTF formulation in (10), since scalar-GTF (5) can be regarded as a special case with . In this section, we illustrate the approach adopted in this paper, which is the Alternating Direction Method of Multipliers (ADMM) framework for solving separable optimization problems [63].
Via a change of variable as , we can transform (10) to
[TABLE]
Its corresponding Lagrangian can be written as:
[TABLE]
where is the Lagrangian multiplier, and is the parameter. Alg. 1 shows the ADMM updates based on the Lagrangian in (21). Recall the proximal operator is defined as for a function . , SCAD and MCP all admit closed-form solutions of Prox, which are simple thresholding operations [64]. Furthermore, we have the following convergence guarantee for Alg. 1, whose proof is provided in Appendix -C. Theorem 3 implies that the output of Alg. 1 satisfies Theorem 1.
Theorem 3**.**
Let , then Alg. 1 converges to a stationary point of (10).
In addition, we provide a detailed time complexity analysis of Alg. 1 in Table II. Note that since is a sparse matrix with exactly non-zero entries, Alg. 1 can run much faster when . As a preprocessing step for each , we compute and , the eigenvectors and eigenvalues of , exactly once. can then be initialized very efficiently for all experiments that use .
VII Numerical Experiments
For the following experiments, we fixed for SCAD, for MCP. Further, the graphs we use in the following experiments satisfy Assumption 2 for this choice of . Unless explicitly mentioned, we tuned and for each experiment using the Hyperopt toolbox [65]. To meet the convergence criteria in Theorem 3, we enforced . SCAD/MCP were warm-started with the GTF estimate with penalty. Python packages PyGSP [66] and NetworkX [67] were used to construct and plot graphs. The input signal SNR was calculated as , while the reconstructed signal SNR was calculated as , where was the reconstruction. Computation time was measured with MacBook Pro 2017 with an 2.9 GHz Intel Core i7 and 16GB RAM. Our code is available at https://github.com/HarlinLee/nonconvex-GTF-public.
VII-A Denoising via GTF with Non-convex Regularizers
We first highlight via synthetic examples two important advantages that non-convex regularizers provide over the penalty.
- •
Bias Reduction: We demonstrate the reduction in signal bias in Fig. 3 for the graph signal defined over a 2D-grid graph, using both the penalty and the MCP penalty. Clearly, the MCP estimate (orange) has less bias than the estimate (blue), and can recover the ground truth surface (purple) more closely.
- •
Support Recovery: We illustrate the improved support recovery performance of non-convex penalties on localizing the boundaries for a piecewise constant signal on the Minnesota road graph, shown in Fig. 5. Particularly, we look at how well our estimator localizes the support of , that is, the discontinuity of the piecewise constant graph signal by looking at how well we can classify an edge as connecting two nodes in the same piece or being a cut edge across two pieces. By sweeping the regularization parameter , we obtain the ROC curve in Fig. 4, i.e. the true positive rate versus the false positive rate of classifying a cut edge correctly, and see that scalar-GTF with MCP and SCAD consistently outperforms the scalar-GTF with penalty.
Then, we compare the performance of GTF using non-convex regularizers such as SCAD and MCP with that using the norm more rigorously. For the ground truth signal , we construct a piecewise constant signal on a 2D-grid graph and the Minnesota road graph [66] as shown in the left panel of Fig. 5, and add different levels of noise following (2). We recover the signal by scalar-GTF with Alg. 1, and plot the SNR of the reconstructed signal versus the SNR of the input signal in solid lines in the middle panel of Fig. 5, averaged over and realizations, respectively. SCAD/MCP consistently outperforms in denoising graph signals defined over both regular and irregular structures.
VII-B Denoising Vector-valued Signals via GTF
We compare the performance of vector-GTF in (10) with (9), which applies scalar-GTF to each column of the vector-valued graph signal. The convex norm, and the non-convex SCAD and MCP are employed. We reuse the same ground truth graph signals over the 2D-grid graph and the Minnesota road graph constructed in Section VII-A in Fig. 5. independent noisy realizations of the graph signal are concatenated to construct a noisy vector-valued graph signal with dimension on the 2D-grid graph and with on the Minnesota road graph. We recover the vector-valued signal by minimizing vector-GTF (10) with Alg. 1.
The middle panel of Fig. 5 plots the average SNR of the reconstructed signal versus the average SNR of the input signal in dotted lines. We emphasize that the performance of (9) is the same as applying scalar-GTF to each realization, which is shown in the middle panel of Fig. 5 in solid lines. As before, SCAD/MCP consistently outperforms in denoising signals over both regular and irregular graphs. Furthermore, as expected, due to the sharing of information across realizations, vector-GTF consistently outperforms scalar-GTF, especially in the low SNR regime.
The right panel of Fig. 5 plots the computation time versus the gain in SNR from denoising via vector-GTF. 10 trials are performed for each regularizer with the input signal SNR fixed at 20dB. Parameter tuning and eigen decomposition of are preprocessing steps, and hence they are not included in the time measurement; but for reference, the eigen decomposition took 0.025 and 2.5 seconds for 2D-grid and Minnesota graphs, respectively. Since GTF with non-convex regularizers are warm-started by the estimate, the runtime for GTF is added to the SCAD/MCP runtime. Overall, running vector-GTF with SCAD/MCP after once with takes more time, but with large benefits in the denoising performance. Even with the additional computation time, Vector-GTF runs reasonably fast; with the Minnesota road network, where and , computation takes less than 25 seconds.
We further investigate the benefit of sharing information across measurements or realizations in the following experiment, using the same ground truth signal on the 2D-grid graph. We stack eight noisy realizations of this same piecewise constant signal to build a vector-valued signal. We construct these noisy measurements by scaling each one of them differently and randomly such that each will have SNR dB under (2). This has the effect of rendering some measurements more informative than others, and potentially allowing vector-GTF to reap the benefits of sharing information across measurements. We recover the 8-dimensional graph signal via Alg. 1 using , SCAD, and MCP regularizers, and in Table III, report the input signal and reconstructed signal SNRs for each measurement in addition to the average SNRs. is fixed at .
First of all, notice that as before, using SCAD/MCP generally achieves results with higher SNR than using , and that on average, minimizing (10) outperforms minimizing (9). The effect of sharing information across measurements is most apparent in low SNR settings, when information about the boundaries of the graph signal can be borrowed from higher SNR signals to improve the estimation. On the other hand, sharing information with noisier signals does not help denoising signals with high input SNR. However, it is worth noting that, unlike , SCAD/MCP does not see decrease in its performance in the high SNR settings.
VII-C Event Detection with NYC Taxi Data
To further illustrate graph trend filtering on a real-world dataset, we consider the road network of Manhattan where the nodes correspond to junctions [68]. We map the pickups and dropoffs of the NYC taxi trip dataset [69, 70] to the nearest road junctions, and define the total count at that junction to be the signal value on the corresponding graph node. The signal of interest, plotted on the top left panel of Fig. 6, is the difference between the event graph signal on the day of NYC Gay Pride parade, 12-2pm on June 26, 2011, and the seasonal average graph signal at the same time during the 8 nearest Sundays. During the event, no pickups and dropoffs could occur in the areas shown in the top right panel of Fig. 6 [71]. We denoise the signal via GTF using both and MCP, where we chose such that . Once again, we observe the GTF estimate with MCP produces sharper traces around the parade route, indicating better capabilities of event detection and localization.
VII-D Semi-supervised Classification
Graph-based learning provides a flexible and attractive way to model data in semi-supervised classification problems when vast amounts of unlabeled data are available compared to labeled data, and labels are expensive to acquire [11, 12, 55]. One can construct a nearest-neighbor graph based on the similarities between each pair of samples, and hope to propagate the label information from labeled samples to unlabeled ones. We move beyond our original problem in (5) to a -class classification problem in a semi-supervised learning setting, where for a given dataset with samples, we observe a subset of the one-hot encoded class labels, , such that if th sample has been observed to be in th class, and otherwise. A diagonal indicator matrix denotes samples whose class labels have been observed. Then, we can define the modified absorption problem [12, 13, 55] using a variation of GTF to estimate the unknown class probabilities :
[TABLE]
where (set to be uniform in the experiment) is a fixed prior belief, and determines how much emphasis to be given to the prior belief. The labels can be estimated using such that if and only if , and otherwise . Note that this can be completely separated into scalar-GTF problems, one corresponding to each class.
We applied the algorithm in to 6 popular UCI classification datasets [72] with . For each dataset, we normalized each feature to have zero mean and unit variance, and constructed a 5-nearest-neighbor graph of the samples based on the Euclidean distance between their features, with edge weights from the Gaussian radial basis kernel. We observed the labels of 20% of samples in each class randomly. Table IV shows the misclassification rates averaged over repetitions, which demonstrates that the performance using non-convex penalties such as SCAD/MCP are at least competitive with, and often better than, those with the penalty.
VIII Conclusions
We presented a framework for denoising piecewise smooth signals on graphs that generalizes the graph trend filtering framework to handle vector-valued signals using a family of non-convex regularizers. We provided theoretical guarantees on the error rates of our framework, and derived a general ADMM-based algorithm to solve this generalized graph trend filtering problem. Furthermore, we demonstrated the superior performance of these non-convex regularizers in terms of reconstruction error, bias reduction, and support recovery on both synthetic and real-world data. In particular, its performance on semi-supervised classification is investigated. In future work, we plan to further study this approach when the graph signals are observed by indirect and incomplete measurements.
-A Proof of Theorem 1
Proof.
We denote . Define as the row space of , and the null space. Let , the projection onto , and . Additionally, , the projection onto . Since is a stationary point of , it follows that
[TABLE]
By the chain rule, . Then by (23), there exists , such that
[TABLE]
In particular, , we have
[TABLE]
and, specializing to ,
[TABLE]
Subtract (25) from (24), and use the definition of subgradient to get ,
[TABLE]
By the measurement model and the polarization equality, i.e. , the left-hand side of (-A) can be rewritten as
[TABLE]
Combining (-A) and (27) gives us
[TABLE]
Let us first consider . From Hölder’s inequality,
[TABLE]
By standard tail bounds for independent Gaussian random variables, we have with probability at least ,
[TABLE]
Additionally, recognize that is a chi-squared random variable with degrees of freedom. We can then invoke the one-sided tail bound for chi-squared random variables (c.f. [73, Example 2.5]) such that for any ,
[TABLE]
Consequently, with probability at least ,
[TABLE]
The inequalities (31) and (30) hold simultaneously with probability at least . Then, using and , we can bound (29) further as
[TABLE]
Together with , we can upper bound (-A) as
[TABLE]
Note that for any set , . Therefore, using the triangle inequality and subadditivity and symmetry of ,
[TABLE]
We bound (33) further by the compatibility factor,
[TABLE]
Now combining (32), (33), and (34), we then have
[TABLE]
Apply Young’s inequality, which is for , with , , and , we have
[TABLE]
Therefore,
[TABLE]
Cancel on both sides, apply the infimum over and plug in the bounds (31) to get
[TABLE]
The proof extends for the vector-GTF (10) in a similar manner. We need to replace (29) by
[TABLE]
where \|\mathcal{P}_{\mathrm{R}^{\bot}}\bm{E}\|_{\mathrm{F}}^{2}\leq d\sigma^{2}\Big{(}C_{G}+2\sqrt{2C_{G}\log(d/\delta)}\Big{)} with probability at least . Similarly, for (34), we use the generalized definition of the compatibility factor , given as
[TABLE]
which will lead to the claimed bound in the theorem. ∎
-B Proof of Proposition 1
Proof.
By Cauchy-Schwartz inequality, we have
[TABLE]
and note that given two matrices and , where is a subset of rows indices. We also use the fact that . We consider two cases:
- •
For even , we have
[TABLE]
Note that is equivalent to the incidence matrix of a subgraph with only edges, which allows us to bound,
[TABLE]
where is the degree of node .
- •
For odd , we have
[TABLE]
where is the principal submatrix of indexed by . By Cauchy’s interlacing theorem, the maximum eigenvalue of the submatrix is upper bounded, so
[TABLE]
Therefore, for all , . To conclude the proof, note
[TABLE]
∎
-C Proof of Theorem 3
We show the convergence of Alg. 1 by proving a modified version of [22, Proposition 1]. The superscript denotes the values of at the th iteration of the loop inside Alg. 1.
Proposition 2** (Convergence to a feasible solution).**
If , then the primal residual and the dual residual of Alg. 1 satisfy that and .
Proof.
Denote , and . Recall from Assumption 1 (c) that there exists such that is convex. Now consider the Lagrangian with regard to the -th row of , assuming all other variables are fixed:
[TABLE]
where and represent terms of that do not depend on . With our choice of , then is convex with regard to each of , , and for each row of , allowing us to apply [74, Theorem 5.1]. Therefore, Alg. 1 converges to limit points .
Then it follows that the dual residual . For the primal residual, notice that the update step in line 10 of Alg. 1 also shows that ,
[TABLE]
Fixing and setting , we have
[TABLE]
holds . Hence, , and therefore . ∎
This proposition shows that the algorithm in the limit achieves primal and dual feasibility, and that the Augmented Lagrangian in (21) with and becomes the original GTF formulation in (10). that is produced by every iteration of Alg. 1 is a stationary point of (21) with fixed and . As a result, is a stationary point of (10).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] R. Varma, H. Lee, Y. Chi, and J. Kovačević, “Improving graph trend filtering with non-convex penalties,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2019, pp. 5391–5395.
- 2[2] M. Newman, Networks . Oxford University Press, 2018.
- 3[3] A. Sandryhaila and J. M. Moura, “Discrete signal processing on graphs,” IEEE Transactions on Signal Processing , vol. 61, no. 7, pp. 1644–1656, 2013.
- 4[4] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine , vol. 30, no. 3, pp. 83–98, 2013.
- 5[5] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” Proceedings of the IEEE , vol. 106, no. 5, pp. 808–828, 2018.
- 6[6] S. Chen, R. Varma, A. Singh, and J. Kovačević, “Signal representations on graphs: Tools and applications,” ar Xiv preprint ar Xiv:1512.05406 , 2015.
- 7[7] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević, “Discrete signal processing on graphs: Sampling theory,” IEEE Transactions on Signal Processing , vol. 63, no. 24, pp. 6510–6523, 2015.
- 8[8] S. Chen, A. Sandryhaila, J. M. Moura, and J. Kovačević, “Signal recovery on graphs: Variation minimization,” IEEE Transactions on Signal Processing , vol. 63, no. 17, pp. 4609–4624, 2015.
