Nonlinear Function Estimation with Empirical Bayes and Approximate Message Passing

Hangjin Liu; You (Joe) Zhou; Ahmad Beirami; and Dror Baron

arXiv:1907.02482·cs.IT·October 2, 2019


TL;DR

This paper introduces a method for nonlinear function estimation by reducing the problem to a linear one with polynomial kernels, and employs AMP algorithms for Bayesian and empirical Bayes coefficient estimation, outperforming LASSO.

Contribution

It presents a novel approach combining polynomial kernel expansion with AMP algorithms for nonlinear function estimation, demonstrating improved accuracy over traditional methods.

Findings

1. AMP-based methods outperform LASSO in prediction accuracy.

2. Kernel expansion with polynomial features yields reasonably well-conditioned matrices in the examples studied.

3. An empirical Bayes approach within AMP estimates the coefficients in nonlinear settings.

Abstract

Nonlinear function estimation is core to modern machine learning applications. In this paper, to perform nonlinear function estimation, we reduce a nonlinear inverse problem to a linear one using a polynomial kernel expansion. These kernels increase the feature set, and may result in poorly conditioned matrices. Nonetheless, we show several examples where the matrix in our linear inverse problem contains only mild linear correlations among columns. The coefficients vector is modeled within a Bayesian setting for which approximate message passing (AMP), an algorithmic framework for signal reconstruction, offers Bayes-optimal signal reconstruction quality. While the Bayesian setting limits the scope of our work, it is a first step toward estimation of real world nonlinear functions. The coefficients vector is estimated using two AMP-based approaches, a Bayesian one and empirical Bayes.…

Tables

Table I: Empirical value of $\sigma_{1}^{2}$ compared to our prediction (11).

$M$	$N$	$L$ (8)	$\sigma_{1}^{2}$ (empirical)	$\sigma_{1,pred}^{2}$ (11)
1000	10	66	4.99	4.39
1500	15	136	6.72	6.08
2000	20	231	8.41	7.77
3000	20	231	8.35	7.74
3000	30	496	11.84	11.16
4000	40	861	15.23	14.54
4500	50	1326	18.63	17.95
5000	60	1891	22.09	21.37
5500	70	2556	25.51	24.79
5000	80	3321	29.06	28.31
6000	80	3321	28.94	28.21
8000	80	3321	28.78	28.07
8000	90	4186	32.26	31.51
6000	100	5151	35.93	35.18
8000	100	5151	35.66	34.96

Table II: Empirical MSE on test data for nonlinear function estimation (median MSE over 20 realizations).

Measurement rate $R=\frac{M}{L}$	LASSO	AMP	Pseudoinverse
0.14	0.0382	0.0293	0.041
0.28	0.0298	0.0228	0.033
0.56	0.0063	0.0036	0.01

Equations

$$y_{m}=f(\mathbf{x}_{m})+z_{m}, \tag{1}$$

$$\mathbf{y}=\mathbf{f}(\mathbf{X})+\mathbf{z}\in\mathbb{R}^{M}. \tag{2}$$

$$\mathbf{y}=\mathbf{X}\boldsymbol{\theta}+\mathbf{z}, \tag{3}$$

$$\begin{aligned}\boldsymbol{\theta}^{t+1}&=\eta^{t}(\mathbf{X}^{T}\mathbf{r}^{t}+\boldsymbol{\theta}^{t}),\\ \mathbf{r}^{t}&=\mathbf{y}-\mathbf{X}\boldsymbol{\theta}^{t}+\frac{1}{R}\,\mathbf{r}^{t-1}\left\langle\eta^{{(t-1)}^{\prime}}(\mathbf{X}^{T}\mathbf{r}^{t-1}+\boldsymbol{\theta}^{t-1})\right\rangle,\end{aligned} \tag{4}$$

$$R=\frac{M}{N}$$

$$\mathbf{q}^{t}=\mathbf{X}^{T}\mathbf{r}^{t}+\boldsymbol{\theta}^{t}=\boldsymbol{\theta}+\mathbf{v}^{t},$$

$$\mathbf{X}_{Q}=\begin{bmatrix}1&x_{11}&\dots&x_{1N}&x_{11}^{2}&\dots&x_{1N}^{2}&x_{11}x_{12}&\dots&x_{1(N-1)}x_{1N}\\ 1&x_{21}&\dots&x_{2N}&x_{21}^{2}&\dots&x_{2N}^{2}&x_{21}x_{22}&\dots&x_{2(N-1)}x_{2N}\\ \vdots&\vdots&&\vdots&\vdots&&\vdots&\vdots&&\vdots\\ 1&x_{M1}&\dots&x_{MN}&x_{M1}^{2}&\dots&x_{MN}^{2}&x_{M1}x_{M2}&\dots&x_{M(N-1)}x_{MN}\end{bmatrix}. \tag{6}$$

$$f(\mathbf{x})=\sum_{\ell=1}^{L}\theta_{\ell}\,g_{\ell}(\mathbf{x}).$$

$$y_{m}=\theta_{1}+\sum_{n=1}^{N}[\boldsymbol{\theta}_{2}]_{n}x_{mn}+\sum_{n=1}^{N}[\boldsymbol{\theta}_{3}]_{n}x_{mn}^{2}+\sum_{n_{1}=1}^{N}\sum_{n_{2}=n_{1}+1}^{N}[\boldsymbol{\theta}_{4}]_{n}x_{mn_{1}}x_{mn_{2}}, \tag{7}$$

$$\mathbf{y}=\mathbf{X}_{Q}\boldsymbol{\theta}+\mathbf{z}=\mathbf{X}_{Q}\begin{bmatrix}\theta_{1}\\ \boldsymbol{\theta}_{2}\\ \boldsymbol{\theta}_{3}\\ \boldsymbol{\theta}_{4}\end{bmatrix}+\mathbf{z},$$

$$L=1+2N+\frac{N(N-1)}{2}. \tag{8}$$

$$\mathbf{y}=\mathbf{X}_{Q}\boldsymbol{\theta}+\mathbf{z}=\mathbf{X}_{Q}^{\prime}\boldsymbol{\theta}^{\prime}+\mathbf{z}, \tag{9}$$

$$[\mathbf{X}_{Q}^{\prime}]_{\ell m}=\frac{[\mathbf{X}_{Q}]_{\ell m}}{\|[\mathbf{X}_{Q}]_{\ell}\|_{2}},$$

$$\theta^{\prime}_{\ell}=\theta_{\ell}\,\|[\mathbf{X}_{Q}]_{\ell}\|_{2}.$$

$$\sigma_{1,pred}^{2}=1+\frac{N}{3}+\frac{N(N+1)}{2M}. \tag{11}$$

$$\widehat{\boldsymbol{\theta}}=\arg\min_{\boldsymbol{\theta}}\frac{1}{2}\|\mathbf{y}-\mathbf{X}_{Q}\boldsymbol{\theta}\|_{2}^{2}+\sum_{j=1}^{4}\lambda_{j}\|\boldsymbol{\theta}_{j}\|_{1}, \tag{12}$$

$$\widehat{\theta}_{\ell}=\frac{\widehat{\theta^{\prime}}_{\ell}}{\|[\mathbf{X}_{Q}]_{\ell}\|_{2}}, \tag{13}$$

$$\frac{\|\mathbf{y}_{test}-\mathbf{X}_{test}\widehat{\boldsymbol{\theta}}\|_{2}^{2}}{K}=\frac{\|\mathbf{X}_{test}(\boldsymbol{\theta}-\widehat{\boldsymbol{\theta}})\|_{2}^{2}}{K},$$

$$\mathbf{y}=\sum_{i=1}^{3}w_{i}\sin(\mathbf{X}\boldsymbol{\rho}_{i}+\phi_{i})+\mathbf{z},$$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Blind Source Separation Techniques · Image and Signal Denoising Methods

Full text

Nonlinear Function Estimation

with Empirical Bayes

and Approximate Message Passing

Hangjin Liu,† You (Joe) Zhou,† Ahmad Beirami,‡ and Dror Baron†

†Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695, USA

‡Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Email: {hliu25,yzhou26,barondror}@ncsu.edu, [email protected]

Abstract

Nonlinear function estimation is core to modern machine learning applications. In this paper, to perform nonlinear function estimation, we reduce a nonlinear inverse problem to a linear one using a polynomial kernel expansion. These kernels increase the feature set, and may result in poorly conditioned matrices. Nonetheless, we show several examples where the matrix in our linear inverse problem contains only mild linear correlations among columns. The coefficients vector is modeled within a Bayesian setting for which approximate message passing (AMP), an algorithmic framework for signal reconstruction, offers Bayes-optimal signal reconstruction quality. While the Bayesian setting limits the scope of our work, it is a first step toward estimation of real world nonlinear functions. The coefficients vector is estimated using two AMP-based approaches, a Bayesian one and empirical Bayes. Numerical results confirm that our AMP-based approaches learn the function better than LASSO, offering markedly lower error in predicting test data.

Index Terms:

Approximate message passing, function estimation, kernel regression, nonlinear functions, Taylor series.

I Introduction

A pervasive trend in modern society is that ever-larger amounts of data are being collected and analyzed in order to explain various phenomena. In supervised learning, many variables (also referred to as features) that may relate to and thus help explain the phenomena of interest are observed, and the goal is to learn a function — often a nonlinear one — that relates the explanatory variables to the phenomena of interest. More specifically, we have a multivariate nonlinear function, $\mathbf{f}(\cdot)$ , and collect noisy samples of it; our goal is to estimate $\mathbf{f}(\cdot)$ . At its core, this is multivariate nonlinear function estimation; it could also be interpreted as nonlinear regression or feature selection. Algorithms for solving such problems must be robust to noisy observations and outliers, backed up by fundamental mathematical analysis, support missing data, and have a fast implementation that scales well to large-scale problems. Such algorithms will impact many disciplines, such as health informatics [1], social networks, and finance [2, 3, 4].

Example applications: Let us describe how nonlinear function estimation can be used in financial prediction [2, 3, 4]. A typical approach to estimate expected returns uses a linear factor model, which is tuned to work well on training data, $y_{m}=\sum_{n}{{\bf X}_{mn}}\theta_{n}+z_{m}$ , where $y_{m}$ is the price change of asset $m$ , ${{\bf X}_{mn}}$ is the exposure of asset $m$ to factor $n$ , $\theta_{n}$ is the return of factor $n$ , and $z_{m}$ is noise in asset $m$ . We can express the linear model in matrix vector form, ${\bf y}={\bf X}{\boldsymbol{\theta}}+{\bf z}$ , where ${\bf X}$ is an input data matrix, by assigning $y_{m}$ as the $m$ -th entry of the vector ${\bf y}$ , $\theta_{n}$ as the $n$ -th entry of the vector ${\boldsymbol{\theta}}$ , and $\mathbf{X}_{mn}$ as the element of the matrix ${\bf X}$ in row $m$ and column $n$ . The goal is to estimate ${\boldsymbol{\theta}}$ from ${\bf y}$ , ${\bf X}$ , and possible statistical knowledge about ${\boldsymbol{\theta}}$ and ${\bf z}$ . We can see that financial prediction based on linear models relies on solving linear inverse problems. That said, some factors relate to returns in a nonlinear way [5], and financial prediction could be improved using nonlinear schemes.

Nonlinear modeling can also be used in health informatics [1], where $\mathbf{y}$ could measure patients’ medical condition, ${\bf X}$ contains nonlinear exposure terms, and ${\boldsymbol{\theta}}$ are explanatory variables that drive the patients’ condition. Our goal is to understand the relationships between explanatory variables and patients’ medical condition.

Main idea and contributions: In this paper, as a first step toward learning nonlinear functions, we cast them as linear inverse problems using polynomial kernels [6, 7] (Sec. III). Incorporating kernels into the matrix maps the nonlinear signal estimation procedure into a linear inverse problem but with an increased feature set, where the features are no longer independent and identically distributed (i.i.d.). Unfortunately, the kernels may create poorly conditioned matrices, and many solvers for linear inverse problems struggle with such matrices. Nonetheless, the matrices in our linear inverse problems often contain only mild linear correlations among columns, and are reasonably well conditioned.

While the polynomial kernels greatly increase the richness of the model class that captures the phenomena of interest, they also significantly increase the dimensionality of the features. For example, $N$ factors evaluated with quadratic kernels will become approximately $\frac{1}{2}N^{2}$ new factors. This large scale and well-conditioned linear inverse problem is well-suited to approximate message passing (AMP) [8, 9], an algorithmic framework for signal reconstruction that is asymptotically optimal for large scale linear inverse problems in the sense that it achieves best-possible reconstruction quality [10, 11]. Our AMP-based algorithms improve reconstruction quality of the coefficients vector ${\boldsymbol{\theta}}$ , leading to better estimation of the nonlinear function.

Two AMP-based approaches are considered. The first follows a Bayesian framework, where we assume that the coefficients vector follows some known probabilistic structure. While the Bayesian framework is naive and limited in scope, our past work has shown that universal approaches that adapt to unknown statistical distributions can be integrated within solvers for linear inverse problems [12, 13, 14], thus bypassing the Bayesian limitation. The linear inverse problem resulting from our polynomial kernel expansion is solved using an AMP-based algorithm, whose Bayes optimality ensures that our function estimation procedure can succeed despite using fewer and noisier samples than other methods.

The second AMP-based approach uses empirical Bayes [15], where the coefficients vector is assumed to follow some parametric distribution, and in each iteration of AMP we plug maximum likelihood parameter estimates into a parametric Bayesian denoiser.

The resulting algorithms will allow data to better model dependencies between explanatory variables and phenomena of interest. These algorithms could also help reconstruct signals acquired by nonlinear analog systems, allowing hardware designers to exploit nonlinearities rather than avoid them.

Organization: The rest of the paper is organized as follows. Section II provides background content. Details of our approach for estimating multivariate nonlinear functions appear in Section III. Numerical results appear in Section IV, and Section V concludes.

II Background

II-A Inverse problems

We present a flexible formulation for nonlinear function estimation in the form of a nonlinear inverse problem. We observe $M$ independent samples of the form $\{(\mathbf{x}_{m},y_{m})\}_{{m}\in{\{1,\ldots,M\}}}$ , where $(\mathbf{x}_{m},y_{m})\in\mathbb{R}^{N}\times\mathbb{R}$ , through a nonlinear function $f(\cdot)$ and additive noise,

$$y_{m}=f(\mathbf{x}_{m})+z_{m}, \tag{1}$$

for all ${m}\in\{1,\ldots,M\}$ . In other words, the input data matrix ${\bf X}{\in\mathbb{R}^{M\times N}}$ , whose rows are the locations of the samples, is processed by applying a multivariate operator, ${\bf f}(\cdot):\mathbb{R}^{M\times N}\rightarrow{\mathbb{R}^{M}}$ , such that $\bf f$ applies $f$ to each individual row of the data matrix $\bf X$ , with additive noise, ${\bf z}\in{\mathbb{R}^{M}}$ , resulting in noisy measurements,

$$\mathbf{y}=\mathbf{f}(\mathbf{X})+\mathbf{z}\in\mathbb{R}^{M}. \tag{2}$$

While the reader is likely familiar with linear inverse problems, where the operator ${\bf f}$ boils down to multiplication by a coefficients vector $\boldsymbol{\theta}$ , i.e., ${\bf f}({\bf X})={\bf X}\boldsymbol{\theta}$ , our main interest is in nonlinear inverse problems.

We highlight that many “rules of thumb” that the sparse signal processing community has claimed, for example that sparse signals can be reconstructed from a small number of linear measurements, $M<N$ , may break down when the measurement noise ${\bf z}$ is large or the operator ${\bf f}(\cdot)$ contains significant nonlinearities.

II-B Approximate message passing (AMP)

One approach for solving linear inverse problems is AMP [8, 9], which is an iterative algorithm that successively converts the matrix problem to scalar channel denoising problems with additive white Gaussian noise (AWGN). AMP is a fast approximation to precise message passing (cf. Baron et al. [16], Montanari [9], and references therein), and has received considerable attention because of its fast convergence and the state evolution (SE) formalism [8, 17, 9], which characterizes how the mean squared error (MSE) achieved by the next iteration of AMP can be predicted using the MSE performance of the denoiser being used. AMP solves the following linear inverse problem,

$$\mathbf{y}=\mathbf{X}\boldsymbol{\theta}+\mathbf{z}, \tag{3}$$

where the empirical probability density function (pdf) of $\boldsymbol{\theta}$ follows $p_{\boldsymbol{\theta}}(\boldsymbol{\theta})$ , the operator ${\bf f(X)}$ multiplies ${\bf X}$ by the unknown coefficients vector $\boldsymbol{\theta}$ , and ${\bf z}$ is AWGN with variance $\sigma_{Z}^{2}$ . Although the AMP literature mainly considers i.i.d. Gaussian matrices, approaches such as damping [18] and Swept AMP [19] have been proposed to deal with more general matrices. After initializing ${\boldsymbol{\theta}}^{0}$ and ${\bf r}^{0}$ , AMP [8, 9] proceeds iteratively according to

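$$\begin{aligned}\boldsymbol{\theta}^{t+1}&=\eta^{t}(\mathbf{X}^{T}\mathbf{r}^{t}+\boldsymbol{\theta}^{t}),\\ \mathbf{r}^{t}&=\mathbf{y}-\mathbf{X}\boldsymbol{\theta}^{t}+\frac{1}{R}\,\mathbf{r}^{t-1}\left\langle\eta^{{(t-1)}^{\prime}}(\mathbf{X}^{T}\mathbf{r}^{t-1}+\boldsymbol{\theta}^{t-1})\right\rangle,\end{aligned} \tag{4}$$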

where $(\cdot)^{T}$ denotes the transpose,

$$R=\frac{M}{N}$$

is the measurement rate, $\eta^{t}(\cdot)$ is a denoising function, and $\langle{\bf u}\rangle=\frac{1}{N}\sum_{i=1}^{N}u_{i}$ for some vector ${\bf u}\in\mathbb{R}^{N}$ . The denoising function $\eta^{t}(\cdot)$ operates in a symbol-by-symbol manner (also known as separable) in the original derivation of AMP [8, 9]. That is, $\eta^{t}({\bf u})=(\eta^{t}(u_{1}),\eta^{t}(u_{2}),...,\eta^{t}(u_{N}))$ and $\eta^{t^{\prime}}({\bf u})=(\eta^{t^{\prime}}(u_{1}),\eta^{t^{\prime}}(u_{2}),...,\eta^{t^{\prime}}(u_{N}))$ , where $\eta^{t^{\prime}}(\cdot)$ denotes the derivative of $\eta^{t}(\cdot)$ .

A useful property of AMP in the large system limit ( $N,M\rightarrow\infty$ with the measurement rate $R$ constant) is that at each iteration, the vector ${\bf X}^{T}{\bf r}^{t}+{\boldsymbol{\theta}}^{t}\in\mathbb{R}^{N}$ in (4) is equivalent to the unknown coefficients vector $\boldsymbol{\theta}$ corrupted by AWGN. This property is based on the decoupling principle [20, 10, 21], which states that the posterior of a linear inverse problem (3) is statistically equivalent to a scalar channel. We denote the equivalent scalar channel at iteration $t$ by

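$$\mathbf{q}^{t}=\mathbf{X}^{T}\mathbf{r}^{t}+\boldsymbol{\theta}^{t}=\boldsymbol{\theta}+\mathbf{v}^{t},$$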

where $v^{t}_{i}\sim\mathcal{N}(0,\sigma_{t}^{2})$ , and $\mathcal{N}(\mu,\sigma^{2})$ is a Gaussian pdf with mean $\mu$ and variance $\sigma^{2}$ . AMP with separable denoisers, which are optimal for i.i.d. signals, has been rigorously proved to obey SE [17]. However, we will see in Section II-C that non-i.i.d. signals can be denoised better using non-separable denoisers.

Another useful property of AMP in the large system limit involves a Bayesian setting where a prior distribution for the coefficients vector $\boldsymbol{\theta}$ is available. In such Bayesian settings, AMP can use denoiser functions $\eta^{t}(\cdot)$ that minimize the MSE in each iteration $t$ [17]. Using such MSE-optimal denoisers, the MSE performance of AMP (4) approaches the minimum mean squared error (MMSE) as $t$ is increased.
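The AMP recursion and the scalar-channel view described in this subsection can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`bg_denoiser`, `amp`, `demo`), the i.i.d. Gaussian test matrix, the Bernoulli-Gaussian prior parameters, and the estimate of the scalar-channel noise variance as $\|\mathbf{r}^{t}\|_{2}^{2}/M$ are all choices made here for illustration.

```python
import numpy as np


def bg_denoiser(q, tau, p, s):
    """Conditional-mean denoiser for a Bernoulli-Gaussian prior (assumed here).

    Prior: theta = 0 with probability 1 - p, else theta ~ N(0, s).
    Scalar channel: q = theta + v, with v ~ N(0, tau).
    Returns the posterior mean and its derivative with respect to q.
    """
    # Log-odds that an entry is nonzero, given its noisy observation q.
    log_odds = (np.log(p / (1.0 - p)) + 0.5 * np.log(tau / (s + tau))
                + 0.5 * q**2 * (1.0 / tau - 1.0 / (s + tau)))
    pi = 1.0 / (1.0 + np.exp(-log_odds))
    gain = s / (s + tau)                      # Wiener gain of the Gaussian component
    mean = pi * gain * q
    second_moment = pi * (gain * tau + (gain * q) ** 2)
    # For a conditional-mean denoiser in Gaussian noise,
    # the derivative equals the posterior variance divided by tau.
    deriv = (second_moment - mean**2) / tau
    return mean, deriv


def amp(y, X, p, s, n_iter=30):
    """AMP iterations with a separable denoiser, starting from theta = 0, r = y."""
    M, N = X.shape
    theta = np.zeros(N)
    r = y.copy()
    for _ in range(n_iter):
        q = X.T @ r + theta                   # pseudo-data: theta plus roughly Gaussian noise
        tau = r @ r / M                       # assumed estimate of the scalar-channel variance
        theta, deriv = bg_denoiser(q, tau, p, s)
        # Onsager term: (1/R) * r * <eta'> with R = M/N equals r * sum(eta') / M.
        r = y - X @ theta + r * deriv.sum() / M
    return theta


def demo(seed=0, M=500, N=1000, p=0.1, s=1.0, noise_var=0.01):
    """Planted experiment; returns (MSE of the AMP estimate, average signal power)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))   # columns have roughly unit norm
    theta = rng.normal(0.0, np.sqrt(s), N) * (rng.random(N) < p)
    y = X @ theta + rng.normal(0.0, np.sqrt(noise_var), M)
    theta_hat = amp(y, X, p, s)
    return np.mean((theta_hat - theta) ** 2), np.mean(theta**2)
```

Calling `demo()` runs one planted experiment at measurement rate $R=0.5$; comparing the returned MSE with the signal power gives a quick sanity check that the iterations converge.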

II-C Non-scalar denoisers

While i.i.d. signals can be denoised in a scalar separable fashion within AMP, where each signal entry is denoised using the same scalar denoiser, real-world signals often contain dependencies between signal entries. For example, adjacent pixels in images are often similar in value, and scalar separable denoisers ignore these dependencies. Therefore, we apply non-separable denoisers to process non-i.i.d. signals within AMP. For example, if ${\boldsymbol{\theta}}$ is a time series containing dependencies between adjacent entries, then we can use a sliding window denoiser that processes entry $n$ of ${\boldsymbol{\theta}}$ using information from its neighbors [22, 14].

We will see in Section III that our signal reconstruction problem includes several types of coefficients in ${\boldsymbol{\theta}}$ , and we expect dependencies between coefficients. Therefore, non-scalar denoisers will be used within AMP to process non-i.i.d. coefficients.

III Learning nonlinear functions

Having reviewed relevant background material, we now recast nonlinear inverse problems (2) as linear inverse problems (3) using polynomial kernels [6, 7], which replace our input data matrix $\bf X$ with transformations of $\bf X$ [23].

Our nonlinear model (2) is motivated by the inadequacy of linear relationships in some applications. One example involves bioinformatics, where genetic factors involve multiplicative interactions among genes [24]. Another application involves financial prediction [2, 3, 4], where the research and development expenditures of a firm correlate with future returns in a nonlinear way [5]. Similar ideas have been widely used in the machine learning community under the context of polynomial kernel learning [6, 7], and the kernel trick has been introduced to linear inverse problems by Qi and Hughes [25]. A related model that learns interactions among variables is the multi-linear model [26], where columns that involve auto-interaction are removed from the polynomial model.

III-A Basis expansion

Recall that in our inverse problem, ${\bf y}={\bf f(X)}+{\bf z}$ , we define measurement $m\in\{1,\ldots,M\}$ as $y_{m}=f(\mathbf{x}_{m})+{z_{m}}$ (1). Linear inverse problems make use of models that are linear in the input factors; they are mathematically and algorithmically tractable, and can be interpreted as a first-order Taylor approximation to $f(\mathbf{x})$ [23]. However, in many applications, the true function $f(\mathbf{x})$ is far from linear in $\mathbf{x}$ .

A basis function expansion replaces $\mathbf{x}$ with transformations of $\mathbf{x}$ [23]. For $\ell\in\{1,2,\ldots,L\}$ , $f(\mathbf{x})$ is expressed as in the linear basis expansion of ${\mathbf{x}}$ :

$$f(\mathbf{x})=\sum_{\ell=1}^{L}\theta_{\ell}\,g_{\ell}(\mathbf{x}).$$

This model is linear in the new variable $g_{\ell}(\mathbf{x})$ , and $\theta_{\ell}$ are the coefficients. Basis expansions allow us to use a linear model to characterize and analyze nonlinear functions.

III-B Polynomial regression

We form a polynomial regression problem by applying a Taylor expansion to the multivariate nonlinear function $f(\cdot)$ [24]. In polynomial regression, we add to the original columns of the measurement matrix ${\bf X}_{Q}$ , which represent individual explanatory variables, extra columns that represent interactions among variables.

Let us elaborate on the quadratic case. While we will provide details of a matrix ${\bf X}_{Q}$ that supports a quadratic Taylor expansion (6), the reader should be able to employ this concept for cubic expansions and beyond. For each measurement, we use a Taylor expansion of the $N$ factor variables:

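$$y_{m}=\theta_{1}+\sum_{n=1}^{N}[\boldsymbol{\theta}_{2}]_{n}x_{mn}+\sum_{n=1}^{N}[\boldsymbol{\theta}_{3}]_{n}x_{mn}^{2}+\sum_{n_{1}=1}^{N}\sum_{n_{2}=n_{1}+1}^{N}[\boldsymbol{\theta}_{4}]_{n}x_{mn_{1}}x_{mn_{2}}, \tag{7}$$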

where $\theta_{1}$ is a constant, $\boldsymbol{\theta_{2}},\boldsymbol{\theta_{3}}\in\mathbb{R}^{N}$ are coefficient vectors for linear and quadratic terms, respectively, $\boldsymbol{\theta_{4}}\in\mathbb{R}^{\frac{N(N-1)}{2}}$ is a coefficient vector for cross terms, and the subscript $n$ in $[{\boldsymbol{\theta}}_{4}]_{n}$ depends on $n_{1}$ and $n_{2}$ .

Our quadratic Taylor approximation is a basis expansion, where we have chosen $g(\mathbf{x})$ as follows: (i) $g(\mathbf{x})=1$ corresponds to a DC constant; (ii) $N$ linear terms corresponding to the original data, $g(\mathbf{x})=x_{n}$ , $n\in\{1,\ldots,N\}$ ; (iii) $N$ quadratic terms corresponding to squares of individual linear terms, $g(\mathbf{x})=(x_{n})^{2}$ ; and (iv) $\frac{N(N-1)}{2}$ cross terms corresponding to products of pairs of linear terms, $g(\mathbf{x})=x_{n_{1}}x_{n_{2}}$ , where $n_{2}>n_{1}$ , $n_{1},n_{2}\in\{1,\ldots,N\}$ . We assume that the features matrix, $\mathbf{X}$ , is i.i.d. zero mean Gaussian for ease of analysis; different types of $\mathbf{X}$ are left for future work.

The polynomial regression model is formulated as a linear inverse problem (3) in matrix vector form,

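$$\mathbf{y}=\mathbf{X}_{Q}\boldsymbol{\theta}+\mathbf{z}=\mathbf{X}_{Q}\begin{bmatrix}\theta_{1}\\ \boldsymbol{\theta}_{2}\\ \boldsymbol{\theta}_{3}\\ \boldsymbol{\theta}_{4}\end{bmatrix}+\mathbf{z},$$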

where ${\boldsymbol{\theta}}\in{\mathbb{R}^{L}}$ is the coefficient vector, and $L$ is evaluated below (8). In our matrix $\mathbf{X}_{Q}$ (6), each row is an instance or sample, and each column is an attribute or feature.

Our goal is to estimate the regression coefficients in the vector ${\boldsymbol{\theta}}$ from $\mathbf{X}_{Q}$ and $\mathbf{y}$ . The measurement matrix $\mathbf{X}_{Q}\in{\mathbb{R}^{M\times L}}$ includes one DC column, $N$ linear term columns, $N$ quadratic (squared) columns, and $\frac{N(N-1)}{2}$ cross term columns. This matrix has the form (6), and it can be seen that

$$L=1+2N+\frac{N(N-1)}{2}. \tag{8}$$
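The construction of $\mathbf{X}_{Q}$ from $\mathbf{X}$ and the column count $L=1+2N+\frac{N(N-1)}{2}$ can be sketched as follows. The function names `quadratic_expansion` and `num_columns` are hypothetical, and the column ordering (DC, linear, quadratic, cross terms with $n_{1}<n_{2}$) is the one described in the text.

```python
import numpy as np


def quadratic_expansion(X):
    """Map an M x N data matrix X to the M x L quadratic design matrix X_Q."""
    M, N = X.shape
    n1, n2 = np.triu_indices(N, k=1)          # all index pairs with n1 < n2
    return np.hstack([
        np.ones((M, 1)),                      # one DC column
        X,                                    # N linear columns
        X**2,                                 # N quadratic columns
        X[:, n1] * X[:, n2],                  # N(N-1)/2 cross-term columns
    ])


def num_columns(N):
    """Number of columns L of X_Q for N explanatory variables."""
    return 1 + 2 * N + N * (N - 1) // 2
```

For instance, `num_columns(100)` evaluates to 5151, which matches the value of $L$ used in the numerical results for $N=100$.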

To solve this linear inverse problem using an AMP-based approach, we normalize each column of $\mathbf{X}_{Q}$ , $[\mathbf{X}_{Q}]_{\ell}$ to have unit norm, where $\ell\in\{1,2,\dots,L\}$ , and denote this normalized matrix by $\mathbf{X}_{Q}^{\prime}$ ,

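$$\mathbf{y}=\mathbf{X}_{Q}\boldsymbol{\theta}+\mathbf{z}=\mathbf{X}_{Q}^{\prime}\boldsymbol{\theta}^{\prime}+\mathbf{z}, \tag{9}$$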

where each entry of $[\mathbf{X}_{Q}]_{\ell}$ obeys

$$[\mathbf{X}_{Q}^{\prime}]_{\ell m}=\frac{[\mathbf{X}_{Q}]_{\ell m}}{\|[\mathbf{X}_{Q}]_{\ell}\|_{2}},$$

and the regression coefficients satisfy

$$\theta^{\prime}_{\ell}=\theta_{\ell}\,\|[\mathbf{X}_{Q}]_{\ell}\|_{2}.$$
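The column normalization and the matching rescaling of the coefficients can be checked numerically: dividing each column by its norm while multiplying the corresponding coefficient by the same norm leaves the product $\mathbf{X}_{Q}\boldsymbol{\theta}$ unchanged. The helper names below are hypothetical and the sketch is for illustration only.

```python
import numpy as np


def normalize_columns(X_Q):
    """Scale every column of X_Q to unit Euclidean norm; also return the norms."""
    norms = np.linalg.norm(X_Q, axis=0)
    return X_Q / norms, norms


def to_normalized_coeffs(theta, norms):
    """Coefficients theta' that pair with the normalized matrix."""
    return theta * norms


def to_original_coeffs(theta_prime, norms):
    """Undo the rescaling to recover coefficients for the original matrix."""
    return theta_prime / norms
```

After estimating $\boldsymbol{\theta}^{\prime}$ with the normalized matrix, `to_original_coeffs` maps the estimate back to the scale of the original columns.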

III-C SVD of normalized quadratic matrix $\mathbf{X}_{Q}^{\prime}$

While the normalized matrix $\mathbf{X}_{Q}^{\prime}$ converts our quadratic nonlinear inverse problem into a linear one, it contains dependencies between linear and quadratic columns as well as between the linear and cross terms. Unfortunately, it is well known that many solvers for linear inverse problems struggle with such matrices.

Surprisingly, our matrix (6) works well within some AMP-based approaches, as will be demonstrated by numerical results in Section IV. Why does our matrix perform well within AMP? Despite containing dependencies between columns, these dependencies are nonlinear in nature, and linear correlations between columns turn out to be mild. In fact, a singular value decomposition (SVD) of $\mathbf{X}_{Q}^{\prime}$ reveals that it is reasonably well-conditioned. In particular, we have seen numerically that most of the singular values (SVs) seem to follow the semicircle law. That said, the first (largest) SV is larger than suggested by the semicircle law.

To see why the first SV, $\sigma_{1}$ , is larger, recall that $\mathbf{X}_{Q}^{\prime}$ is comprised of one DC column, $N$ linear term columns, $N$ quadratic ones, and $\frac{N(N-1)}{2}$ cross term columns. Because $\mathbf{X}_{Q}^{\prime}$ has unit norm columns, entries of the DC column are $1/\sqrt{M}$ , and so the sum of elements of the first column is $\sqrt{M}$ . The $N$ quadratic columns are non-negative, and because they too have unit norm, the average squared value is $1/M$ , suggesting that the average is $\Theta(1/\sqrt{M})$ . The sums of elements of all $N$ linear and $\frac{N(N-1)}{2}$ cross term columns are near zero, because these are zero mean Gaussian random variables (RVs), and products of zero mean Gaussian RVs, respectively. We see that the first SV, $\sigma_{1}$ , corresponds to an all constant (or roughly all constant) column multiplied by a row that contains significant non-zero entries corresponding to the DC column and $N$ quadratic columns, while row entries corresponding to linear and cross term columns are close to zero.

Under some assumptions, we can estimate the amount of energy represented by the first SV, $\sigma_{1}^{2}$ . Suppose that the original linear columns are Gaussian, $X\sim{\cal{N}}(0,1)$ . Under this assumption, the quadratic element $\chi=X^{2}$ has a chi-squared distribution, where $E[\chi]=E[X^{2}]=1$ and $\text{var}[\chi]=2$ . Therefore, $E[\chi^{2}]=E[\chi]^{2}+\text{var}(\chi)=3$ . As we will need to normalize individual entries of quadratic terms by roughly $\sqrt{3M}$ , the average energy of the DC component of these columns is $1/3$ . Similarly, it can be shown that linear and cross term columns have average energy $1/M$ aligned with the first singular column vector. In summary, the energy in $\sigma_{1}^{2}$ is comprised of (i) unit energy for the DC column; (ii) $N/M$ for the $N$ linear columns; (iii) $N/3$ for the $N$ quadratic ones; and (iv) $\frac{N(N-1)}{2M}$ for cross term columns. Therefore, we predict the total energy in $\sigma_{1}$ to obey

$$\sigma_{1,pred}^{2}=1+\frac{N}{3}+\frac{N(N+1)}{2M}. \tag{11}$$

Our analysis of the first singular value is only approximate, because the first singular vector column is only roughly constant, and while computing the SVD this column is modified in order to maximize the energy of the first rank-one component. Therefore, $\sigma^{2}_{1,pred}$ can be interpreted as a lower bound for $\sigma^{2}_{1}$ . That said, numerical experiments presented in Table I show that our prediction (11) provides a reasonable approximation. In the table, results for several $(M,N)$ pairs are provided. For each pair, we average empirical values for $\sigma_{1}^{2}$ , the energy in the first SV, over 20 matrices; these empirical averages are compared to the prediction (11). It can be seen that the empirical $\sigma_{1}^{2}$ exceeds the prediction by 0.6–0.75 in every row; because unit norm columns in the normalized matrix $\mathbf{X}_{Q}^{\prime}$ imply that the average SV has unit energy, this extra energy seems plausible.
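The prediction $\sigma_{1,pred}^{2}=1+\frac{N}{3}+\frac{N(N+1)}{2M}$ is easy to evaluate and to compare against a single random realization. The sketch below assumes i.i.d. standard Gaussian features, as in the analysis above; the function names are hypothetical, and one realization will fluctuate around the 20-matrix averages reported in Table I.

```python
import numpy as np


def sigma1_sq_pred(M, N):
    """Predicted energy of the first singular value of the normalized quadratic matrix."""
    return 1 + N / 3 + N * (N + 1) / (2 * M)


def sigma1_sq_empirical(M, N, rng):
    """Largest squared singular value of one normalized quadratic matrix."""
    X = rng.standard_normal((M, N))           # assumed i.i.d. N(0, 1) features
    n1, n2 = np.triu_indices(N, k=1)
    X_Q = np.hstack([np.ones((M, 1)), X, X**2, X[:, n1] * X[:, n2]])
    X_Q /= np.linalg.norm(X_Q, axis=0)        # unit-norm columns
    return np.linalg.svd(X_Q, compute_uv=False)[0] ** 2
```

For example, `sigma1_sq_pred(1000, 10)` can be compared with `sigma1_sq_empirical(1000, 10, np.random.default_rng(0))` to reproduce the flavor of the first row of Table I.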

Finally, although we have focused on the normalized quadratic matrix, $\mathbf{X}_{Q}^{\prime}$ , in further numerical work (not reported here) we evaluated a cubic matrix with normalized columns. It too has an SVD where $\sigma_{1}$ is larger while other SVs seem to follow the semicircle law.

III-D AMP-based algorithm

We solve our linear inverse problem (9) using AMP, where two points should be highlighted. First, our denoiser can incorporate the Bayesian prior information. Specifically, we use conditional expectation denoisers that minimize the MSE [17]. Second, owing to the structure of our matrix (Section III-C), various AMP variants that promote convergence can be used [19, 27, 18]. That said, these variants all have their shortcomings, and possible divergence of AMP should be tracked carefully.

IV Numerical Results

Our construction of the quadratic polynomial regression model in Section III results in a linear inverse problem (3) whose solution forms an estimate of a multivariate nonlinear function (2) that relates the explanatory variables to the phenomena of interest. This resulting linear inverse problem will now be solved by two AMP-based approaches, Bayesian AMP and empirical Bayes.

IV-A Bayesian AMP

Non-i.i.d. model for ${\boldsymbol{\theta}}$ : Our Bayesian approach considers four groups of coefficients (7), where $\theta_{1}\in\mathbb{R}$ , $\boldsymbol{\theta_{2}},\boldsymbol{\theta_{3}}\in\mathbb{R}^{N}$ , and $\boldsymbol{\theta_{4}}\in\mathbb{R}^{\frac{N(N-1)}{2}}$ are the DC, linear, quadratic, and cross term coefficients, respectively. We modeled each individual entry among these $L$ coefficients as Bernoulli Gaussian (BG), where the Bernoulli part is a probability $p$ that the entry is nonzero, in which case its distribution is zero mean Gaussian with some variance. To be specific, (i) our DC coefficient obeys $\theta_{1}\sim\mathcal{N}(0,10)$ , meaning that it is zero mean Gaussian with variance 10; (ii) each entry among the $N$ linear term coefficients satisfies $[\boldsymbol{\theta}_{2}]_{n}\sim 0.2\mathcal{N}(0,1)+0.8\delta_{0}$ , i.e., zero mean unit variance Gaussian with probability 0.2, else zero; (iii) the $N$ quadratic term coefficients obey $[\boldsymbol{\theta}_{3}]_{n}\sim 0.2\mathcal{N}(0,0.5)+0.8\delta_{0}$ ; and (iv) for the $\frac{1}{2}N(N-1)$ cross term coefficients, $[\boldsymbol{\theta}_{4}]_{n}\sim 0.03\mathcal{N}(0,0.1)+0.97\delta_{0}$ . Although the four groups of coefficients have different distributions, all $L$ entries that follow this model are statistically independent.

Baseline LASSO algorithm: The baseline algorithm used to solve (9) is the least absolute shrinkage and selection operator (LASSO) [28], which minimizes the sum of squared errors subject to a constraint on the $\ell_{1}$ norm of the coefficients [28]. In our polynomial model, the LASSO estimator $\widehat{\boldsymbol{\theta}}$ is calculated in Lagrangian form:

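$$\widehat{\boldsymbol{\theta}}=\arg\min_{\boldsymbol{\theta}}\frac{1}{2}\|\mathbf{y}-\mathbf{X}_{Q}\boldsymbol{\theta}\|_{2}^{2}+\sum_{j=1}^{4}\lambda_{j}\|\boldsymbol{\theta}_{j}\|_{1}, \tag{12}$$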

where $\lambda_{1},\ldots,\lambda_{4}$ are tuning parameters. In principle, we could perform grid search over all four parameters, $\lambda_{1},\ldots,\lambda_{4}$ , but it is computationally intractable. Therefore, we report the performance obtained by setting all parameters to be equal, which reduces the search space.
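With all tuning parameters set to one common value $\lambda$, the Lagrangian LASSO objective can be minimized by proximal gradient descent (ISTA). The sketch below uses ISTA purely as an illustration; the text does not say which LASSO solver was used for the reported results, and the function names are hypothetical.

```python
import numpy as np


def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1, applied entrywise."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)


def lasso_objective(theta, X, y, lam):
    """0.5 * ||y - X theta||_2^2 + lam * ||theta||_1."""
    return 0.5 * np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))


def lasso_ista(X, y, lam, n_iter=500):
    """ISTA for the LASSO with a single lambda shared by all coefficients."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)          # gradient of the quadratic term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

A grid search over the single value `lam` is then a one-dimensional loop around `lasso_ista`, which is what makes the equal-parameter simplification computationally attractive.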

AMP-based approach: As a proof of concept, we have designed a denoiser specifically for our non-i.i.d. model. Because all $L$ entries that follow this model are statistically independent, we used $L$ scalar denoisers. However, because individual entries among our four groups of coefficients, $\theta_{1}\in\mathbb{R}$ , $\boldsymbol{\theta_{2}},\boldsymbol{\theta_{3}}\in\mathbb{R}^{N}$ , and $\boldsymbol{\theta_{4}}\in\mathbb{R}^{\frac{N(N-1)}{2}}$ follow different distributions, four different scalar denoisers were used. Details of Bayesian denoisers for BG signals appear in [29].
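One way to picture the four scalar denoisers is a conditional-mean BG denoiser whose parameters switch with the coefficient group. The sketch below is an assumption-laden illustration: the names are hypothetical, the priors are those listed for the original coefficients $\boldsymbol{\theta}$, and the rescaling from $\boldsymbol{\theta}$ to $\boldsymbol{\theta}^{\prime}$ by the column norms is ignored for simplicity.

```python
import numpy as np

# (probability of nonzero, variance of the nonzero part) for the DC, linear,
# quadratic, and cross-term groups, taken from the coefficient model in the text.
GROUP_PRIORS = [(1.0, 10.0), (0.2, 1.0), (0.2, 0.5), (0.03, 0.1)]


def bg_posterior_mean(q, tau, p, s):
    """E[theta | q] for theta ~ p N(0, s) + (1 - p) delta_0 and q = theta + N(0, tau)."""
    with np.errstate(divide="ignore"):        # p = 1 gives infinite prior log-odds
        prior_log_odds = np.log(p) - np.log(1.0 - p)
    log_odds = (prior_log_odds + 0.5 * np.log(tau / (s + tau))
                + 0.5 * q**2 * (1.0 / tau - 1.0 / (s + tau)))
    pi = 1.0 / (1.0 + np.exp(-log_odds))      # posterior probability of a nonzero entry
    return pi * s / (s + tau) * q


def denoise_groups(q, tau, N):
    """Apply one scalar denoiser per coefficient group to a length-L vector q."""
    sizes = [1, N, N, N * (N - 1) // 2]
    out = np.empty(len(q), dtype=float)
    start = 0
    for size, (p, s) in zip(sizes, GROUP_PRIORS):
        out[start:start + size] = bg_posterior_mean(q[start:start + size], tau, p, s)
        start += size
    return out
```

For the DC group, where $p=1$, the denoiser reduces to the linear Wiener estimate $\frac{s}{s+\tau}q$, while the sparser groups shrink small inputs much more aggressively.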

Signal generation: We evaluate the performance of AMP in the Bayesian setting, which is a planted inference problem. The experiment allows us to validate the suitability of AMP for the quadratic basis, e.g. (6).

We generated the feature matrix, $\mathbf{X}$ , as i.i.d. Gaussian with dimension $N=100$ . These linear terms were then transformed into a quadratic form $\mathbf{X}_{Q}^{\prime}$ with normalized columns (9). The number of columns in the normalized matrix was $L=5151$ (8), and the number of rows was $M=5400$ . Next, we created quadratic multivariate functions by generating ${\boldsymbol{\theta}}$ vectors following our non-i.i.d. model. The expected energy of each group of coefficients satisfies $E_{DC}=10$ , $E_{linear}=0.2\times N=20$ , $E_{quadratic}=0.2\times N\times 0.5=10$ , and $E_{cross}=0.03\times\frac{N^{2}-N}{2}\times 0.1=14.85$ . Finally, the measurement noise ${\bf z}$ was AWGN with variance $\sigma_{Z}^{2}=0.004$ .
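The coefficient model and the expected group energies quoted above can be reproduced with a short sketch. The function names are hypothetical; the probabilities and variances are the ones stated for the four coefficient groups.

```python
import numpy as np


def expected_energies(N):
    """Expected energy (sum of second moments) of each coefficient group."""
    return {
        "dc": 10.0,
        "linear": 0.2 * N * 1.0,
        "quadratic": 0.2 * N * 0.5,
        "cross": 0.03 * (N * (N - 1) / 2) * 0.1,
    }


def sample_theta(N, rng):
    """Draw one coefficient vector of length L = 1 + 2N + N(N-1)/2."""
    def bg(size, p, var):
        # Bernoulli-Gaussian: N(0, var) with probability p, else exactly zero.
        return rng.normal(0.0, np.sqrt(var), size) * (rng.random(size) < p)

    return np.concatenate([
        rng.normal(0.0, np.sqrt(10.0), 1),    # DC coefficient
        bg(N, 0.2, 1.0),                      # linear coefficients
        bg(N, 0.2, 0.5),                      # quadratic coefficients
        bg(N * (N - 1) // 2, 0.03, 0.1),      # cross-term coefficients
    ])
```

With `N = 100`, the dictionary returned by `expected_energies` matches the four energies listed in the text, and `sample_theta` returns a vector with 5151 entries.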

MSE performance: Fig. 1 shows the MSE performance for estimated coefficients, ${\boldsymbol{\theta}}$. We estimated the coefficients using LASSO, swept AMP (SwAMP) [19], and vector AMP (VAMP) [27]. The left panel of the figure shows the MSE obtained when estimating the original coefficients $\boldsymbol{\theta}$, where the estimator $\widehat{\boldsymbol{\theta}}$ can be calculated using (13),

[TABLE]

$l\in\{1,\ldots,L\}$, and $\widehat{\boldsymbol{\theta}^{\prime}}$ are estimated coefficients of ${\boldsymbol{\theta}^{\prime}}$. SwAMP and VAMP both converge well for normalized quadratic matrices. However, Fig. 1 shows that VAMP requires fewer than one hundred iterations to converge, whereas SwAMP requires a few hundred, and its individual iterations require more computation than those of VAMP; our specific implementation of LASSO requires thousands of iterations. Because our AMP-based approaches are expected to be Bayes optimal while LASSO does not share these optimality properties, it is no surprise that the AMP-based approaches obtain lower MSE.

To make sure that our function reflects the nonlinear function well, the right panel of Fig. 1 shows the MSE obtained when applying our estimated polynomial function to predict test data,

[TABLE]

where we held back $K=600$ test measurements (recall that $M=5400$ ), $\mathbf{X}_{test}\in\mathbb{R}^{K\times L}$ has the same format as $\mathbf{X}_{Q}$ , and $\mathbf{y}_{test}\in\mathbb{R}^{K}$ . Note that the MSE for coefficients, $\boldsymbol{\theta}$ , is inapplicable to real-world problems, because the true coefficients do not exist, and we are merely modeling some nonlinear dependence as a low-order Taylor series. In our synthetic experiment, we are using the MSE over the test data as a metric of interest.
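The held-out evaluation described here can be sketched as follows. The helper names `predict` and `heldout_mse` are hypothetical, and the per-sample averaging (dividing by $K$) is an assumed normalization that may differ from the one used in the paper's test-MSE definition.

```python
def predict(X, theta):
    """Apply the fitted polynomial model: one inner product per test row."""
    return [sum(x * t for x, t in zip(row, theta)) for row in X]


def heldout_mse(X_test, y_test, theta_hat):
    """Average squared prediction error over the K held-out measurements."""
    predictions = predict(X_test, theta_hat)
    squared_errors = [(y - p) ** 2 for y, p in zip(y_test, predictions)]
    return sum(squared_errors) / len(y_test)
```

Because this metric needs only measurements and predictions, it remains computable when no ground-truth coefficients vector is available.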

IV-B Empirical Bayes

Nonlinear function: Nonlinear function learning is now performed using empirical Bayes within AMP [15]. We employ the quadratic formulation (9) and learn the coefficients vector ${\boldsymbol{\theta}}$ to approximate a family of (mildly) nonlinear functions,

[TABLE]

where $w_{1}=0.1$, $w_{2}=0.3$, and $w_{3}=0.6$ are weights of the sinusoids, ${\boldsymbol{\rho}}_{i}\in\mathbb{R}^{N}$ is a BG vector, ${\boldsymbol{X}}{\boldsymbol{\rho}_{i}}\in\mathbb{R}^{M}$, ${\boldsymbol{\phi}}\in\mathbb{R}^{M}$ are phase shifts uniformly distributed between 0 and $2\pi$, the sine is applied element-by-element, and the noise $\boldsymbol{z}\in\mathbb{R}^{M}$ is AWGN with variance $10^{-4}$. Note that the vectors ${\boldsymbol{\rho}}_{i}$ are chosen to be sparse BG so that the coefficients vector $\boldsymbol{\theta}$ that AMP fits to the quadratic expansion is also sparse.

AMP-based empirical Bayes: In contrast to the Bayesian case, we assume that $\boldsymbol{\theta}_{2}$ , $\boldsymbol{\theta}_{3}$ , and $\boldsymbol{\theta}_{4}$ are BG, and their parameters are estimated using maximum likelihood (ML) in each AMP iteration. The DC coefficient $\theta_{1}$ is assumed to be Gaussian. The ML parameters are plugged into Bayesian denoisers for the 4 components.

MSE performance: We generated nonlinear functions and ran our empirical Bayes algorithm, LASSO, and a pseudoinverse approach (least squares). Each run of LASSO requires many iterations, and we use cross validation to regularize the parameter selection procedure. AMP with damping requires fewer iterations than LASSO. Empirical results for different measurement rates, $R=M/L$, appear in Table II. AMP obtains lower MSE than LASSO, which in turn obtains lower MSE than the pseudoinverse.

V Discussion

In this paper, we studied nonlinear function estimation, where a nonlinear function of interest is regressed on a set of features. We linearized the problem by considering low-order polynomial kernel expansion, and solved the resulting linear inverse problem using approximate message passing (AMP). Numerical results confirm that our AMP-based approaches learn the function better than the widely used least absolute shrinkage and selection operator (LASSO) [28], offering markedly lower error in predicting test data for both Bayesian and non-Bayesian settings.

While we have presented a first step toward estimating nonlinear functions by applying AMP to polynomial regression, many open problems remain.

Dependencies between coefficients: In past work, we used non-scalar sliding window denoisers to process coefficient vectors ${\boldsymbol{\theta}}$ that contained dependencies between entries [22, 14]. It is not clear whether similar dependencies will appear in our ${\boldsymbol{\theta}}$ . While it seems plausible that exposure weights corresponding to the $N$ original columns, the $N$ quadratic terms, and $N(N-1)/2$ cross terms will have different distributions, it is not clear whether each group is i.i.d. or contains intra-group dependencies. In ongoing work, we are processing all terms corresponding to the same original column (the original column, its quadratic, and $N-1$ associated product columns) together, which could be processed with block denoising. This form of joint processing will support possible dependencies between lower order Taylor coefficients and higher order ones; such dependencies have been noted between parent and children wavelet coefficients [30].

Other kernels: In this paper, we considered a second-order polynomial kernel. Future work will naturally extend to selecting the degree of the polynomial kernel as well. Further, we will consider other widely used kernels.

Results on real datasets: While we reported promising results for nonlinear function estimation with AMP in Bayesian and empirical Bayes settings, the performance of our algorithms must be tested on real datasets. In these datasets, various problems may appear: for example, the prior may be unavailable; the measurement matrix may be poorly conditioned; the function of interest may not belong to the hypothesis class; and the noise may be heavy tailed [2], resulting in a mismatched estimation problem. We will explore the application of more advanced adaptive variants of AMP in the absence of a known prior [12, 13, 14]. When the true function does not belong to the hypothesis class, which consists of polynomials of degree two or three in this paper, the best one can hope for is to recover the function of interest up to a projection error onto the hypothesis class. We will also explore the usual bias/variance trade-offs that arise in such settings.

Nonlinear acquisition and reconstruction: Since the work of Gauss and his contemporaries [31], hardware designers have been keenly aware that the mathematics involved in processing linearly obtained measurements is more mature than that for nonlinear measurements. However, algorithms that estimate multivariate nonlinear functions can also be used to reconstruct signals measured nonlinearly. The same polynomial kernels [6, 7] used above to expand the matrix can also be used to approximate a nonlinear function with a linear one. Such advances will allow designers to stop worrying about the nonlinearities inherent in many hardware systems.

Acknowledgment

The authors are greatly indebted to Yanting Ma, who demonstrated favorable preliminary results for various AMP variants on a quadratic matrix. DB also thanks Andrew Barron for helping him appreciate the importance of nonlinear function estimation. Finally, this work was partly supported by the National Science Foundation under Grant Nos. ECCS-1611112 and CNS 16-24770, and the industry members of the Center for Advanced Electronics in Machine Learning.

Bibliography

[1] L. A. Dalton and E. R. Dougherty, “Optimal classifier with minimum expected error within a Bayesian framework - Part 1: Discrete and Gaussian model,” Pattern Recognition, vol. 46, pp. 1301–1314, Nov. 2012.
[2] R. C. Grinold and R. N. Kahn, Active portfolio management: a quantitative approach for providing superior returns and controlling risk. McGraw-Hill Companies, 2000.
[3] E. Fama and K. French, “Common risk factors in the returns on stocks and bonds,” J. Finan. Econ., vol. 33, no. 1, pp. 3–56, 1993.
[4] N. Jegadeesh and S. Titman, “Returns to buying winners and selling losers: Implications for stock market efficiency,” J. Finance, vol. 48, no. 1, pp. 65–91, 1993.
[5] L. K. Chan, J. Lakonishok, and T. Sougiannis, “The stock market valuation of research and development expenditures,” National Bureau of Economic Research, Tech. Rep., 1999.
[6] J. Fan, N. E. Heckman, and M. P. Wand, “Local polynomial kernel regression for generalized linear models and quasi-likelihood functions,” J. Amer. Stat. Assoc., vol. 90, no. 429, pp. 141–150, 1995.
[7] K. I. Kim, K. Jung, and H. J. Kim, “Face recognition using kernel principal component analysis,” IEEE Signal Process. Lett., vol. 9, no. 2, pp. 40–42, 2002.
[8] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed sensing,” Proc. Nat. Academy Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
