An Infinite Dimensional Analysis of Kernel Principal Components
Palle E.T. Jorgensen, Sooran Kang, Myung-Sin Song, Feng Tian

TL;DR
This paper develops a new kernel PCA framework for nonlinear data dimension reduction, extending classical PCA with manifold and feature space transforms, and provides theoretical insights into optimal kernel choices.
Contribution
Introduces a novel kernel PCA method with theoretical analysis for optimal Gaussian kernel selection and extends probabilistic Karhunen-Loève transforms to nonlinear settings.
Findings
Proves new theorems for data-dimension reduction.
Identifies conditions for optimal Gaussian kernel choice.
Enhances digital image representation and compression.
Abstract
We study non-linear data-dimension reduction. We are motivated by the classical linear framework of Principal Component Analysis. In the nonlinear case, we introduce instead a new kernel-Principal Component Analysis, manifold and feature space transforms. Our results extend earlier work for probabilistic Karhunen-Loève transforms on compression of wavelet images. Our object is algorithms for optimization, selection of efficient bases, or components, which serve to minimize entropy and error; and hence to improve digital representation of images, and hence of optimal storage, and transmission. We prove several new theorems for data-dimension reduction. Moreover, with the use of frames in Hilbert space, and a new Hilbert-Schmidt analysis, we identify when a choice of Gaussian kernel is optimal.
Taxonomy
Topics: Image and Signal Denoising Methods · Image Processing Techniques and Applications · Mathematical Analysis and Transform Methods
An Infinite dimensional Analysis of Kernel Principal Components
Palle E.T. Jorgensen
(Palle E.T. Jorgensen) Department of Mathematics, The University of Iowa, Iowa City, IA 52242-1419, U.S.A.
,
Sooran Kang
(Sooran Kang) College of General Education, Choongang University, Seoul, Korea
,
Myung-Sin Song
(Myung-Sin Song) Department of Mathematics and Statistics, Southern Illinois University Edwardsville, Edwardsville, IL 62026, USA
and
James Tian
(James F. Tian) Mathematical Reviews, 416 4th Street Ann Arbor, MI 48103-4816, U.S.A.
Key words and phrases:
dimension reduction, principal component analysis, eigenvalues, optimization, Karhunen-Loève transform, optimal storage, transmission, wavelet image compression, frames, entropy encoding, algorithms, Hilbert space.
2000 Mathematics Subject Classification:
62H25, 34L16, 65K10, 65T60, 42C15, 47B06, 47B32, 65D15, 47A70, 37M25, 42C40.
Contents
Data and digital image illustrations
- 2.1 Spectral clustering and linear decision boundary via Gaussian kernels and KPCA.
- 2.2 Non-linear data, and detection of rotation angles via KPCA.
1. Introduction
Recently a number of new features of principal component analysis (PCA) have led to exciting new and improved methods for dimension reduction (DR). See e.g., [BN03, GGB18, JHZW19, GJ06, Bis06, Bis13, ZBB04, AFS18, VD16]. In general, DR refers to the process of reducing the number of random variables under consideration in such areas as machine learning, statistics, and information theory. Within machine learning, it involves both the steps of feature selection and feature extraction. In the present paper, we shall consider linear as well as non-linear data models. The linear case arises naturally in principal component analysis (PCA). See [Son08, JS07]. Here one starts with a linear mapping of the given data into a suitable lower-dimensional space. However, this must be done in such a way that the variance of the data in the low-dimensional representation is maximized. As for the variance, we study both the covariance matrix and the correlation matrix for the underlying data. The eigenvectors that correspond to the largest eigenvalues (the principal components) are then used to reconstruct a large fraction of the variance of the original data. The first few eigenvectors are typically interpreted in terms of the large-scale physical behavior of a particular system, and will retain the most important variance features.
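The linear procedure just described can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: the function name `pca`, the use of NumPy, and the synthetic two-dimensional data are all assumptions made here for the example.

```python
import numpy as np

def pca(X, k):
    """Linear PCA of the rows of X (n_samples x n_features), keeping k components."""
    Xc = X - X.mean(axis=0)                      # center each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)           # sample covariance matrix
    evals, evecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]              # reorder: largest variance first
    evals, evecs = evals[order], evecs[:, order]
    components = evecs[:, :k]                    # principal directions (columns)
    scores = Xc @ components                     # low-dimensional representation
    explained = evals[:k].sum() / evals.sum()    # fraction of variance retained
    return components, scores, evals, explained

# Hypothetical synthetic data: large spread along (1, 1), small spread along (1, -1).
rng = np.random.default_rng(0)
t = rng.normal(size=500)
noise = 0.1 * rng.normal(size=500)
X = np.column_stack([t + noise, t - noise])

components, scores, evals, explained = pca(X, k=1)
print("leading direction:", components[:, 0])
print("fraction of variance retained:", explained)
```

The variance of the one-dimensional scores equals the largest eigenvalue of the covariance matrix, which is the sense in which the leading eigenvector maximizes the retained variance.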
In nonlinear settings, principal component analysis can still be adapted, but now by means of suitable kernel tricks. In some applications, instead of starting from a fixed kernel, the optimization will try to learn, or adapt, the kernel, for example with the use of semidefinite programs [WGY10, BM03]. The most prominent such technique is known as maximum variance unfolding (MVU) [ACP13, SBXD06].
The aim of this paper is to set the stage for further developments linking the mathematics of kernel-PCA (KPCA) with real-life non-linear data-sets, coming from a host of new developments in the current applied literature. A key focus for our work is a study of two complementary directions: on one side, the mathematics of kernel theory, and on the other, empirical work (here, numerical experiments starting with databases from real-world situations). For this purpose, we develop (Section 2) particular kernel tools with a view to KPCA via statistical simulation, as well as corresponding analyses of kernel-based choices of feature spaces and feature maps.
In recent years, the subject of kernel-principal component analysis (KPCA), and its applications, has been extensively studied, and has progressed in diverse directions. The literature dealing with applications of kernel-PCA tools is vast and covers a variety of themes, each of some relevance to our present theme. A short user guide: the paper [GGB18] deals with sensitivity issues, [JHZW19] direct integration algorithms, [GJ06] image processing, [Bis06, Bis13, AFS18] pattern recognition and machine learning, [ZBB04] statistical properties, and finally [VD16] selection of efficient parameterizations. In addition to earlier papers by the co-authors [Son08, JS07], we also include here a partial list of other relevant and current citations; see e.g., the papers mentioned above, [LHN18, Raj18, WGLP19, RS99, GGB18, DWGC18, THH19, LLY+19, MMP19, CLS+19, GK19].
1.1. Reproducing kernels and operator theory
In our present paper we stress, and motivate, the most general framework for choices of positive definite kernels, their corresponding RKHSs, and families of compatible measure spaces. These considerations encompass more general contexts for Principal Component Analysis, going beyond classical cases of kernel-based PCA in the literature. In particular, this very general and non-linear framework for PCA data-dimension reduction goes beyond existing, more classical, considerations admitting easy realizations in , classical kernels, and classical Fourier tools. Indeed, there are existing parallel considerations, in the literature involving only kernel analyses for such special cases.
However, existing non-linear realizations of big and unstructured data motivate choices from a much wider family of RKHSs, and measure spaces. This further entails infinite-dimensional stochastic analysis frameworks which do not admit realizations in , nor classical PDE/Fourier techniques. Examples of uses, in the literature, of general RKHSs in stochastic approaches include online learning with Markov sampling, data mining (e.g., in finance, intelligence, hidden data, telecommunication, energy, unstructured information integrated with traditional structured data), as well as a diverse host of different investigations, see e.g., [ACP13, AHR+20, AJS14, Bis06, BM03, BTA04, CS02, CWG19, JT16a, JT18a, JT18b, Mok07, SZ09a, SZ09b, ZBB04, JT17, Agg18, Hog20]. Moreover, “big data” further dictate diverse, large, infinite-dimensional, and extremely non-linear data-sets which do not lend themselves to feature analysis with choices of any one of the simple kernels with easy realizations in . Examples of this include many publicly available datasets from the Global Health Observatory, U.S. Census Bureau, NIH Data Sharing Repositories, Kaggle, etc. Such cases of course dictate new and diverse choices from a much wider family of kernels and RKHSs. Encoding of meaningful programs for corresponding non-linear data-dimension reduction algorithms, while feasible, is a gigantic task, going far beyond the scope of our present paper. It is the subject of future projects involving the present authors, as well as others, see e.g., the papers cited above.
In addition, we call attention to the 2018 paper [YTLM+18] by Yamada et al. It is motivated by existing and new biological data. For this purpose, the authors introduce a novel Hilbert-Schmidt independence criterion. They apply it to specific high dimensional, and large-scale datasets, as arising in ultra-high dimensional biological data. The authors obtain classification into phenotypes, module expression in human prostate cancer patients, and detection of enzymes in protein structures. The 2019 paper [CLW+19] by Chen et al. deals with a new class of kernel-based nonlinear clustering algorithms. They study locality structures in machine learning. Also motivated by specific large-scale datasets is the 2020 paper [UTK20]. This paper introduces new hyper-parameter initialization tools for kernel-based regression analysis. Different but related kernel-based technology, and specific data sets, are further covered in the papers [FZ05], [WSS04], [MMR+01], and [DSR+21].
Motivated by related applications, we focus on non-traditional positive definite kernels and the associated RKHSs, in particular, when the family of feature spaces may be chosen in the form . This raises the question of which measures are right for a particular kernel and its RKHS.
In the use of probabilistic tools in big data, such as Monte Carlo simulation (see e.g., (1.1) below), a main issue is reduction of the required sample size while still maintaining (i) acceptably small noise in the output, (ii) small error term estimates, and (iii) reduced computation time. These issues are especially relevant for modern image processing. As a consequence, more traditional kernel tools must be adapted. Both our present paper, and an earlier one [JT22], serve to do this by shifting the focus in the design of optimization algorithms to that of making choices of measurable partitions, as opposed to working directly with points.
For instance, when is defined on a σ-finite measure space , where , for all in , the associated RKHS is shown to be a Hilbert space of signed finite measures on , such that and ; see [JT22, Thms 3.2, 3.4]. Moreover, in this measurable setting, the classical Karhunen-Loève decomposition takes the following form instead:
[TABLE]
where is a centered Gaussian process indexed by , is an orthonormal basis (ONB) in , and is an i.i.d. system. These new kernels encompass realistic cases when random measurements (random variables) might not be performed at points, but instead at sample sets that are selected from suitable choices of sigma-algebras. For more details on the more general context of Parseval frames in the measure category, see 2.2 below, and [JT22, Cor 3.13 and sect. 5].
Main results. The presentation of our main results is organized as follows: Section 2 covers multiple interrelated tools, playing a key role in our main results. The first of these tools is our use of kernels (also called reproducing kernels) and their associated Hilbert spaces, called reproducing kernel Hilbert spaces (RKHSs). The use of these RKHSs yields powerful tools, and they allow explicit realizations of a variety of embeddings of non-linear structures into linear ones (Hilbert spaces). This is useful in turn for our solution to optimization questions, see 2.14 and 2.16. Our results are tested on datasets, see Figures 2.1 and 2.2.
Organization. The goal of this paper is to extend our previous results on Karhunen-Loève transform to a nonlinear setting by means of kernel-principal component analysis (KPCA). This paper is organized as follows:
In Section 2, we show our main results using KPCA on nonlinear data. See 2.9. The latter is for rank-1 projections, but is then extended to PCA selection, and algorithms, in subsequent results. Our focus is the non-linear case. Indeed, the focus in Section 2 is nonlinear data dimension reduction, and the corresponding kernels, the core of our paper. In particular, Theorem 2.11 deals with kernel PCA for nonlinear data dimension reduction (see Examples 2.17 and 2.18). In the remaining part of Section 2, we turn to the case of Gaussian kernels.
KPCA has found wide application. The focus of our current paper is to present a framework of operators in a suitable Hilbert space, and an associated spectral theory. Even though the applications we include here are presented in a finite-dimensional setting, most of our results extend to infinite-dimensional spaces as well. Nonetheless, for use in recursion schemes, the finite-dimensional case is most relevant.
Our main results deal with algorithms for optimization in maximal variance, and dimension-reduction problems, from PCA. The points where our results go beyond the earlier literature include the following four closely interrelated items: (i) our design and use of kernel tricks (specific reproducing kernel Hilbert spaces (RKHSs)) integrated into new designs of PCA-tools for image-analysis, thus serving as practical tools in dimension-reduction algorithms; (ii) our combination of RKHS-tools with spectral theory and Hilbert space frame-estimates, which serves in turn to make more precise both specific numerical PCA-algorithms, and their error estimates; (iii) our use of RKHSs in creating explicit and practical embeddings of non-linear data into suitable linear spaces (feature spaces), with applications to optimization (Section 2.1); and (iv) a new dynamical PCA-analysis (Section 2.6).
Our results in Section 2 include 2.9, 2.11, and 2.16, each of which is formulated and proved in a general setting of kernel analysis; hence in a non-linear framework of feature selection.
Our main purpose is the identification and proof of rigorous results from infinite-dimensional analysis which are needed in the solution of optimization problems in the general framework of Principal Component Analysis, with an emphasis on the role of PCA-algorithms for selection of features in infinite dimensions. Main points in the paper include our infinite-dimensional optimization results, Theorems 2.5, 2.6, 2.11, Remark 2.15, and Corollaries 2.16 and 2.20.
2. Kernel PCA
Comparison of KPCA to standard PCA (brief sketch). Standard PCA always finds linear principal components. It serves to represent a given large data set in terms of a suitable choice of principal components in lower dimension. But PCA will fail to find good representative directions when applied to most non-linear data of interest. For non-linear and more tricky data sets, one turns instead to Kernel PCA (KPCA). It has the following advantages and features (not a complete list): (i) KPCA will perform PCA but in a new space. (ii) It uses diverse kernel tricks in order to still find principal components, but they will be in a different space (typically a higher dimensional space). (iii) KPCA will use Hilbert space geometry in order to still find new directions based on families of kernel-matrices. Via geometric algorithms, KPCA will then extract corresponding eigenvalues (corresponding to a suitable number of observations). However, (iv) the extraction of principal components in KPCA has higher computational complexity, and so takes more time, than in standard PCA. The list of papers on KPCA and applications to diverse sets of non-linear data is long; some recent ones are [HKF+20, ZSS19, Law05, AH11, HLMS04, Nai17, FCB].
In this section, we make precise, and show, that the choice of Gaussian kernel is “optimal” for KPCA in capturing maximal variance (i.e., maximal variability of data). There is an extensive literature on PCA and KPCA. Our approach and emphasis are different. For instance, the paper [RT01] deals with certain least squares regression problems with the use of a particular (small) class of tools involving reproducing kernel Hilbert spaces. While there are connections to our present themes, our present results are much more general, offering in particular specific spectral theoretic and stochastic tools, and penalty terms, especially relevant for digital images, and for dynamical PCA. The focus of [SSM98] is the design of certain nonlinear component analysis as a particular class of kernel eigenvalue problems. By comparison, our present results and applications have a different focus, and use different tools as regards infinite-dimensional analysis, optimization, and Gaussian fields, as we outline below. The paper [KW71] deals with a particular class of splines (numerical analysis). This in turn entails a class of kernels. But the focus of [KW71] is quite different from ours. The paper [WMd97] deals with matrix algorithms close in spirit to our present paper, but our focus and applications are global in nature, and are very different, as we outline below. The book [CST00] deals with kernel methods and some of their applications to support vector machines. This is an exciting application of feature selection, and learning algorithms, but the aim and the tools developed in this book are quite different from our present focus.
This allows us to obtain extensions of some PCA-results from [JS07]. We shall focus here on their use in kernel PCA (KPCA).
PCA is used for data dimension reduction in the linear case. However, it is not suited to the nonlinear case, and thus kernel principal component analysis (KPCA) is used for nonlinear dimension reduction. See, e.g., [SZ09a, SZ09b, SZ07, SY06, PS03, CS02].
Standard PCA is effective at identifying linear subspaces carrying the greatest variance in a data set. However, this method is not able to detect nonlinear submanifolds. A popular technique to tackle the latter case is kernel PCA. It first maps data into a higher dimensional space , and performs PCA there. Here is the reproducing kernel Hilbert space (RKHS) associated with a given positive definite kernel . (Recall that an RKHS is a Hilbert space of functions defined on a set , such that the evaluation functionals are bounded for all . See also 2.2 below.) The mapping in this context presumably sends a nonlinear submanifold in the input space to a linear subspace in . For example, in classification problems, a kernel is usually chosen so that the mapped data can be separated by a linear decision boundary in (see 2.1).
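The recipe "map into the RKHS, then perform PCA there" never needs the feature vectors themselves, only the Gram matrix of the kernel. The following is a minimal sketch under stated assumptions: a Gaussian kernel with a hypothetical bandwidth `sigma`, hypothetical helper names `gaussian_kernel` and `kernel_pca`, and synthetic data on two concentric rings. It is not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gram matrix with entries exp(-|x - y|^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_pca(X, k, sigma):
    """Top-k kernel principal components of the rows of X."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    J = np.eye(n) - np.ones((n, n)) / n          # centering projection
    Kc = J @ K @ J                               # Gram matrix of centered feature vectors
    evals, evecs = np.linalg.eigh(Kc)            # ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]          # keep the k largest
    evals, evecs = evals[order], evecs[:, order]
    # Projection of the i-th centered feature vector onto the j-th unit-norm
    # principal axis in the RKHS equals sqrt(evals[j]) * evecs[i, j].
    scores = evecs * np.sqrt(np.maximum(evals, 0.0))
    return evals, scores

# Hypothetical nonlinear data: two concentric rings of radius 1 and 3.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
radius = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

evals, scores = kernel_pca(X, k=2, sigma=1.0)
print("top eigenvalues of the centered Gram matrix:", evals)
```

Working with the centered Gram matrix is a design choice that mirrors the centering step of linear PCA; the columns of `scores` are mutually orthogonal and their squared norms are the retained eigenvalues.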
Remark 2.1.
It would be intriguing to compare Smale’s Dimension Reduction algorithm from [SZ09a] with ours. The two approaches are along different lines of development.
The approach in Belkin’s paper [BN03] is popular in current Machine Learning research. Both our results and those of Belkin et al. aim for dimension reduction algorithms. Other methods exist, which constitute variants of KPCA, but with different choices of kernels, and with the Laplacian eigenmap (LE) as one of them. (See also [CWG19, VVQCR+19, TF19, SGS+19].) For recent developments on graph Laplacians, and Perron-Frobenius eigenfunctions as principal components, we refer to e.g., [BJ02].
As applications of the theory presented in the last section, 2, we include a discussion of new non-linear, and real-life data/simulated examples. Moreover, inside the sections, we offer explanations for our use of RKHS-theory, as applied in new KPCA based Gaussian process simulations. In particular, we outline how they serve to produce optimal choices for detection of maximal variance.
The optimization issues in the present section refer to details in subsections 2.3 and 2.4; see especially 2.9, 2.10 and 2.11, all adapted to the PCA issues at hand. Our notation for solutions to optimization problems is “argmax” (standard terminology in the optimization literature). The role of kernels enters via associated choices of feature maps , see (2.45), (2.46), (2.47), (2.48), (2.49), and subsection 2.5. Furthermore, our choices of frames, global frame-operators, and their adjoints (2.14), then allow a duality approach which in turn helps us gain insight into algorithms for the optimizers.
By a feature we mean an individual measurable property or characteristic. One chooses informative, discriminating and independent features to be used in algorithmic constructions in machine learning, in pattern recognition, and regression. Vector spaces associated with particular features are called feature spaces. In KPCA, the choices of feature spaces come about from specified kernels (assumed positive definite, p.d.), see Definition 2.2 below. If is a fixed p.d. kernel defined on general non-linear sets of data, a natural choice of feature space is then a Hilbert space called the reproducing kernel Hilbert space (RKHS), written , see Remark 2.3 below.
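Definition 2.2.
Let $S$ be a set. A positive definite (p.d.) kernel on $S$ is a function $K : S \times S \to \mathbb{C}$, such that
$$\sum_{i=1}^{n}\sum_{j=1}^{n} \overline{c_i}\, c_j\, K(x_i, x_j) \geq 0 \tag{2.1}$$
for all $\{x_i\}_{i=1}^{n} \subset S$, $\{c_i\}_{i=1}^{n} \subset \mathbb{C}$, and $n \in \mathbb{N}$.
Given a p.d. kernel $K$ as in (2.1), there exists a reproducing kernel Hilbert space (RKHS) $\mathcal{H}(K)$ and a mapping $\Phi : S \to \mathcal{H}(K)$ such that
$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}(K)}. \tag{2.2}$$
The function $\Phi$ in (2.2) is called a feature map for the problem.
Moreover, the following reproducing property holds:
$$f(x) = \langle K(\cdot, x), f \rangle_{\mathcal{H}(K)} \tag{2.3}$$
for all $f \in \mathcal{H}(K)$, and $x \in S$.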
While the theory of positive definite kernels and their RKHSs has served for a long time as a powerful tool in diverse areas of pure and applied mathematics, the particular applications we need here are of a more recent vintage; for background literature, see e.g., [AJ15, BTA04, JT15, JT18a, JT18b]. The theory of frames (non-orthogonal expansions) also plays an essential role in our present approach to PCA, and the list of relevant papers for background includes [HKLW07, HWW05, CCEL15, CCK13].
Feature maps. In the subsequent discussion, reference to a kernel will always mean a positive definite kernel, see (2.1). For each choice of kernel, defined on data sets (typically non-linear configurations), there will then be associated feature maps. Generally, a feature map is simply a function which maps sets of data configurations (non-linear) into some choice of feature space, a Hilbert space, say . Typically the inner product in will serve to model correlation, or regression numbers, or other measurements corresponding to other features of interest. Given a kernel , a natural choice of feature space is the RKHS defined directly from , see (2.2), but there are other possibilities. In a different choice for , the range of , and the inner product on the right-hand side in (2.2) might be different. The main logic in the use of kernels in machine learning is that it yields representations of learning algorithms for data sets that will then become more amenable to regression analysis, detection, learning, and classification. Analysis is aided by the use of powerful tools from the theory of linear operators in Hilbert space; see e.g., [HKLW07, JT18b, SZ07]. We stress that the phrase feature map is broad, and that a wide variety of functions may serve as feature maps. But the main use of them relates to suitable choices of kernels. Support Vector Machines (and other kernel-based methods) make use of both implicit and explicit feature maps. This leads to a remapping of data, and allows data sets that are not linearly separable to get linear representations. The feature space leads to effective separation of data via suitable choices of hyperplanes in higher dimension. But reaching these dimensions might be computationally expensive, because bad choices of feature mappings might require many computations.
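As a concrete illustration of an explicit feature map, consider the homogeneous quadratic kernel on the plane. This kernel is an assumption chosen here for simplicity; it is not one of the kernels studied in the paper, and the names `poly2_kernel` and `poly2_feature_map` are hypothetical. The sketch checks the factorization of the kernel as an inner product of features, and that the resulting Gram matrix has no negative eigenvalues beyond rounding error.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous quadratic kernel K(x, y) = (x . y)^2 on R^2."""
    return float(np.dot(x, y)) ** 2

def poly2_feature_map(x):
    """Explicit feature map Phi: R^2 -> R^3 with K(x, y) = <Phi(x), Phi(y)>."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
points = rng.normal(size=(6, 2))                 # hypothetical sample points

# Gram matrix computed directly from the kernel ...
gram = np.array([[poly2_kernel(p, q) for q in points] for p in points])
# ... and from inner products of the explicit features.
features = np.array([poly2_feature_map(p) for p in points])
gram_from_features = features @ features.T

min_eig = np.linalg.eigvalsh(gram).min()         # positive definiteness check
print("max discrepancy:", np.abs(gram - gram_from_features).max())
print("smallest Gram eigenvalue:", min_eig)
```

Here the feature space is three-dimensional, so the explicit map is cheap. For a Gaussian kernel the feature space is infinite-dimensional, which is why one usually works with the Gram matrix instead.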
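Remark 2.3.
$\mathcal{H}(K)$ may be chosen as the Hilbert completion of
$$\operatorname{span}\{K_x := K(\cdot, x) : x \in S\} \tag{2.4}$$
with respect to the $\mathcal{H}(K)$-inner product
$$\Big\langle \sum_i c_i K_{x_i}, \sum_j d_j K_{y_j} \Big\rangle_{\mathcal{H}(K)} = \sum_i \sum_j \overline{c_i}\, d_j\, K(x_i, y_j). \tag{2.5}$$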
Initially the LHS in formula (2.5) only refers to finite linear combinations. Hence, the vector space (2.4) becomes a pre-Hilbert space. (By pre-Hilbert space we mean an inner product space that is not complete.) The RKHS itself then results from the standard Hilbert completion. It is this Hilbert space we will use in our subsequent study of optimization problems, and in our KPCA-dimension reduction. Sections 2.1–2.3 deal with separate issues of kernel-optimization. Before turning to these, however, we will first introduce a setting of Hilbert-Schmidt operators. This will play a crucial role in the formulation of our main result, Theorem 2.11 in Section 2.3.
Recall that a data set , , may be viewed as an matrix , where is the column vector. Here, is the number of features, and the number of sample points. Set
[TABLE]
where in (2.6) denotes the Hilbert-Schmidt norm.
Remark 2.4.
Let be a Hilbert space, and let be the Hilbert-Schmidt operators with inner product
[TABLE]
Then the two Hilbert spaces , and (tensor-product), are naturally isometrically isomorphic via
[TABLE]
Indeed,
[TABLE]
and the assertion follows from isometric extension of (2.8).
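In finite dimensions the identification in this remark can be checked numerically. The sketch below makes several assumptions that are not taken from the text: the Hilbert space is $\mathbb{C}^4$, the Hilbert-Schmidt inner product is $\operatorname{trace}(A^*B)$, and the rank-one operator built from $u$ and $v$ is matched with $u \otimes \bar{v}$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4

def random_vector():
    """A random vector in C^n."""
    return rng.normal(size=n) + 1j * rng.normal(size=n)

def hs_inner(A, B):
    """Hilbert-Schmidt inner product <A, B> = trace(A* B)."""
    return np.trace(A.conj().T @ B)

u1, v1, u2, v2 = (random_vector() for _ in range(4))

A = np.outer(u1, v1.conj())          # rank-one operator x -> <v1, x> u1
B = np.outer(u2, v2.conj())
tensor_A = np.kron(u1, v1.conj())    # coordinates of u1 (x) conj(v1)
tensor_B = np.kron(u2, v2.conj())

hs_value = hs_inner(A, B)
tensor_value = np.vdot(tensor_A, tensor_B)   # vdot conjugates its first argument
print("Hilbert-Schmidt inner product:", hs_value)
print("tensor-product inner product: ", tensor_value)
```

In coordinates the isomorphism is simply the row-major flattening of the matrix, which is why the two inner products agree and why the Hilbert-Schmidt norm of a rank-one operator is the product of the norms of its two vectors.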
2.1. Application to Optimization
One of the more recent applications of kernels and the associated reproducing kernel Hilbert spaces (RKHS) is to optimization, also called kernel-optimization. See [YLTL18, LLL11]. In the context of machine learning, it refers to training-data and feature spaces. In the context of numerical analysis, a popular version of the method is used to produce splines from sample points; and to create best spline-fits. In statistics, there are analogous optimization problems going by the names “least-square fitting,” and “maximum-likelihood” estimation. In the latter instance, the object to be determined is a suitable probability distribution which makes “most likely” the occurrence of some data which arises from experiments, or from testing.
What these methods have in common is a minimization (or a max problem) involving a “quadratic” expression with two terms. The first in measures a suitable -square applied to a difference of a measurement and a “best fit.” The latter will then be chosen from a number of suitable reproducing kernel Hilbert spaces (RKHS). The choice of kernel and RKHS will serve to select desirable features. So we will minimize a quantity which is the sum of two terms as follows: (i) a -square applied to a difference, and (ii) a penalty term which is an RKHS norm-squared. (See eq. (2.10).) In the application to determination of splines, the penalty term may be a suitable Sobolev norm-squared; i.e., norm-squared applied to a chosen number of derivatives. Hence non-differentiable choices will be “penalized.”
In all of the cases discussed above, there will be a good choice of (i) and (ii), and we show that there is then an explicit formula for the optimal solution; see eq. (2.14) in Theorem 2.5 below.
Let be a set, and let be a positive definite (p.d.) kernel. Let be the corresponding reproducing kernel Hilbert space (RKHS). Let be a sigma-algebra of subsets of , and let be a positive measure on the corresponding measure space . We assume that is sigma-finite. We shall further assume that the associated operator given by
[TABLE]
is densely defined and closable.
Fix , and , and set
[TABLE]
defined for , or in the dense subspace where is the operator in (2.9). Let
[TABLE]
be the corresponding adjoint operator, i.e.,
[TABLE]
Our present RKHS/ framework is close to that of [SS16]. But we have included the results we need in our present framework, and for our purpose.
Theorem 2.5.
Let , , , be as specified above; then the optimization problem
[TABLE]
has a unique solution in , it is
[TABLE]
where the operator and are as specified in (2.9)-(2.12).
Proof.
We fix , and assign where varies in the dense domain from (2.9). For the derivative $\frac{d}{d\varepsilon}\big|_{\varepsilon=0}$ we then have:
[TABLE]
for all in a dense subspace in . The desired conclusion follows. Note that convexity of the function makes these conditions sufficient as well. ∎
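The variational argument can be written out as follows. This is a sketch in assumed notation, since the displayed formulas are not reproduced here: $T : \mathcal{H}(K) \to L^2(\mu)$ is the operator from (2.9), $\psi \in L^2(\mu)$ is the given data, $\beta > 0$ is the penalty weight, $F$ is the candidate minimizer, $h$ is a direction in the domain of $T$, and inner products are taken linear in the second variable. Domain questions for unbounded $T$ are ignored.

```latex
% Assumed notation: T : \mathcal{H}(K) \to L^2(\mu), \psi \in L^2(\mu), \beta > 0.
\begin{align*}
Q(f) &= \|\psi - Tf\|_{L^2(\mu)}^2 + \beta \|f\|_{\mathcal{H}(K)}^2, \\
Q(F + \varepsilon h) &= Q(F)
  + 2\varepsilon \,\mathrm{Re}\,\big\langle h, (\beta I + T^*T)F - T^*\psi \big\rangle_{\mathcal{H}(K)}
  + \varepsilon^2 \big( \|Th\|_{L^2(\mu)}^2 + \beta \|h\|_{\mathcal{H}(K)}^2 \big), \\
\frac{d}{d\varepsilon}\Big|_{\varepsilon=0} Q(F + \varepsilon h) = 0 \ \text{for all } h
  &\iff (\beta I + T^*T)F = T^*\psi
  \iff F = (\beta I + T^*T)^{-1} T^*\psi .
\end{align*}
```

The coefficient of $\varepsilon^2$ is strictly positive for $h \neq 0$, which is the convexity that makes the stationarity condition sufficient and the minimizer unique.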
As already mentioned, the general conclusion in Theorem 2.5 connects both to many uses of kernel techniques in classical analysis, and to new and exciting applications. In the second group, we mention the following approach to Neural Networks [YO21], leading in turn to encoding of deep regularization into training of the inner layers which make up specific Neural Network (NN) constructions, with the use of kernel flows. For details, we refer to [YO21] and the literature cited there. Here we only stress that, as a step in particular NN-designs, one makes use of a recursive iteration of substitution into a fixed kernel (the kernel resulting from the substitution iterations is called a warped kernel); and then, at each step, there will be an RKHS-optimization of a term that arises as a special case of (2.13), see [YO21, eq (1.2)].
Least-square Optimization
To help readers put our present results on feature space and RKHS penalty-term into context, we now specialize the optimization formula from Theorem 2.5 to the problem of minimizing a “quadratic” quantity . It is still the sum of two individual terms: (i) a -square applied to a difference, and (ii) a penalty term which is the RKHS norm-squared. But the least-square term in (i) will simply be a sum of a finite number of squares of differences; hence “least-squares.” As an application, we then get an easy formula (Theorem 2.6) for the optimal solution.
Let be a positive definite kernel on where is an arbitrary set, and let be the corresponding reproducing kernel Hilbert space (RKHS). Let , and consider sample points:
as a finite subset in , and
as a finite subset in , or equivalently, a point in .
Fix , and consider , defined by
[TABLE]
We introduce the associated dual pair of operators as follows:
[TABLE]
where
[TABLE]
for all .
Note that the duality then takes the following form:
[TABLE]
consistent with (2.12).
Applying Theorem 2.5 to the counting measure
[TABLE]
for the set of sample points , we get the two formulas:
[TABLE]
where denotes the matrix
[TABLE]
Theorem 2.6.
Let , , , and be as above, and let be the induced sample matrix (2.22). Fix ; consider the optimization problem with
[TABLE]
Then the unique solution to (2.23) is given by
[TABLE]
i.e., on .
Proof.
From Theorem 2.5, we get that the unique solution is given by:
[TABLE]
and by (2.20)-(2.21), we further get
[TABLE]
where the dot refers to a free variable in . An evaluation of (2.25) on the sample points yields:
[TABLE]
where , and . Hence
[TABLE]
Now substitute (2.27) into (2.26), and the desired conclusion in the theorem follows. We used the matrix identity
[TABLE]
∎
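The finite-sample case is easy to try numerically. The sketch below assumes the objective is the unnormalized sum of squared residuals at the sample points plus $\beta$ times the squared RKHS norm, in which case the minimizer is a combination of kernel functions at the sample points with coefficients $(K_n + \beta I)^{-1} y$, where $K_n$ is the sample Gram matrix. The Gaussian kernel, the data, and the names `gaussian_gram`, `objective`, and `fitted` are hypothetical choices for illustration.

```python
import numpy as np

def gaussian_gram(s, t, sigma):
    """Gram matrix K(s_i, t_j) = exp(-|s_i - t_j|^2 / (2 sigma^2)) for 1-D points."""
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * sigma ** 2))

def objective(c, K, y, beta):
    """Q(f) for f = sum_j c_j K(., t_j): squared residuals plus beta * ||f||_H^2."""
    residual = K @ c - y
    return residual @ residual + beta * (c @ K @ c)

t = np.linspace(0.0, 1.0, 8)                 # hypothetical sample points
y = np.sin(2.0 * np.pi * t)                  # hypothetical observed values
beta, sigma = 0.1, 0.1                       # hypothetical penalty and bandwidth
K = gaussian_gram(t, t, sigma)               # sample Gram matrix

# Coefficients of the minimizer: solve (K + beta I) c = y.
coef = np.linalg.solve(K + beta * np.eye(len(t)), y)

def fitted(x):
    """Evaluate the minimizer F(x) = sum_j coef_j K(x, t_j)."""
    return gaussian_gram(np.atleast_1d(x), t, sigma) @ coef

print("objective at the minimizer:", objective(coef, K, y, beta))
print("F at the sample points:", fitted(t))
```

Solving the linear system directly, rather than forming the inverse matrix, is the usual numerically safer choice; a larger `beta` shrinks the fit toward zero and a smaller one tracks the data more closely.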
2.2. The Case of Gaussian Fields
For a number of applications, it will be convenient to consider general stochastic processes indexed by , where is merely a set; so not a priori equipped with any additional structure. Consideration of stochastic processes will always assume some fixed probability space, where is a set of sample points; is a σ-algebra of events, fixed at the outset; and is a probability measure defined on . A given process is then said to be Gaussian and centered if and only if (by definition), for every choice of finite subset of (), the system of random variables is jointly Gaussian, i.e., the joint distribution of on is the Gaussian which has mean zero, and covariance matrix
[TABLE]
so for ,
[TABLE]
If is a Borel set, then
[TABLE]
holds. Note we consider the joint distributions for all finite subsets of .
Let be a set, and let be a Gaussian process with , ; and with
[TABLE]
as its covariance kernel. Finally, let be the corresponding RKHS.
Then the following general results hold (see e.g., [AJL11, AJ12, AJS14, AJ15, JT15, JT16a, JT16b, AJL17, JT18a, JT18b]):
- (i) Every positive definite kernel arises as in (2.31) from some Gaussian process .
- (ii) Assume is separable; then we have a representation for a system of functions , ,
[TABLE]
absolutely convergent on .
- (iii) A system satisfies (ii) if and only if it forms a Parseval frame in .
- (iv) Given (2.32), then, for every independent identically distributed (i.i.d.) Gaussian system $\{Z_{n}\}_{n\in\mathbb{N}}$, $Z_{n}\sim N(0,1)$, i.e., each $Z_{n}$ is a standard Gaussian random variable, the representation
[TABLE]
is valid in of the underlying probability space, and on .
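A finite-dimensional illustration of (ii) and (iv) may help. The sketch below restricts a kernel to a finite grid, builds functions `f_n` from the spectral decomposition of the resulting matrix, checks the expansion of the kernel as a sum of products `f_n(s) f_n(t)`, and then forms the process as a sum of `f_n` weighted by i.i.d. standard Gaussians. The kernel `min(s, t)` (the Brownian motion covariance of Example 2.21), the grid, and the number of realizations are assumptions made for the illustration only.

```python
import numpy as np

t = np.linspace(0.1, 1.0, 10)           # hypothetical finite index set
K = np.minimum.outer(t, t)              # covariance kernel K(s, t) = min(s, t)

# Spectral decomposition K = U diag(lam) U^T, then f_n = sqrt(lam_n) * u_n.
lam, U = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)           # guard against tiny negative round-off
F = U * np.sqrt(lam)                    # column n holds the function f_n on the grid

# (ii): K(s, t) = sum_n f_n(s) f_n(t) on the grid.
K_rebuilt = F @ F.T

# (iv): X_t = sum_n f_n(t) Z_n with Z_n i.i.d. N(0, 1).
rng = np.random.default_rng(1)
Z = rng.standard_normal((len(t), 50_000))   # one column per realization
X = F @ Z                                    # each column is a sample path on the grid
K_empirical = (X @ X.T) / Z.shape[1]         # empirical covariance of the paths
```

The exact identity is the deterministic one (`K_rebuilt` against `K`); the empirical covariance only approximates `K` up to sampling error.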
Note 2.7.
When a fixed Gaussian process is given, then the associated decomposition (2.33) is called a Karhunen-Loève (KL) transform for . The conclusion from (i)–(iv) above is that there is a direct connection between the two KL transforms: the one for Gaussian processes, and the relatively better known KL-transforms for positive definite kernels ([JS07] Theorem 4.15).
The object in principal component analysis (PCA) is to find optimal representations; and to select from them the “leading terms”, the principal components.
Remark 2.8.
The present general kernel framework (RKHSs and Gaussian processes) encompasses the special case we outlined in [JS07] Example 3.1. By way of comparison, note that the particular positive definite kernel in the latter example is only a special case of the present ones, see (2.32) and (2.33). These types of kernels are often referred to as the case of Mercer kernels; see also [SY06, SZ07, SZ09a, SZ09b]. A Mercer kernel is continuous, and it defines a trace class operator, as illustrated in the example. This latter feature in turn leads to a well defined “top part of the spectrum.” And this then allows us to select the principal components; i.e., the maximally correlated variables. We shall show, in Theorem 2.11 below, that there is an alternative approach to principal components which applies to the general class of positive definite kernels, and so goes far beyond the case of Mercer kernels.
2.3. Optimization and Frames
Fix a p.d. kernel on , i.e., a functional satisfying (2.1); and let be the associated RKHS. In PCA, one solves the quadratic optimization problem:
[TABLE]
where is a selfadjoint projection.
Kernel PCA, by contrast, solves a similar problem in :
[TABLE]
where
[TABLE]
It is understood that as in (2.35) refers to the Hilbert-Schmidt class in .
A Finite Frame
Let , , and be as above. Then is a finite frame whose span is a closed subspace in .
Set by
[TABLE]
where is the standard ONB in . The adjoint is given by
[TABLE]
It follows that
[TABLE]
Lemma 2.9.
Let be as in (2.39), and let
[TABLE]
be the corresponding spectral representation, with . Then
[TABLE]
Equivalently, the best rank-1 approximation to is
[TABLE]
Remark 2.10.
Note that the conclusion of the lemma yields a solution to the optimization problem we introduced above. Indeed, in the statement of the lemma (see (2.40)) we use the standard notation argmax for the data which realizes a particular optimization. In the present case, we are maximizing a certain quadratic expression over the unit-ball in the Hilbert-Schmidt operators. The Hilbert-Schmidt norm is designated with the subscript HS. Part of the conclusion of the lemma asserts that the maximum, as specified in (2.40), is attained for a definite rank-one operator.
Proof of Lemma 2.9.
Note that
[TABLE]
Let be a unit vector in , and set ; then
[TABLE]
Since , the r.h.s. of (2.41) is a convex combination of ’s; therefore,
[TABLE]
and equality holds if and only if , and , for .
Note that, given (2.41), the conclusion of the lemma follows directly from comparison of positive series. ∎
Lemma 2.9 can be applied inductively, which yields the best rank-1 approximation at each iteration. In fact, the result holds more generally; see Theorem 2.11 below.
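The inductive use of the rank-1 step can be sketched in finite dimensions as follows. The example assumes a positive semidefinite matrix `G` standing in for the operator in (2.39), uses the Frobenius norm as the matrix version of the Hilbert-Schmidt norm, and subtracts the leading term of the spectral representation at each pass. The random matrix and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
G = A @ A.T                                  # hypothetical positive semidefinite matrix

lam, phi = np.linalg.eigh(G)                 # eigenvalues in ascending order
order = np.argsort(lam)[::-1]
lam, phi = lam[order], phi[:, order]         # now lam[0] >= lam[1] >= ...

def rank_k_approximation(k):
    """Sum of the k leading terms lam_j |phi_j><phi_j|."""
    return (phi[:, :k] * lam[:k]) @ phi[:, :k].T

# Deflation: subtract the leading rank-1 term, then repeat on the remainder.
residual = G.copy()
hs_errors = []
for j in range(3):
    residual = residual - lam[j] * np.outer(phi[:, j], phi[:, j])
    hs_errors.append(np.linalg.norm(residual, "fro"))
```

After `j + 1` deflation steps the Frobenius norm of the remainder is the square root of the sum of the squares of the eigenvalues not yet removed, which is the finite-dimensional form of the error one expects from keeping only the leading principal components.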
Our present RKHS/ framework is close to that of [SS16]. But we have included the results we need in our present framework, and for our purpose.
Theorem 2.11.
(a) Let be a Hilbert Schmidt operator, and let
[TABLE]
with , and as . Then
[TABLE]
Note that , where . See [JS07]. (b) Recall the assumption that is Hilbert Schmidt is equivalent to trace class. We then apply the Spectral Theorem to yielding an orthonormal basis of eigenvectors for , and a convergent representation for the infinite sum in (2.42).
Proof.
Let , where each is rank-1, and , for . Then,
[TABLE]
Let be the corresponding Lagrangian, i.e.,
[TABLE]
It follows that
[TABLE]
Hence is a spectral projection of . The conclusion of the theorem follows from this.
Note that one may use the usual variational directional derivative argument applied to a sesquilinear form in any Hilbert space , and obtain
[TABLE]
In particular, if is self-adjoint, then
[TABLE]
Especially, the Hilbert space in (2.43) is the space of Hilbert-Schmidt operators. (This argument is a key step in the proof of the spectral theorem, i.e., taking the variational directional derivative of a bounded bilinear form in Hilbert space.)
The precise meaning of (and the justification for) the generalized gradient assertion (2.43) is included in Lemma 2.12 below. See also the following citations of sources for the underlying operator theory. ∎
The variational arguments in the proof of Theorem 2.11, used in the present setting of Hilbert space, and the variety of projections, can be justified with standard tools from operator theory, including the spectral theorem; see e.g., [JT17, ch.2], [JT21, ch.1], [Con90, ch.2] and also [AK06, BJ02, HKLW07, SZ09a]. We have included more details below:
Lemma 2.12.
Suppose is a Hilbert-Schmidt (HS) operator in a separable Hilbert space . For the directional derivative , we have
[TABLE]
(Here, is the rank-1 operator using Dirac’s notation.)
Proof.
Let . Note that . In fact, if is an orthonormal basis in , then
[TABLE]
But
[TABLE]
where denotes the Hilbert-Schmidt inner product.
Now we use a general fact about inner products:
[TABLE]
where the Hilbert-Schmidt operators, viewed as a Hilbert space, relative to its trace-inner product (2.44). That is, the Schwarz inequality is applied to the -inner product. One obtains maximum when . In particular, if , then and .
Therefore, it follows from that
[TABLE]
which is identified with , since
[TABLE]
∎
Remark 2.13.
Our use of frame analysis serves as a technical tool for “error estimates.” More specifically, for a given problem involving finite-dimensional PCA computations, the frame estimates are designed to allow estimation of the corresponding “error terms.” (See [JS07].)
For our present considerations of optimization questions, it may be of interest to consider a comparison between the following two settings, one general, and the other special: On the one hand, there is a variety of (i) general calculus of variation issues in infinite-dimensional contexts; and on the other, (ii) particular optimization questions in the restricted framework of specific choices of pairs of Hilbert spaces (still infinite-dimensional). We note that the context for (i) is wider than that of (ii). On the other hand (as outlined below), for the specialized framework of (ii) considered here, the use of standard Hilbert space geometry, and a little operator theory, offer much simplification.
Detail: Our present focus is (ii). By contrast, in consideration of (i) one must address separately such technical issues as (a) functional-derivatives, and (b) existence of extremal (optimal) solutions. By contrast, the smaller class of optimization questions from (ii) lend themselves directly to computation in Hilbert space. Here we note that application of Hilbert space geometry will then simplify the two issues (a) and (b) in (i). The point we make for our present optimization question (i.e., a specific choice of a Hilbert space context (ii)), is that a “natural” choice of a pair of Hilbert spaces will be a combination of the two: an appropriate RKHS, and a compatible -Hilbert space. This then allows for relatively simple solutions.
2.4. The Dual Problem
Fix a data set , . Let be the feature map in (2.36), i.e.,
[TABLE]
Let , be the analysis and synthesis operators from (2.37)-(2.38), and
[TABLE]
be the frame operator in (2.39). In particular,
[TABLE]
In view of [JS07] and Theorem 2.11, the KL basis for contains the principal directions carrying the greatest variance in . In applications, it is more convenient to first find the KL basis of instead, where
[TABLE]
as an matrix in ; see (2.2). (By general theory, if is a linear operator in a Hilbert space with dense domain, then .)
Theorem 2.14.
Set by
[TABLE]
and extend linearly, where denotes the standard basis in . Then the adjoint operator is
[TABLE]
That is, and .
Proof.
Let , and , then
[TABLE]
and the assertion follows. ∎
Hence is the Gramian matrix in given by
[TABLE]
see (2.47).
By the singular value decomposition, , so that
[TABLE]
where consists of the non-negative eigenvalues of . Therefore,
[TABLE]
Note that is the KL basis that diagonalizes as in (2.46), i.e.,
[TABLE]
It also follows from (2.51)–(2.52), that
[TABLE]
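The passage from the Gramian to the principal directions can be checked numerically. The sketch below assumes an explicit, finite-dimensional feature matrix `Phi` whose columns play the role of the feature vectors of the sample points, so that `Phi @ Phi.T` stands in for the frame operator and `Phi.T @ Phi` for the Gramian. Each Gramian eigenvector `v_j` with eigenvalue `lam_j` is lifted to `Phi v_j / sqrt(lam_j)`. The dimensions and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 50, 5                       # hypothetical feature dimension and sample size
Phi = rng.standard_normal((d, n))  # column j plays the role of the feature vector of x_j

gram = Phi.T @ Phi                 # n x n Gramian
lam, V = np.linalg.eigh(gram)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]   # descending eigenvalues

# Lift each Gramian eigenvector v_j to a unit eigenvector u_j = Phi v_j / sqrt(lam_j)
# of the d x d operator Phi Phi^T.
U = (Phi @ V) / np.sqrt(lam)

big = Phi @ Phi.T                  # d x d operator, formed here only for verification
```

The practical point is that only the `n x n` eigenproblem is solved; the `d x d` matrix is built here solely to confirm that the lifted vectors are orthonormal eigenvectors with the same nonzero eigenvalues.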
Remark 2.15.
In the above discussion, may be centered by removing its mean. Specifically, let
[TABLE]
be the projection onto , where denotes the constant vector . Then
[TABLE]
and so
[TABLE]
The effect of in (2.56) is to exclude the eigenspace of the Gramian spanned by the constant eigenvector.
In what follows, we shall always assume is centered as in (2.55)-(2.56).
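The centering step can be sketched as follows, assuming it is implemented by the matrix `H = I - (1/n) 1 1^T`, i.e., the orthogonal projection onto the complement of the constant vector. The helper name `center_gram` and the random feature matrix are hypothetical. The check compares the two-sided compression of the Gramian with the Gramian of mean-subtracted features.

```python
import numpy as np

def center_gram(K):
    """Return H K H with H = I - (1/n) 1 1^T (projection off the constant vector)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(4)
Phi = rng.standard_normal((7, 12))          # hypothetical features, one column per sample
K = Phi.T @ Phi
K_centered = center_gram(K)

# The same matrix, obtained by subtracting the mean feature vector first.
Phi_centered = Phi - Phi.mean(axis=1, keepdims=True)
K_from_centered_features = Phi_centered.T @ Phi_centered
```

Because the constant vector lies in the kernel of the centered Gramian, it never appears among the eigenvectors with nonzero eigenvalue.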
2.5. Feature Selection
Feature selection, also called variable selection, or attribute selection, is a procedure for automatic selection of those attributes in data sets which are most relevant to particular predictive modeling problems. Which features should one use in designs of predictive models? This is a difficult question that requires detailed knowledge of the problem at hand. The aim is algorithmic designs which automatically select those features from prescribed data, which are most useful, or most relevant, for the particular problem. The process is called feature selection. A central premise of feature selection is that the input data will contain features that are either redundant or irrelevant, and can therefore be removed. The use of sample correlations in the process is based in turn on the following principle: A particular relevant feature might be redundant, in the presence of some other relevant feature, with which it is strongly correlated.
Our present purpose is not a systematic treatment of feature selection, but merely to identify how our present tools suggest recursive algorithms in the general area. With this in mind we now consider the following setup:
Let be a test example. The image under the feature map can be projected onto the principal directions in , via
[TABLE]
The mapping is in general nonlinear. See Examples 2.17 and 2.18, and Figures 2.1 and 2.2 below. We now turn to our main result which is the following corollary.
Corollary 2.16.
Let be as above, assuming is full rank. For all , the coefficients of the projection are
[TABLE]
Proof.
By (2.54), , so that
[TABLE]
∎
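To see that projection coefficients of a test example can be computed from kernel evaluations alone, the following sketch uses the standard kernel PCA expression `v_j^T k_x / sqrt(lam_j)`, where `k_x` is the vector of kernel values between the sample points and the test point. This expression is an assumption about the form of the corollary's formula, not a transcription of it. For brevity the centering step is omitted, and a polynomial kernel with an explicit three-dimensional feature map is used so that the result can be compared with a direct computation in feature space. All data and names are hypothetical.

```python
import numpy as np

def feature_map(x):
    """Explicit feature vector for the kernel K(x, y) = (x . y)^2 on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def kernel(x, y):
    """Homogeneous quadratic kernel K(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

rng = np.random.default_rng(5)
samples = rng.standard_normal((6, 2))        # hypothetical training points x_1..x_n
x_test = np.array([0.3, -1.2])               # hypothetical test example

gram = np.array([[kernel(a, b) for b in samples] for a in samples])
lam, V = np.linalg.eigh(gram)
order = np.argsort(lam)[::-1][:3]            # the feature space is 3-dimensional
lam, V = lam[order], V[:, order]

# Kernel-only formula: coefficient_j = v_j^T k_x / sqrt(lam_j).
k_x = np.array([kernel(a, x_test) for a in samples])
coeffs_kernel = (V.T @ k_x) / np.sqrt(lam)

# Direct computation in feature space, for comparison.
Phi = np.stack([feature_map(a) for a in samples], axis=1)   # 3 x n feature matrix
U = (Phi @ V) / np.sqrt(lam)                                 # principal directions
coeffs_direct = U.T @ feature_map(x_test)
```

With a Gaussian kernel the feature space is infinite-dimensional and only the kernel-only route is available, which is why that form matters in practice.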
Example 2.17 (Spectral Clustering, see Figure 2.1).
This is included as an instructive example. The data set has two classes: Class “0” consists of points uniformly distributed in ; class “1” consists of 126 points in the open ball . Hence has dimension , and each column of corresponds to a sample point.
We choose the Gaussian kernel with , so that is embedded into the associated RKHS by
[TABLE]
By projecting onto the first two principal components, each sample point has a 2D representation via the mapping
[TABLE]
See the general formula in Corollary 2.16; note that in the present example.
As shown in Figure 2.1b, the two clusters are linearly separable in .
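An experiment of this kind can be sketched as below. The class sizes, radii, and kernel width are hypothetical stand-ins (the values used in the example are not reproduced here), and the inner class is sampled with uniform radius rather than uniform area. The sketch builds the Gaussian Gramian, centers it, and represents each sample point by its coefficients along the two leading principal directions.

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_ring(m, r_min, r_max):
    """m points with radius uniform in [r_min, r_max] and uniform angle."""
    r = rng.uniform(r_min, r_max, m)
    theta = rng.uniform(0.0, 2.0 * np.pi, m)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# Hypothetical stand-ins for the two classes: an outer ring and an inner disk.
outer = sample_ring(150, 2.0, 3.0)
inner = sample_ring(100, 0.0, 1.0)
X = np.vstack([outer, inner])                 # one row per sample point
labels = np.r_[np.zeros(150), np.ones(100)]

sigma = 1.0                                   # assumed kernel width
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / (2.0 * sigma ** 2))

n = len(X)
H = np.eye(n) - np.ones((n, n)) / n           # centering projection
lam, V = np.linalg.eigh(H @ K @ H)
lam, V = lam[::-1][:2], V[:, ::-1][:, :2]     # two leading eigenpairs

# Projection of sample i onto principal direction j is sqrt(lam_j) * V[i, j].
embedding = V * np.sqrt(lam)                  # n x 2 representation
```

Whether the two classes separate linearly in the resulting plane depends on these choices; a scatter plot of `embedding` colored by `labels` is the natural way to inspect it.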
Example 2.18 (Dimension Reduction, see Figure 2.2).
Let be a collection of 100 grayscale images of an ellipse, rotated successively by . Figure 2.2a shows 6 sample images corresponding to different rotation angles. The images are unrolled as column vectors, thus has dimension .
This data set may be viewed as a 1D submanifold embedded in , i.e., it has only one degree of freedom, the rotation angle. For dimension reduction, KPCA will ideally extract this information, and each image is then represented by a single projection coefficient. We choose the Gaussian kernel with . In Figure 2.2b, there are 4 subplots consisting of the projections onto the first, second, third, and fourth principal directions. The rotation angle is encoded in e.g. PC 1. As a consequence, the dimension of the data set is reduced from to .
In particular, the projection coefficients onto the principal direction (see Figure 2.2b, PC1 – PC4) are proportional to the eigenvector of the centered Gramian ; see (2.56) and Corollary 2.16.
2.6. Dynamic PCA
In some applications, Dynamic PCA (DPCA), see e.g., [HAI+07], serves as a useful tool for high-dimensional and time-dependent data. The idea is that, in favorable cases, the input matrices can be augmented by adding time-lagged values of the variables under consideration. In summary, the method is based on the behavior of the eigenvalues of the lagged autocorrelation, and partial autocorrelation, matrices.
Theorem 2.11 states that when a system of PCA eigenvalues is given, then we have an algorithmic solution to the corresponding optimization question (see (2.34)-(2.35)). We now turn to a formula for generating PCA features as a limit of a certain iteration of operators. The family of operators discussed below is a generalization of the operators from Section 2.3 above.
When principal component analysis (PCA) is used for problems involving stochastic (statistical) processes, for example in monitoring applications, it often relies on the implicit assumption that the data involved are time independent. This assumption is unrealistic, as work with industrial data shows: in a host of diverse applications, one is faced with serial correlations. Hence, the literature over recent years (see e.g., [AHR+20, KOL20, Sha20, GS20, Sha19]) has witnessed a host of new approaches to PCA, going by what is now called dynamic PCA (DPCA) methods. DPCA-models involve a variety of stochastic and dynamical systems-tools, each one serving as a remedy for plain-vanilla PCA applied to data as they appear (i) in extremely high dimensions, (ii) in time-dependent data-sets, and (iii) in digital image processing. A main feature of DPCA is time-feedback: the input-matrix is dynamically augmented, thus taking the form of a transfer operator; in the simplest models, it operates by dynamic additions of time-lagged values of the variables.
Proposition 2.19.
Suppose is compact, and has simple spectrum, i.e.,
[TABLE]
with ; as . Setting , then,
[TABLE]
Moreover, given , then
[TABLE]
where convergence in (2.59) is w.r.t. the norm topology of .
Proof.
Note that
[TABLE]
Now, given , we have
[TABLE]
∎
Corollary 2.20.
The system of PCA eigenvalues can be obtained inductively as follows: Let as before, and set . Then
[TABLE]
and
[TABLE]
Proof.
One checks that
[TABLE]
and so the assertion follows from Proposition 2.19. ∎
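The inductive scheme can be sketched for a symmetric positive semidefinite matrix as follows. The sketch assumes the usual power-iteration reading of the proposition (normalized iterates of the operator applied to a generic starting vector converge to the top eigenvector) and the usual deflation step (subtract the rank-1 term just found, then repeat). The function names, the iteration count, and the test matrix, a Brownian motion covariance on four hypothetical time points, are illustrative choices.

```python
import numpy as np

def power_iteration(T, x0, iterations=500):
    """Return (eigenvalue estimate, unit vector) from repeated application of T."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iterations):
        y = T @ x
        x = y / np.linalg.norm(y)
    return float(x @ T @ x), x

def leading_eigenpairs(T, k, seed=0):
    """Extract k leading eigenpairs by power iteration followed by deflation."""
    rng = np.random.default_rng(seed)
    T_j = T.copy()
    values, vectors = [], []
    for _ in range(k):
        lam, phi = power_iteration(T_j, rng.standard_normal(T.shape[0]))
        values.append(lam)
        vectors.append(phi)
        T_j = T_j - lam * np.outer(phi, phi)   # remove the component just found
    return np.array(values), np.column_stack(vectors)

t = np.array([0.5, 1.0, 1.5, 2.0])             # hypothetical time points
T = np.minimum.outer(t, t)                     # Brownian-motion covariance matrix
values, vectors = leading_eigenpairs(T, 3)
```

The rate of convergence at each stage is governed by the ratio of consecutive eigenvalues, so a fixed iteration count is adequate only when the spectrum is well separated, as it is for this small matrix.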
Example 2.21.
Consider the covariance function of standard Brownian motion , , i.e., a Gaussian process with mean zero and covariance function
[TABLE]
Let be a finite subset of , such that
[TABLE]
and let
[TABLE]
Lemma 2.22.
Let be as in (2.61). Then
- (i) The determinant of is given by
[TABLE]
- (ii) assumes the LU decomposition
[TABLE]
where
[TABLE]
Proof.
For details, see e.g., [JT15]. ∎
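For the kernel `min(s, t)` sampled at ordered points `0 < t_1 < ... < t_n`, two standard identities can be verified numerically: the determinant equals the product of the increments `t_1 (t_2 - t_1) ... (t_n - t_{n-1})`, and the matrix factors as `L L^T` with a lower-triangular `L` whose column `j` is constant and equal to `sqrt(t_j - t_{j-1})` on and below the diagonal. The sketch below checks both; the factorization shown is the Cholesky form, offered as one instance of a triangular decomposition and not necessarily the normalization used in the lemma. The sample points are hypothetical.

```python
import numpy as np

t = np.array([0.3, 0.7, 1.2, 2.0, 2.5])       # hypothetical points 0 < t_1 < ... < t_n
K = np.minimum.outer(t, t)

increments = np.diff(t, prepend=0.0)          # t_1, t_2 - t_1, ..., t_n - t_{n-1}
det_formula = np.prod(increments)

# Lower-triangular factor: every entry on or below the diagonal in column j
# equals sqrt(t_j - t_{j-1}), so that K = L L^T.
L = np.tril(np.ones((len(t), len(t)))) * np.sqrt(increments)
```

The factorization reflects the independent increments of Brownian motion: row `i` of `L` sums the increments up to time `t_i`.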
Example 2.23.
Fix , then the top eigenvalue of can be extracted by the method from Proposition 2.19. For instance, if and , we have
[TABLE]
Let , then
[TABLE]
A standard numerical algorithm returns
[TABLE]
Acknowledgement.
The co-authors thank the following colleagues for helpful and enlightening discussions: Professors Daniel Alpay, Sergii Bezuglyi, Ilwoo Cho, Wayne Polyzou, David Stewart, Eric S. Weber, and members in the Math Physics seminar at The University of Iowa. The present work was started with discussions at the NSF-CBMS Conference, Harmonic Analysis: Smooth and Non-Smooth, held at Iowa State University, June 4-8, 2018. Jorgensen was the main speaker. We are grateful to the organizers, especially to Prof Eric Weber, and to the NSF for financial support. Sooran Kang was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (#NRF-2017R1D1A1B03034697 and #NRF-2020R1F1A1A01076072).
References
- [ACP13] Ery Arias-Castro and Bruno Pelletier, On the convergence of maximum variance unfolding, J. Mach. Learn. Res. 14 (2013), 1747–1770. MR 3104494
- [AFS18] Carlos M. Alaíz, Michaël Fanuel, and Johan A. K. Suykens, Convex formulation for kernel PCA and its use in semisupervised learning, IEEE Trans. Neural Netw. Learn. Syst. 29 (2018), no. 8, 3863–3869. MR 3854652
- [Agg18] C.C. Aggarwal, Machine learning for text, Springer International Publishing, 2018.
- [AH11] Trine Julie Abrahamsen and Lars Kai Hansen, A cure for variance inflation in high dimensional kernel principal component analysis, J. Mach. Learn. Res. 12 (2011), 2027–2044. MR 2819026
- [AHR+20] Mohammad Reza Askari, Iman Hajizadeh, Mudassir Rashid, Nicole Hobbs, Victor M. Zavala, and Ali Cinar, Adaptive-learning model predictive control for complex physiological systems: automated insulin delivery in diabetes, Annu. Rev. Control 50 (2020), 1–12. MR 4188895
- [AJ12] Daniel Alpay and Palle E. T. Jorgensen, Stochastic processes induced by singular operators, Numer. Funct. Anal. Optim. 33 (2012), no. 7-9, 708–735. MR 2966130
- [AJ15] Daniel Alpay and Palle Jorgensen, Spectral theory for Gaussian processes: reproducing kernels, boundaries, and $L^{2}$-wavelet generators with fractional scales, Numer. Funct. Anal. Optim. 36 (2015), no. 10, 1239–1285. MR 3402823
- [AJL11] Daniel Alpay, Palle Jorgensen, and David Levanony, A class of Gaussian processes with fractional spectral measures, J. Funct. Anal. 261 (2011), no. 2, 507–541. MR 2793121
