This paper introduces a novel approach to unsupervised dimension reduction by reformulating it with tempered distributions, establishing a connection with sufficient dimension reduction, and proposing algorithms for optimization and practical implementation.
Contribution
It presents a new formulation of dimension reduction using tempered distributions, linking it to SDR, and develops algorithms including an efficient approximate method for practical use.
Findings
01
Algorithms successfully reduce data dimensionality in synthetic and real datasets.
02
The approach effectively approximates empirical distributions with low-dimensional support.
03
The method demonstrates comparable or improved performance over existing techniques.
Abstract
We reformulate unsupervised dimension reduction problem (UDR) in the language of tempered distributions, i.e. as a problem of approximating an empirical probability density function by another tempered distribution, supported in a k-dimensional subspace. We show that this task is connected with another classical problem of data science -- the sufficient dimension reduction problem (SDR). In fact, an algorithm for the first problem induces an algorithm for the second and vice versa. In order to reduce an optimization problem over distributions to an optimization problem over ordinary functions we introduce a nonnegative penalty function that ``forces'' the support of the model distribution to be k-dimensional. Then we present an algorithm for the minimization of the penalized objective, based on the infinite-dimensional low-rank optimization, which we call the alternating scheme.âŚ
Tables6
Table 1. Table 1: The cross-validated accuracies/R 2 of KNN on 2 or 3-dimensional input representations.
Table 2. Table 2: Original images (the first row) and their projections to 1-dimensional subspaces computed by PCA (the second row) and computed by Gauss (the third row).
Table 3. Table 3: Original image, projected image and the difference between them (Gauss).
Table 4. Table 4: Original image, its projection, and their grayscale difference (Gauss).
Table 5. Table 5: Computed backgrounds are almost identical, though noise is more often in PCAâs output.
Input images
A background computed by PCA
A background computed by Gauss
Table 6. Table 6: Measures of consistency with the ground truth background image for the SBMnet dataset.
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
We reformulate unsupervised dimension reduction problem (UDR) in the language of tempered distributions, i.e. as a problem of approximating an empirical probability density function by another tempered distribution, supported in a k-dimensional subspace. We show that this task is connected with another classical problem of data science â the sufficient dimension reduction problem (SDR). In fact, an algorithm for the first problem induces an algorithm for the second and vice versa.
In order to reduce an optimization problem over distributions to an optimization problem over ordinary functions we introduce a nonnegative penalty function that âforcesâ the support of the model distribution to be k-dimensional.
Then we present an algorithm for the minimization of the penalized objective, based on the infinite-dimensional low-rank optimization, which we call the alternating scheme.
Also, we design an efficient approximate algorithm for a special case of the problem, where the distance between the empirical distribution and the model distribution is measured by Maximum Mean Discrepancy defined by a Mercer kernel of a certain type.
We test our methods on four examples (three UDR and one SDR) using synthetic data and standard datasets.
Linear dimension reduction (LDR) is a family of problems in data science that includes principal component analysis, factor analysis, linear multidimensional scaling, Fisherâs linear discriminant analysis, canonical
correlations analysis, sufficient dimension reduction (SDR), maximum autocorrelation factors, slow feature analysis and more. In unsupervised dimension reduction (UDR) we are given a finite number of points in Rn (sampled according to some unknown distribution) and the goal is to find a âlow-dimensionalâ manifold (e.g. an affine or a linear subspace) that approximates âthe supportâ of the distribution. UDR, historically, was approached by linear methods and, therefore, has developed into a set of standard tools in data science.
Though non-linear dimensionality reduction (a.k.a. the manifold learning) techniques gained a wide popularity in modern research, the potential of linear methods is far from being exhausted. For high-dimensional datasets, due to the phenomenon of concentration of measure [1], LDR often can give us an interpretable and low-dimensional representation of data. The linearity of a projection operator is a restrictive property that allows avoiding the overfitting in the dimension reduction (which is a key problem for the manifold learning techniques).
The LDR study field currently achieved a saturation level at which unifying frameworks for the problem become of special interest. One of such frameworks, that covers many cases of LDR, frames LDR as the optimization task over matrix manifolds such as the Stiefel manifold [2].
Elements of the Stiefel manifold Vkâ(Rn) are orthogonal k-frames OâRnĂk whose column space is the k-dimensional space onto which a dataset is projected. Different loss functions on Vkâ(Rn) define different versions of LDR. Table 1 of [2] lists fourteen common LDR techniques (such as principal component analysis, multi-dimensional scaling, linear discriminant analysis etc), nine of which are formulated over Stiefel manifolds. Such a general treatment allows to approach all LDR problems by a single algorithm, i.e. by an adaptation of the gradient descent method to Stiefel manifolds [3]. This adaptation consists of a series of projected gradient steps where a common gradient descent is followed by a projection onto a Stiefel manifold, which is equivalent to the computation of a singular value decomposition of a current point.
Note that the Stiefel manifold is a non-convex set and even minimizing a convex function on such a manifold is an NP-hard task, in general. Although, in applications, the projected gradient descent method demonstrated a relatively fast convergence to good quality solutions.
The paperâs main contribution is a development of a novel view of LDR.
First, an argument over which we search in an optimization task is not a k-frame, but a tempered distribution (which is a generalization of a probabilistic distribution) that is concentrated on a k-dimensional linear subspace of Rn. Thus, an argument has a more complex structure, it includes not just a k-dimensional subspace, but also a distribution on that subspace.
The justification of our optimization framework uses the theory of generalized functions, or tempered distributions [4, 5].
An important generalized function that cannot be represented as an ordinary function is the Dirac delta function, denoted δ, and δn denotes its n-dimensional version.
This more general formulation allows us to analyze new types of objectives for LDR. In Section 4 we list four examples of such objectives that, to our knowledge, have not been considered in the LDR field so far. A notable specifics of such objectives is that, even for a fixed k-dimensional subspace L, finding an optimal distribution supported in L is a non-trivial optimization task. In other words, our problems can not be simply reduced to the previous formalisms based on the Stiefel manifold, or the Grassmannian [6, 7].
Let us briefly describe an optimization problem that motivates the new formalism.
Any dataset {xiâ}i=1NââRn naturally corresponds to the distribution
[TABLE]
which, with some abuse of terminology, can be called the empirical probability density function.
Based on that, UDR can be understood as a task whose goal is to approximate
pempâ(x) by q(x), where q(x) is a distribution whose density is supported in a k-dimensional linear subspace LâRn. Note that a function whose density is supported in some low-dimensional subset of Rn is not an ordinary function. An exact definition of a set of such distributions, denoted by Gkâ, is given in Section 3.
To formulate an optimization task we additionally need a loss D(pempâ,q) that measures the distance between the ground truth pempâ and a distribution q, that we search for. Thus, in our approach, the UDR problem is defined as
[TABLE]
under the condition that q(x) has a k-dimensional support. In most of our statements we do not consider any specific loss functions, though in our basic examples we deal with the Maximum Mean Discrepancy distance or the Wasserstein distance.
The UDR and SDR. Within our formalism the sufficient dimension reduction problem is tightly connected with the UDR problem. In the SDR, given supervised data, the goal is to find the so called effective subspace, defined by its orthogonal basis (or, a k-frame) {w1â,âŻ,wkâ}âRn , such that the regression function can be searched in the form g(w1Tâx,âŻ,wkTâx). In literature, these functions are known under different
names, e.g. functions with low effective dimensionality [8], functions with active subspaces [9]
and multi-ridge functions [10, 11].
In [12] it was shown that a method originally developed for the SDR can be turned into a UDR method, i.e. applied to unsupervised data, by simply setting an output to be equal to an input.
In such methods for the SDR problem as the Sliced Inverse Regression [13], the Principal Hessian Direction [14], the Sliced Average Variance Estimation [15], an effective subspace is recovered from the Singular Value Decomposition applied to a certain matrix that is constructed from a training set in a straightforward way. Other methods, such as the Principal Fitted Components [16], the Likelihood Acquired
Direction [17], the Kernel Dimensionality Reduction [18], are based on analytic expressions measuring the affinity of a k-dimensional subspace to the effective subspace. The second type of methods reduce the SDR problem to an optimization problem over the Stiefel manifold, or the Grassmanian. For other methods we refer to a tutorial on SDR methods [19]. Again, an important aspect of all these methods is that, given a fixed effective subspace, the regression function that predicts an output variable has a relatively straightforward structure and is not optimized by any additional supervised learning procedure. The key novelty that our framework brings to the SDR is that we suggest to search for an effective subspace and a regression function in a joint manner.
The key observation of our analysis, stated in Theorem 2, is that a class of functions of the form g(w1Tâx,âŻ,wkTâx) can be characterized as functions whose Fourier transform is supported in the corresponding effective subspace. In other words, functions with an effective dimensionality k are dual to Gkâ under the Fourier transform. Three examples of UDR problems that we give in Section 4 are cast as (2), whereas in the fourth example we formulate SDR as an optimization task with the search space dual to that of UDR (to distinguish our formulation from a general SDR problem we call it an SDR with optimized regression function). Thus, all four examples can be studied within our optimization framework.
Besides the problem setup we also suggest a general algorithm that tackles it.
The basic idea of that algorithm, which we call the alternating scheme, instead of optimizing over Gkâ, to optimize over ordinary functions with a penalty added to an objective that forces the ordinary functionâs support to be low-dimensional.
The penalty based reformulation.
The starting point of our approach is to reduce the task (2) to the minimization of I(q)+ÎťR(q) over ordinary functions q. We define the penalty function R(q) in such a way that forcing R(q) to be small is equivalent to forcing âthe supportâ of q to be k-dimensional.
Our definition of R is based on using a positive definite kernel M:RnĂRnâC.
First we note that M defines a billinear form on pairs of (possibly, generalized) functions by â¨fâŁMâŁgâŠ=âŤRnĂRnâf(x)âM(x,y)g(y)dxdy. On a properly defined space of (generalized) functions, the billinear form â¨â âŁMâŁâ ⊠is the hermitian inner product, using which one can define distances and other geometrical notions on that space. Note that if f and g are probability density functions and M is real-valued, the corresponding distance function, i.e. distMâ(f,g)=(â¨fâŁMâŁfâŠââ¨gâŁMâŁgâŠâ2â¨fâŁMâŁgâŠ)1/2 coincides with the maximum mean discrepancy metric [20].
We define R(q) as
[TABLE]
where
Mqâ=Re[â¨xiâq(x)âŁMâŁxjâq(x)âŠâ]i,j=1,nââ
and Îť1â(Mqâ)âĽÎť2â(Mqâ)âĽâŻ are ordered eigenvalues of the matrix Mqâ. The sum of all but first k eigenvalues of a positive semidefinite matrix A is a well-known penalty function, denoted by âĽAâĽnâkâ and called a Ky Fan nâk-antinorm. Applications of the Ky Fan nâk-antinorm to low-rank optimization problems can be found in [21, 22, 23, 24] and its properties are studied in [25].
over ordinary functions. An analysis that we make in Subsection 5.3 of Section 5 (based on theory of tempered distributions) shows that if the kernel M is chosen from a class of so called proper kernels and the solution of (2) satisfies certain regularity conditions, the solution of the task (4) for Îťâ+â will approach the solution of (2).
The alternating scheme. The task (4) can be understood as an infinite dimensional low-rank optimization task in which the penalty term forces the matrix Mqâ to be of rank k. In Section 6 we prove that Mqâ=SqâSqâ â where Sqâ is a linear operator between a suitable space H and Rn that itself depends on q linearly, and this automatically gives us that R(q)=minSââĽSqââSâĽâ2â where the minimum is taken over all operators between H and Rn of rank k and âĽâ âĽââ is a suitable norm on the space of bounded linear operators from H to Rn.
Then, a natural idea to solve the task (4) is to present it as the joint minimum qminâS:rankS=kminâI(q)+ÎťâĽSqââSâĽâ2â and to minimize the objective over q and over S of rank k in an alternating fashion, i.e.
[TABLE]
This algorithm, called the alternating scheme, is suitable for a practical implementation due to the fact that the second step of it, i.e. the optimization over S of rank k, is solvable analytically. In fact, Sl+1â is the Singular Value Decomposition of Sql+1ââ truncated at k-th term.
In E we give an algorithm whose every step is equivalent to a corresponding step of the alternating algorithm, but it operates on Fourier transforms of functions rather than on functions of initial coordinates. Numerical specifications of the alternating scheme for different special cases of UDR/SDR problems are given in G,  H,  I and J. In Section 8 we describe results of our experiments with the alternating scheme that we conducted for various synthetic and practical datasets. As a result we conclude that the alternating scheme is a practical algorithm that can be applied to datasets of moderate size. For the SDR tasks its performance is comparable with classical algorithms.
An approximate algorithm for a special case. For a special case of the task (2), where the distance function is the Maximum Mean Discrepancy with the kernel of the form K(x,y)=(xâ y)H(x,y) and H is itself a Mercer kernel,
we develop an approximate algorithm that can be applied to large datasets. In Section 7 we demonstrate that a solution with provable approximation ratios is given by the following simple procedure: given a dataset {xiâ}i=1Nâ we build a data matrix X=[x1â,âŻ,xNâ], a Gram matrix G=[H(xiâ,xjâ)], and output first k principal components of the matrix XGXT. This algorithm is tested on Yale B dataset for the shadow/black removal and SBMnet datasets for the background modeling. In both applications, our approximate algorithm showed a performance comparable to the performance of other low-rank approximation algorithms.
The structure of the paper is as follows. In Section 2 we give some notations and define standard notions from functional analysis that we use throughout the paper.
In Section 3 we formally define the search space in Problem 2, denoted Gkâ, and an image of Gkâ under the Fourier transform, denoted Fkâ. In Section 4 we formulate some UDR/SDR problems as optimization tasks over Gkâ/Fkâ.
Instead of searching directly in a set of generalized functions, Gkâ,
in Section 5 we describe how we substitute an ordinary function for a distribution in the optimization task at the expence of adding a new penalty term to its objective, ÎťR(f). Using a kernel M(x,y), Theorem 4 characterizes generalized gâGkâ as such g for which the matrix of properly defined integrals Mgâ=Re[âŹRnĂRnâxiâyjâg(x)âM(x,y)g(y)dxdyâ]i,j=1,nââ is of rank k.
In Section 6 we suggest a method for solving minĎâI(Ď)+ÎťR(Ď) which we call the alternating scheme. In Section 7 we describe a simple approximate algorithm for the task (2) in a special subcase of the Maximum Mean Discrepancy distance and prove some theoretical guarantees on the approximation ratio of this algorithm.
Section 8 is dedicated to experiments with the alternating scheme on synthetic and real world data and with the approximate algorithm on the shadow/black removal and the background modeling applications. Proofs of all theorems and lemmas are given after their formulations, or can be found in the appendix.
1.1 Related work
As was already mentioned, another unifying framework for LDR tasks is suggested by [2] in which the basic search space is the Stiefel manifold Vkâ(Rn).
The main advantage of the Stiefel manifold over Gkâ is that its elements are finite-dimensional. Because a distribution from Gkâ is an infinite-dimensional object, an optimization over Gkâ requires additional constructions to turn it into a finite-dimensional task. Both an optimization over Gkâ and over Vkâ(Rn) is typically hard: for a final point, at best one can guarantee that it is a local extremum. Promising aspects of Gkâ are: a) Gkâ allows to formulate a new class of objectives naturally on it, b) local extrema on Gkâ substantially differ from local extrema on Vkâ(Rn), because a local search over Gkâ uses more degrees of freedom.
There is plenty of literature on the SDR problem some of which was already mentioned. In [26] the Fourier transform was applied for estimating the effective subspace in SDR, implicitly using an analog of Theorem 2.
The closest to ours is a recent approach of [27], where an effective subspace was computed in a two step process. First, given supervised data, a regression function was trained in the form of a neural network (with a general architecture), then the obtained regression function was approximated by another neural network with a bottleneck architecture (by which a low effective dimensionality is guaranteed by construction). Like in this approach, we train a regression function as a neural network, though we search for it and an effective subspace jointly. In our approach, it is a regularization term R(f) that forces the neural network to have a low effective dimensionality.
Using Ky Fan k-antinorm as a regularizer for the matrix completion problem has been suggested by [21] and further developed in [22, 23, 24]. Unlike this chain of works, we formulate an infinite-dimensional task and our regularizer R(f)=âĽMfââĽnâkâ is a sum of smallest nâk squared singular values of the infinite-dimensional operator Sfâ where Sfâ depends on f linearly and Mfâ=SfâSfâ â. Thus, our algorithms are substantially different from algorithms designed within the latter approach. The idea of alternating two basic stages, convex optimization and SVD, is ubiquitous in low-rank optimization, see e.g. [28, 29].
2 Preliminaries and notations
Throughout this paper we use standard terminology and notation from functional analysis.
For details one can address the textbook on the theory of distributions [30]. The Schwartz space, denoted by S(Rn), is a space of infinitely differentiable functions f:RnâC such that âÎą,βâNn,supxâRnââŁxÎąDβf(x)âŁ<â, and equipped with standard topology.
Its dual space is denoted by Sâ˛(Rn) and is equipped with weak topology.
For a tempered distribution TâSâ˛(Rn) and ĎâS(Rn), â¨T,Ď⊠denotes T(Ď). Thus, for a sequence {fsâ}âSâ˛(Rn) and fâSâ˛(Rn), limsâââfsâ=f (or, fsâââf) means that limsââââ¨fsâ,ĎâŠ=â¨f,Ď⊠for any ĎâS(Rn). For a sequence {fsâ}s=1âââSâ˛(Rn), sââLimâfsâ denotes a set of points fâSâ˛(Rn), such that there exists a growing sequence {siâ}âN and limiâââfsiââ=f.
The Fourier and inverse Fourier transforms are denoted by F,Fâ1:Sâ˛(Rn)âSâ˛(Rn). For brevity, we denote F[f] by f^â. If all required conditions are satisfied, an integrable f:RnâC (or, a Borel measure Îź on Rn) is used as the tempered distribution Tfâ (or, TÎźâ) where â¨Tfâ,ĎâŠ=âŤRnâf(x)Ď(x)dx (or, â¨TÎźâ,ĎâŠ=âŤRnâĎ(x)dÎź). For ΊâS(Rn), Ί denotes the sequential closure of Ί with respect to standard topology of S(Rn). For ΊâSâ˛(Rn), Ίâ denotes the sequential closure of Ί with respect to weak topology of Sâ˛(Rn).
For ĎâS(Rn),TâSâ˛(Rn), the convolution is defined as a tempered distribution ĎâT such that â¨ĎâT,ĎâŠ=â¨T,Ď~ââĎ⊠where Ď~â(x)=Ď(âx). If TâSâ˛(Rn) and a function Ď is such that Ď(x)Ď(x)âS(Rn) whenever Ď(x)âS(Rn), then the multiplication ĎT is defined by â¨ĎT,ĎâŠ=â¨T,ĎĎâŠ.
Given a measure Îź, by L2,Îźâ(Rn) we denote the complex L2â-space with the inner product â¨u,vâŠL2,Îźââ=âŤu(x)âv(x)dÎź. The induced norm is then âĽuâĽL2,Îźââ=â¨u,uâŠL2,Îźâââ. If dÎź=p(x)dx, then L2,Îźâ is denoted by L2,pâ.
A set of infinitely differentiable functions in Rn is denoted by Câ(Rn). A set of infinitely differentiable functions with compact support in Rn is denoted by Ccââ(Rn).
If T is a topological space, then a subset SâT is said to be dense in T if the sequential closure of S is equal to T.
For a square matrix A, Tr(A) denotes its trace and for an arbitrary matrix, âĽAâĽFâ=defTr(ATA)â.
The identity matrix of size n is denoted by Inâ. The notation fâg means f=cg where c is some universal constant.
3 Basic function classes
To formalize distributions supported in a k-dimensional subspace, we need a number of standard definitions [31]. For Ď1ââS(Rk) and Ď2ââS(Rnâk), their tensor product is the function Ď1ââĎ2ââS(Rn) such that (Ď1ââĎ2â)(x,y)=Ď1â(x)Ď2â(y). The span of {Ď1ââĎ2ââŁĎ1ââS(Rk),Ď2ââS(Rnâk)}, denoted by S(Rk)âS(Rnâk), is called the tensor product of S(Rk) and S(Rnâk). For g1ââSâ˛(Rk) and g2ââSâ˛(Rnâk), their tensor product is defined by the following rule: â¨g1ââg2â,Ď1ââĎ2ââŠ=â¨g1â,Ď1ââŠâ¨g2â,Ď2â⊠for any Ď1ââS(Rk),Ď2ââS(Rnâk). Since S(Rk)âS(Rnâk)â=S(Rn), there is only one distribution g1ââg2ââSâ˛(Rn) that satisfies the identity.
An example of a generalized function, whose density is concentrated in a k-dimensional subspace, is any distribution that can be represented as
g\otimes\delta^{n-k}\overset{def}{=}g\otimes\underbrace{\delta\otimes\cdots\otimes\delta}_{\text{n-k times}}
where gâSâ˛(Rk). If g=Tfâ, where f:RkâR is an ordinary function, then gâδnâk can be understood as a generalized function whose density is concentrated in a subspace {xâRnâŁxiâ=0,i>k} and equals f(x1:kâ). It can be shown that the distribution acts on ĎâS(Rn) in the following way:
[TABLE]
Now to generalize the latter definition to any k-dimensional subspace we have to introduce a change of variables in tempered distributions.
Let gâSâ˛(Rn) and UâRnĂn be an orthogonal matrix, i.e. UTU=Inâ. Then, gUââSâ˛(Rn) is defined by the rule: â¨gUâ,ĎâŠ=â¨g,Ď⊠where Ď(x)=Ď(UTx). If g=Tfâ, the latter definition gives gUâ=Tfâ˛â where fâ˛(x)=f(Ux). Now, we define classes of tempered distributions:
[TABLE]
[TABLE]
and
[TABLE]
where O(n)={UâRnĂnâŁUTU=Inâ}.
The first two classes are related as:
Theorem 1**.**
Gkâ˛â=Gkâââ.
The last two classes are isomorphic under the Fourier transform.
Theorem 2**.**
F[Gkâ]=Fkâ* and Fâ1[Fkâ]=Gkâ.*
Proof.
Let us prove first that if g=Tfââδnâk, then
F[g]=Trâ,
where r(x)=f^â(x1:kâ),xâRn. For that we have to prove that â¨F[g],ĎâŠ=â¨Trâ,Ď⊠for any ĎâS(Rn). Indeed,
[TABLE]
Let us calculate the image of Gkâ under the Fourier transform. It is easy to see that for any gâSâ˛(Rn),ĎâS(Rn) and orthogonal UâRnĂn we have:
[TABLE]
Therefore, F[gUâ]=(F[g])Uâ.
Thus, if g=Tfââδnâk, then
[TABLE]
where râ˛(x)=r(Ux)=f^â(Ukâx) and UkââRkĂn is a matrix consisting of first k rows of U. Thus, Trâ˛ââFkâ.
Let us show that by varying fâS(Rk) and U in the expression f^â(Ukâx) we can obtain any function from Fkâ. For this, it is enough to show that Fkâ is equivalent to the following set of functions:
[TABLE]
The fact QâFkâ is obvious.
Let us now prove that Qâ{g(Px)âŁgâS(Rk),PâRkĂn,rankP=k}=Fkâ. Indeed, if f(x)=g(Px), then f(x)=gâ˛(Ukâx) where Ukâ=(PPT)â1/2P and gâ˛(y)=g((PPT)1/2y). By construction, UkâUkTâ=Ikâ and gâ˛âS(Rk). Thus, Q=Fkâ.
Therefore, F[Gkâ]=Fkâ, and from the bijectivity of the Fourier transform we obtain Fâ1[Fkâ]=Gkâ.
â
For any collection f1â,âŻ,flââSâ˛(Rn), spanRâ{fiâ}1lâ denotes {âi=1lâÎťiâfiââŁÎťiââR}âSâ˛(Rn), which is a linear space over R.
The set Gkâ˛â has the following simple characterization:
Theorem 3**.**
For any TâSâ˛(Rn), TâGkâ˛â if and only if
[TABLE]
Informally, the theorem holds because any linear dependency Îą1âx1âT+âŻ+ÎąnâxnâT=0 over R implies that if Îą1âx1â+âŻ+Îąnâxnâî =0, then T=0. This is equivalent to a statement that the support of T is concentrated on a subspace Îą1âx1â+âŻ+Îąnâxnâ=0. If dimspanRâ{x1âT,x2âT,âŻ,xnâT}â¤k, then one can find nâk such dependencies, which means that the support of T is k-dimensional.
Let B(Rn) denote the Borel sigma-algebra on Rn and P denote a set of all Borel probability measures on Rn. Let us now define
[TABLE]
i.e. Pkâ is a set of probability measures with all probability concentrated in some subspace span(v1â,âŻ,vkâ) whose dimension is not greater than k. It is easy to see that TÎźââGkâ˛â for any ÎźâPkâ.
4 Examples of LDR formulations
Maximum mean discrepancy PCA (MMD-PCA)
Let K:RnĂRnâR be a continuous Mercer kernel, and HKâ be a reproducing kernel Hilbert space (RKHS) defined by K.
The kernel K(x,y) defines the so-called kernel embedding of probability measures ĎÂ [32]:
[TABLE]
The Maximum Mean Discrepancy (MMD) distance [20] is defined as the distance induced by metrics on
HKâ, i.e. for two probability measures Îź,νâP,
[TABLE]
Let x1â,âŻ,xNââRn be the dataset of points.
This dataset defines the empirical probabilistic measure Îźdataâ that corresponds to the tempered distribution
TÎźdataââ=N1ââi=1Nâδn(xâxiâ).
We shall study a method concurrent to PCA that is based on solving the following problem:
[TABLE]
i.e. we shall attempt to approximate the empirical probabilistic measure Îźdataâ with another probabilistic measure ν which is supported in some k-dimensional subspace of Rn.
To our knowledge, the task (17) has not been yet considered in the research field of LDR.
Example 1** (Gaussian MMD-PCA).**
Let k(x)=Ghnâ(x)
where
Ghnâ(x)=(2Ďh2)n/2eâ2h2âĽxâĽ2ââ
is the radial Gaussian kernel on Rn and K(x,y)=(kâk)(xây)=G2hnâ(x). For such a kernel, we have
[TABLE]
where Ď(Îź)=âŤk(xây)dÎź(y) is just a smoothing of the distribution Îź via the Weierstrass trasform.
In this example, as hâ+0, the optimal measure νâ=argminνâPkâââĽĎ(Îźdataâ)âĎ(ν)âĽL2â(Rn)â is supported in a k-dimensional subspace that contains the largest possible number of points from {x1â,âŻ,xNâ}.
Khachiyan demonstrated [33] that the following problem is NP-hard: given {x1â,âŻ,xNâ}âRn, find an nâ1-dimensional subspace of Rn that contains at least (1âÎľ)(1ân1â)N points from the dataset. This indicates that in the regime hâ+0, the task (17) is NP-hard. In other words, it is unlikely that the task admits an efficient algorithm, in general. In G we describe an algorithm for the Gaussian MMD-PCA.
The higher moments PCA (HM-MMD-PCA)
Another natural approach to measuring the similarity of two distributions is based on the difference between moments:
[TABLE]
where mi1ââŻisââ=EXâźÎźâ[X[i1â]âŻX[isâ]] and ni1ââŻisââ=EXâźÎ˝â[X[i1â]âŻX[isâ]] are corresponding moments. The positive parameters Îť1â, Îť2â, Îť3â, Îť4â are chosen to fix the relative importance of the mean, the co-variance, the co-skewness and the co-kurtosis.
Thus, we will be interested in the following optimization task (analogous to 17):
[TABLE]
If we set Îť2â=1 and Îť1â=Îť3â=Îť4â=0, then the solution of the task (20) coinsides with the solution of the classical PCA.
Let us briefly demonstrate that. Let X=[x1â,âŻ,xNâ] be the data matrix, Y=[y1â,âŻ,yNâ] be the SVD of X truncated at k-th term, Ďiâ(X) be an ith singular value of X. By Îźpcaâ we denote a probabilistic measure concentrated in points {yiâ}i=1Nâ.
In that case we have d\textscHMâ(Îźdataâ,Îźpcaâ)2=âĽN1âXXTâN1âYYTâĽF2â=N21ââi=k+1min{N,n}âĎi4â(X). But for any νâPkâ the covariance matrix cov(ν)=[ExâźÎ˝âxiâxjâ]i,jâ[n]â is of rank k. Therefore, by Eckart-Young-Mirskyâs theorem, we have d\textscHMâ(Îźdataâ,ν)2=âĽN1âXXTâcov(ν)âĽF2ââĽN21ââi=k+1min{N,n}âĎi4â(X). Thus, the minimum of d\textscHMâ(Îźdataâ,ν) is attained at ν=Îźpcaâ.
Thus, the task (20) can be considered as a direct generalization of PCA that takes into account higher moments. Note that the distance based on higher moments is a special case of maximum mean discrepance metric, where K(x,y)=âs=14ânsÎťsââ(xâ y)s. That is why we denote the task as HM-MMD-PCA. In Section 7 we prove that there is an efficient 2-approximating algorithm for the HM-MMD-PCA.
In H we additionally describe another algorithm for the HM-MMD-PCA based on a generic alternating scheme.
Wasserstein distance PCA (WD-PCA)
Another significant distance between probability measures with the origins in the transport theory is the Wasserstein distance (see [34]).
Let (Rn,âĽâ âĽ) be a
Banach space and pâĽ1. Between any two Borel probability measures Îź,ν on Rn with âŤâĽxâĽpdÎź<â and âŤâĽxâĽpdν<â the pth Wasserstein distance is:
[TABLE]
where Π(Ο,ν) is a set of all couplings of Ο and ν. The Wasserstein distance defines another version of LDR problem:
[TABLE]
In the B one can find proofs that in the case of l1â norm âĽxâĽ=âiââŁxiâ⣠and p=1, the task (22) corresponds to the well-studied robust PCA problem [35]. If, instead of the l1â-norm, we use the l2â-norm and set p=1, this leads to another well-studied task, which is known as the outlier pursuit problem [36, 37]. In the case of the l2â-norm and a general pâĽ1 we obtain the lpâ subspace approximation problem [38, 39]. Note that, except for the l2â subspace approximation problem, all these problems are NP-hard. In I we describe an algorithm for the WD-PCA in the case of l2â-norm and p=1.
Sufficient dimension reduction with optimized regression function (SDR-ORF).
Given a labeled dataset {(xiâ,yiâ)}i=1Nâ where xiââRn,yiââC (C is a finite set of classes for a classification, or R for a regression problem), the sufficient dimension reduction problem can be informally described as a problem of finding vectors w1â,âŻ,wkââRn such that conditional distributions satisfy p(yâŁw1Tâx,âŻ,wkTâx)âp(yâŁx) (possibly, under some additional assumptions on the form of p(yâŁx)).
We formulate the SDR-ORF problem as an optimization task:
[TABLE]
The object f:RnâR is a smooth real-valued function.
We assume that f is a candidate for the regression function and J(f) is a cost function that values how strongly f fits in this role. In practice for the regression case and for the binary classification case with 0-1 outputs we use the following cost functions correspondingly:
[TABLE]
and
[TABLE]
where H(y,p)=âylogpâ(1ây)log(1âp) and Ď >0 is a parameter.
By requiring fâFkâ, we assume that the regression function f satisfies (for k fixed in advance):
f(x)=g(w1Tâx,âŻ,wkTâx),
where w1â,âŻ,wkââRn. Thus, given an input x, an output of f depends on the projection of x onto span(w1â,âŻ,wkâ). The set span(w1â,âŻ,wkâ) is called the effective subspace. In J we describe an algorithm for the SDR-ORF problem.
5 Reduction of the optimization problem to ordinary functions
The central problem that our paper addresses is the optimization of an objective function over Gkâ˛â? In this section we suggest an approach based on penalty functions and kernels.
5.1 The definition of the penalty function
In this subsection we introduce a penalty function R(f). Let M:RnĂRnâC111Throughout the paper the kernel that induces the MMD distance is denoted by K and the kernel that is used to define a penalty is denoted by M. be some bounded function such that [M(ziâ,zjâ)]i,jâ[x]â is a positive semidefinite matrix for any {ziâ}iâ[x]ââRn,xâN. For f,g:RnâC let us denote
[TABLE]
For f,gâL1â(Rn),
[TABLE]
For general f,gâSâ˛(Rn) the expression â¨fâŁMâŁg⊠is defined if there are fĎľâ,gĎľââL1â(Rn) such that TfĎľââ=fâGĎľnâ, TgĎľââ=gâGĎľnâ and limĎľâ0ââ¨fĎľââŁMâŁgĎľââŠ=A<â. Then, â¨fâŁMâŁgâŠ=defA. For example, for continuous M we have â¨Î´nâŁMâŁÎ´nâŠ=M(0,0).
One can build a Gram matrix from the collection of functions {xiâf}i=1nâ, [â¨xiâfâŁMâŁxjâfâŠâ]1â¤i,jâ¤nâ. Let us denote a real part of the Gram matrix
[TABLE]
by Mfâ.
Theorem 3 concludes, from fâGkâ, that
dimspanRâ{x1âf,x2âf,âŻ,xnâf}â¤k.
Theorem 4**.**
Let M(x,y) be a bounded Lipschitz function. If f=(Tgââδnâk)UââGkâ˛â is such that {xiâg}i=1kââL1â(Rk),
then â¨xiâfâŁMâŁxjâf⊠is defined and rankMfââ¤k.
Definition 1**.**
Let AâRnĂn be a positive semidefinite matrix with eigenvalues Îť1ââĽÎť2ââĽâŻâĽÎťnâ (with counting multiplicities). Then, the Ky Fan kâanti-norm of A is âĽAâĽkâ=âi=1kâÎťn+1âkâ.
Let
[TABLE]
By construction, by penalizing the value of R(f), we enforce Mfâ to be close to some matrix of rank k. Equivalently, we enforce a real part of the Gram matrix of {xiâf}iâ[n]â to be of close to a rank k matrix. By Theorem 3, the condition
dimspanRâ{x1âf,x2âf,âŻ,xnâf}â¤k implies fâGkâ˛â, therefore, we enforce f to be close to some function from Gkâ˛â. In the next section we will justify the latter informal logic by reducing the optimization over Gkâ˛â to the optimization over ordinary functions with the penalty function R(f).
5.2 Proper kernels
For a function M(x,y):RnĂRnâC, let us denote by OMâ a linear operator between Dom(OMâ) and L2â(Rn) given by OMâ[f]=âŤRnâM(x,y)f(y)dy where Dom(OMâ)={fâL2â(Rn)âŁOMâ[f]âL2â(Rn)}.
For any operator O between spaces H1â and H2â, we denote its range by Range[O]={O(x)âŁxâH1â}.
Definition 2**.**
The function M(x,y):RnĂRnâC is called the proper kernel if and only if
OMâ:L2â(Rn)âL2â(Rn)* is a properly defined, strictly positive and self-adjoint operator,*
2. 2.
maxx,yââŁM(x,y)âŁ<â,
3. 3.
Range[OMâ]âŠS(Rn)â=S(Rn).
Note that the latter definition implies that M(y,x)=M(x,y)â (modulo some null set) and â¨f,OMâ[f]âŠL2â(Rn)â>0,âfâL2â(Rn),fî =0.
Example 2**.**
The Gaussian kernel is of special interest in applications:
M(x,y)=GĎnâ(xây).
It is captured by the following lemma:
Lemma 1**.**
If Îś,Îś^ââC(Rn) are bounded, âxÎś^â(x)>0, then M(x,y)=Îś(xây) is a proper kernel.
Proof.
Verification of the first three conditions is easy, so we only check the fourth condition.
Let us denote linear operators CÎśâ[f]=Îśâf and Ogâ[f](x)=g(x)f(x). Then we have F[CÎśâ[L2â(Rn)]]=OÎś^ââ[L2â(Rn)]âCcââ(Rn). Therefore, Range[OMâ]=CÎśâ[L2â(Rn)]âFâ1[Ccââ(Rn)]. Since Ccââ(Rn) is dense in S(Rn), then Fâ1[Ccââ(Rn)] also has this property. Thus, Range[OMâ]âŠS(Rn)â=S(Rn).
â
Besides the Gaussian kernel the lemma also captures a case of the Laplace kernel Îś(x)=eââŁxâŁ. It is well-known that the Fourier tranform of the Laplace kernel is the Poisson kernel: Îś^â(x)=(1+âŁxâŁ2)2n+1âcnââ (which is also proper).
For I:Gkâ˛ââŞS(Rn)âR+, it is natural to reduce the optimization task over tempered distributions
[TABLE]
to an optimization task over ordinary functions with a penalty termR,
[TABLE]
where we assume that the set of functions F is rich enough to approximate weakly solutions of (29), i.e. FâââGkâ˛â. Since we cannot guarantee that the minimum in (30) is attainable, we substitute it by infimum. For this reduction to be effective it is desirable to have the following property: if a sequence {fnâ}âF is such that I(fnâ)+ÎťnâR(fnâ)âfâFinfâ(I(f)+ÎťnâR(f))â+0 for Îťnâânââ+â (i.e. {fnâ} solves (30) for arbitrarily large values of the regularization parameter), then there exists a growing subsequence {nkâ} such that TfnkâââââT (weakly) where T is a solution of (29).
We make a thorough theoretical analysis of the case F=S(Rn). If to formulate in a simplified way, for the last property to hold, the sequence Tr(Mfnââ) should be bounded.
Details on the conditions under which this reduction holds can be found in the following subsection.
5.3 Regular solutions and reduction theorems for F=S(Rn)
For a sequence {fsâ}s=1âââSâ˛(Rn), sââLimâfsâ denotes a set of points fâSâ˛(Rn), such that there exists a growing sequence {siâ}âN and limiâââfsiââ=f.
For I:Gkâ˛ââŞS(Rn)âR+, it is natural to reduce the optimization task (29)
to an optimization task over ordinary functions with a penalty term (30).
To have an equivalence between (29) and (30) we need to assume that Iâs behaviour when approaching fâGkâ˛â from a set S(Rn) is continuous, i.e. for any sequence {fiâ}âS(Rn) such that TfiââââfâGkâ˛â, we have limiâââI(Tfiââ)=I(f).
Let us introduce the notion of a regular solution both for (29) and (30). Let
[TABLE]
Definition 3**.**
Any fâArgfâGkâ˛âminâI(f)âBkâ is called a regular solution of (29).
In other words, Bkâ formalizes a set of distributions from Gkâ˛â, that can be approached through sequences {fiâ}âGkâ, for which Tr(Mfiââ) does not blow up. Obviously, GkââBkââGkâ˛â. In applications, regular solutions include all ArgfâGkâ˛âminâI(f) if we choose the kernel M correctly. This regularity is important for a reduction to the penalty form (30), because when approaching a non-regular solution we are unable to guarantee a bounded behaviour of Mfâ (and of R(f)).
Definition 4**.**
A sequence {fiâ}1âââS(Rn) is said to solve (30) if
[TABLE]
where Ďľiââ+0 and Îťiââ+â,iâ+â. If, additionally, Tr(Mfiââ) is bounded, then {fiâ}1ââ is said to solve (30) regularly.
Let us define
[TABLE]
Theorem 5**.**
If M is a proper kernel, then
rsol(I(f),R(f))âArgfâGkâ˛âminâI(f).
Theorem 6**.**
If M is a proper kernel and rsol(I(f),R(f))î =â , then
[TABLE]
Theorem 7** (Reduction theorem).**
If M is a proper kernel, ArgfâGkâ˛âminâI(f)âBkâ and rsol(I(f),R(f))î =â , then
rsol(I(f),R(f))=ArgfâGkâ˛âminâI(f).
Suppose that we now solve a sequence of problems (30) and find {fsâ}1ââ. According to Theorems 5 and 6, the following are potential scenarios:
(1) Tr(Mfsââ) blows up and the convergence is not guaranteed. This situation can be avoided by controlling Tr(Mfâ) in an optimization process. In practice, when f has a parameterized form, this can be done by bounding parameters.
If Tr(Mfsââ) does not blow up, we still have two subcases:
(2.1) sââLimâTfsââî =â . This implies a positive outcome to approach (30) to the optimization problem, Problem (29).
(2.2) sââLimâTfsââ=â . This exotic situation can happen only if a sequence Tfsââ leaves any sequentially compact subset of Sâ˛(Rn). Bounding parameters also tackles this case.
Let us now concentrate on the task (30) and describe the alternating scheme for its solution.
6 The alternating scheme
We will concentrate on problem (30). It is known [25] that the Ky Fan anti-norm is a concave function, i.e. R(Ď)=âĽMĎââĽnâkâ depends on MĎâ in a concave way. It can be shown that the dependence of R(Ď) on Ď is both non-convex and non-concave, i.e. we deal with a non-convex optimization task.
Let B(H1â,H2â) denote a set of bounded linear operators between Hilbert spaces H1â and H2â.
For OâB(H1â,H2â) the rank of O is defined as dimR(O). Let L2râ(Rn) be the Hilbert space (over R) of real-valued functions from L2â(Rn) (i.e. the real-valued L2â-space) and L2ââ(Rn)=L2râ(Rn)ĂL2râ(Rn). The space L2ââ(Rn) is equivalent to L2â(Rn) treated as a linear space over R. Below we do not distinguish [Ď1â,Ď2â]âL2ââ(Rn) and Ď1â+iĎ2ââL2â(Rn).
It is easy to see that any OâB(L2ââ(Rn),Rn) can be given by formula:
[TABLE]
i.e. OâB(L2ââ(Rn),Rn) can be identified with a vector of functions
O=[Oiââ]i=1,nââ,OiââL2â(Rn) and the HilbertâSchmidt norm on B(L2ââ(Rn),Rn) (i.e. TrOâ Oâ) is
[TABLE]
Recall that for a Mercer kernel M, OMâ[Ď](x)=âŤRnâM(x,y)Ď(y)dy is a positive operator whose domain is Dom(OMâ)={fâL2â(Rn)âŁOMâ[f]âL2â(Rn)} and range is a subset of L2â(Rn). If we assume that Dom(OMâ) is dense in L2â(Rn), then its adjoint OMâ â and the square root OMââ:Dom(OMâ)âL2â(Rn) can be properly defined [40]. Thus, OMâ is self-adjoint. For any complex-valued function f such that TrMfâ<â let us introduce a linear operator Sfâ:L2ââ(Rn)âRn by the following rule:
[TABLE]
i.e. (Sfâ)iâ=OMââ[xiâf(x)],i=1,nâ. In the latter definition the expression â¨OMââ[xiâf(x)],ĎâŠL2â(Rn)â is finite due to â¨OMââ[xiâf(x)],OMââ[xiâf(x)]âŠL2â(Rn)â=(Mfâ)iiâ<â and the Cauchy-Schwarz inequality.
Theorem 8**.**
Let M be a Mercer kernel such that Dom(OMâ) is dense in L2â(Rn) and TrMfâ<â. Then, SfââB(L2ââ(Rn),Rn) and SfâSfâ â=Mfâ. Moreover,
[TABLE]
and the minimum is attained at S=PfâSfâ where Pfâ=âi=1kâuiâuiâ â and {uiâ}1kâ are unit eigenvectors of Mfâ corresponding to the k largest eigenvalues (counting multiplicities).
Proof.
The boundedness of Sfâ follows from the Cauchy-Schwarz inequality:
[TABLE]
and therefore:
[TABLE]
Thus, we have checked that Sfâ is bounded.
By definition, Sfâ â:RnâL2râ(Rn)ĂL2râ(Rn) and â¨u,Sfâ[Ď1â,Ď2â]âŠ=â¨Sfâ â[u],[Ď1â,Ď2â]âŠ, uâRn, [Ď1â,Ď2â]âL2râ(Rn)ĂL2râ(Rn). Let us denote f1â=Ref,f2â=Imf. It is easy to see that the following operator satisfies the latter identity:
[TABLE]
Since the adjoint is unique, then Sfâ â=O. Let us calculate SfâSfâ â:
[TABLE]
Thus, SfâSfâ â=Mfâ. Since TrSfâSfâ â<â and âĽSfâ â[u]âĽ2â¤â¨u,MfâuâŠ, we obtain Sfâ â is a bounded operator.
Let u1â,âŻunâ be orthonormal eigenvectors of Mfâ=SfâSfâ â and Îť1ââĽâŻâĽÎťnâ˛â>0 be corresponding nonzero eigenvalues. For Ďiâ=Îťiââ let us define
viâ=ĎiâSfâ â[uiâ]â.
Vector viâ corresponds to a pair of functions
[TABLE]
It is easy to see that v1â,âŻvnâ˛â is an orthonormal basis in ImSfâ â, and
Sfâ â can be expanded in the following way:
[TABLE]
and therefore,
SVD for Sfâ is
[TABLE]
By the Eckart-Young-Mirsky theorem (see Theorem 4.4.7 from [41]), an optimal S in SâB(L2ââ(Rn),Rn),rankSâ¤kminââĽSfââSâĽâ2â is defined by a truncation of SVD for Sfâ at kth term, i.e.
[TABLE]
where Pfâ=âi=1kâuiâuiâ â is a projection operator to first k principal components of Mfâ. Moreover, âĽSfââPfâSfââĽ2=âi=k+1nâ˛âĎi2â=âĽMfââĽnâkâ=R(f).
â
Given the new representation R(f)=SâB(L2ââ(Rn),Rn),rankSâ¤kminââĽSfââSâĽâ2â we have
[TABLE]
Thus, it is natural to view the Task (30) as a minimization of I(Ď)+ÎťâĽSĎââSâĽâ2â over two objects: fâF and SâB(L2ââ(Rn),Rn):rankSâ¤k.
The simplest approach to minimize a function over two arguments is to optimize alternatingly, i.e. first over f, and then over S:rankSâ¤k, and so on. Theorem 8 gives that the minimization over S is equivalent to the truncation of SVD(Sfâ) at the k-th term.
This idea, that we dub the alternating scheme (AS), is described in Algorithm 1.
The alternating algorithm 1 allows for a reformulation in the dual space. By this we mean that in Algorithm 1 we substitute Ďâtâ for the original Ďtâ. If the primal Algorithm 1 deals with operators SĎâ,SĎtâ1ââ, the dual version deals with vectors of functions GĎâââxâĎââ,GĎâââxâĎâtâ1ââ. Details of the dual algorithm can be found in E.
The objective I(f)+ΝR(f) can have many local minima due to the effect of the penalty term R(f). Therefore, the Alternating Scheme 1 is strongly dependant on the initialization step. One of such initialization procedures for the task (17) is described in the next section.
7 An approximate algorithm for the MMD-PCA
Let us analyze the task (17) in the case where K(x,y)=(xâ y)H(x,y) and H is a Mercer kernel (by construction, K is also a Mercer kernel). In this section we demonstrate that, given a distribution f(x)=N1ââi=1Nâδn(xâxiâ), a good guess for a k-dimensional space in which an optimal solution is supported is a span of the first k principal components of Hfâ (see the Algorithm 2).
A specifics of this type of kernels is that the MMD distance (induced by K) till an optimal solution of the MMD-PCA task is bounded below by the Ky Fan k-antinorm of Hfâ, as shown in the following theorem.
Theorem 9**.**
Let K(x,y)=(xâ y)H(x,y) where H:RnĂRnâR is a Mercer kernel, Dom(OKâ) is dense in L2â(Rn) and f is such that TrHfâ<â. Then,
[TABLE]
where âĽgâĽK2â=â¨gâŁKâŁgâŠ, Îť1ââĽâŻâĽÎťnâ are eigenvalues (counting multiplicities) of Hfâ.
Sketch..
Let us apply Theorem 8 to the kernel H and the function f.
Recall that an element OâB(L2ââ(Rn),Rn) can be identified with a vector-function [Oiâ]i=1nâ,OiââL2â(Rn) where O[Ď]iâ=Reâ¨Oiâ,ĎâŠL2â(Rn)â. Since Sfâ corresponds to [OHââ[xiâf(x)]]i=1nâ, the representation of Theorem 8 gives us
[TABLE]
The restriction rankSâ¤k is equivalent to dimLSââ¤k where LSâ={âi=1nâΞiâSiâ(x)âŁÎžiââR} and S corresponds to [Siâ]i=1nâ.
Let fâ˛âGkâ. By Theorem 3 we have dimspanRâ({x1âfâ˛,âŻ,xnâfâ˛})â¤k. By Theorem 4, â¨xiâfâ˛âŁHâŁxiâfâ˛âŠ is finite, therefore OHââ[xiâfâ˛(x)] can be properly defined and is in L2â(Rn). Therefore,
[TABLE]
For any fâ˛âGkâ one can set Siâ=OHââ[xiâfâ˛(x)] and search over all possible fâ˛âGkâ in the minimization operator. Thus,
[TABLE]
â
Let Îźdataâ be a uniform distribution over {xiâ}i=1Nâ and dMMDâ be the MMD distance induced by K(x,y)=(xâ y)H(x,y). The last theorem can be applied to a smoothed empirical distribution fÎľâ(x)=N1ââi=1NâGÎľnâ(xâxiâ) and then, we can send Îľâ0. All the more, the inequality will be satisfied if we search over ÎźâPkâ due to TÎźââGkâ. Thus,
[TABLE]
where Îť1ââĽâŻâĽÎťnâ are eigenvalues (counting multiplicities) of Hfâ, f(x)=N1ââi=1Nâδn(xâxiâ). Thus, (âi=k+1nâÎťiâ)1/2 is a lower bound of the solution of (17).
For such an important practical case as the HM-MMD-PCA, a multiple of the square root of the Ky Fan k-antinorm of Hfâ is also an upper bound.
Theorem 10**.**
Let H(x,y)=P(xâ y) where P(x)=c0â+c1âx+âŻ+clâ1âxlâ1, ciââĽ0,iâ[lâ1], f(x)=N1ââi=1Nâδn(xâxiâ)=TÎźdataââ and Îť1ââĽÎť2ââĽâŻâĽÎťnâ are eigenvalues of Hfâ. Then,
[TABLE]
The following corollary is straightforward from the last theorem.
Corollary 1**.**
A 2-approximating solution of the task (20) can be efficiently found by the Algorithm 2.
For the case when H is the Gaussian kernel the situation is slightly trickier.
Theorem 11**.**
Let H(x,y)=eâ2Ď2âĽxâyâĽ2â, miâ=âŤRnââĽxâĽiâŁf(x)âŁdx<â,iâ[4] and Îť1ââĽÎť2ââĽâŻâĽÎťnâ are eigenvalues of Hfâ. Then,
[TABLE]
where M=O(m2â+2âĎm4âm2ââ+2âĎm3â).
An analogous theorem can be proved for H(x,y)=(1+Ď2âĽxâyâĽ2)â2n+1â, i.e. the Poisson kernel.
8 Experiments
The alternating scheme 1 is a general optimization method that needs to be specified for every optimization task. We designed numerical specifications of the alternating scheme 1 for all 4 optimization tasks: (17), (20), (22) and (23) and made experiments with all of them. Details of the algorithms, i.e. numerical methods to minimize over Ď and calculate MĎtââ, can be found in Appendix. Note that for WD-PCA (22) we exploit the alternating scheme in the initial form (i.e. 1), and for MMD-PCA (17), HM-MMD-PCA (20) and SDR-ORF (23) we use the dual version of the scheme.
Behaviour of the Gaussian MMD-PCA for small h.
We studied the difference in the behavior of PCA and a solution of (17), for the distance function induced by the kernel K(x,y)=(8Ďh2)nâ1âeâ8h2âĽxâĽ2â, obtained by the alternating scheme 1 (AS for MMD-PCA), for the case when h is small compared to the standard deviation of features.
Experiments show that they are sharply different when data points are sampled along a low-dimensional manifold M, which is bent globally, goes through the origin O and has a large curvature at O (see Fig. 1(a)). Since generated points do not lie on an affine subspace, the global nature of PCA makes it hard to interprete principal directions.
We select a smooth function f:Rnâ1âR, such that f(0)=0 and generate points in the following way: points x1â,x2â,âŻ,xNââź[â10,10]nâ1 are sampled uniformly, after calculation of yiâ=f(xiâ) we add some noise: ziâ=(xiâ,yiâ)+Ďľiâ,ĎľiââźN(0,0.01Inâ). Both PCA and MMD-PCA are applied to the dataset (first 3 pictures on Figure 1(a)). As we see, MMD-PCA, unlike PCA, tries to catch ideal alignments of points rather that searching for a global alignment of points (which is non-existent). This property of MMD-PCA makes it a promising tool for a calculation of the tangent space to a data manifold at a given point. Fourth picture shows that when we have 2 equally important directions in data such that the first principal direction of PCA is between them (red line), and we set k=1, then MMD-PCA (green line) always chooses one of those directions. These experimental results are aligned with the theoretical observation given in Example 1, in which we show that the Gaussian MMD-PCA task for hâ0+ is equivalent to finding a k-dimensional subspace that contains as many points of a dataset as possible. Thus, the Gaussian MMD-PCA can be considered as a method that can be potentially used to tackle the latter NP-hard problem. Some informal discussion of this problem can be found in [42].
Experiments with outlier detection (MMD-PCA, HM-MMD-PCA, WD-PCA).
Following the experiment setup of [37], we choose parameters N=n=400,\updelta=0.05 (or 0.1), k=10 and generate random matrices AâRN(1â\updelta)Ăk,BâRnĂk whose entries are iid as N(0,1). Then, columns of the matrix BATâRnĂN(1â\updelta) (whose rank is â¤k) are concatenated with columns of the matrix CâRnĂN\updelta: X=concat(BAT,C)âRnĂN. The entries in C are either iid as N(0,1) (case I) or N\updelta copies of the same vector whose entries are iid as N(0,1) (case II). Let X=[x1â,âŻ,xNâ], i.e. columns of X are the data points.
Thus, N(1â\updelta) columns of BAT lie in a k-dimensional subspace of Rn and N\updelta columns of C are outliers, and solutions of tasks (17), (20) or (22) for this dataset are expected to be supported in a column space of BAT.
After every iteration (step t of the alternating scheme 1) we calculate the Frobenius distance between the projection operator Ptâ of 1 and the projection operator P to the column space of BAT, i.e. âĽPtââPâĽFâ. For the task (22), the dependence of âĽPtââPâĽFâ on t for different values of parameters \updelta and Îť is shown in Figure 1(b). For tasks (17), (20) the behaviour of the alternating scheme is similar, 7 iterations are enough to approach the optimal subspace.
One of main goals of this experimental setup was to study how the kernel M, that defines the regularizer R(f) by equation(28), affects the quality of a solution.
Besides the speed of convergence we were interested in how âĽPââPâĽFâ, where Pâ=limtâââPtâ is the final projection operator (e.g. P20â in practice), depends on the parameter Ď of the kernel M=GĎnâ (bandwidth). It is natural to expect the quality of the solution Pâ to degrade as Ďâ+â (this corresponds to M(x,y)â0), and, less trivially, as Ďâ0 (this corresponds to M(x,y)âδn(xây)). As the right plot on Figure 1(b) shows, for the HM-MMD-PCA, the solution Pâ is close to the correct P if the bandwidth Ď is in interval [eâ2,e3] and it degrades beyond that interval. For the Gaussian MMD-PCA the degrading occurs beyond [e1.3,e3]. For the WD-PCA the interval for Ď is sligtly narrower than [e1.3,e3]. Our numerical specification of the alternating scheme for WD-PCA involves training regularized Generative Adversarial Network (see for details I) and are based on numerically unstable algorithms for the Wasserstein distance minimization.
Finding numerical specifications for WD-PCA with a more stable behavior is a future work.
Experiments with SDR-ORF.
We made experiments on the standard datasets, Heart, Breast Cancer, Ionosphere, Diabetes, Boston House Prices and Wine Quality. First, we applied the Sliced Inverse Regression algorithm (SIR) [13] to the training set and calculated the effective subspace for k=2,3. All points were projected onto that space and we obtained two- or three-dimensional representations of input points. In the last step we applied the ten nearest neighbors algorithm (KNN) to predict outputs (based on reduced inputs) on the test set (for the regression case, the 10-KNN regression was used). The same scheme was repeated with PCA, Kernel Dimensionality Reduction (KDR) algorithm [18] and the alternating scheme 1 (AS) adapted for the SDR-ORF.
We experimented with the dual version of algorithm 1, setting (after the data was standardized) the kernelâs parameter Ď=0.8222Since the role of the parameter Ď is similar to that of the bandwidth in the kernel density estimation, we use Silvermanâs rule of thumb to set Ď=Nâ1/(n+4). and Îť=10.0. Details of its numerical implementation can be found in J.
In the table 1 one can see the obtained test set accuracy on the classification tasks and R2 on the regression tasks. As we see from the table 1, after reducing the dimension of an input to k=2,3, we are still able to obtain good accuracy of prediction on a test set and the AS for the SDR-ORF is competitive in comparison with other methods. Note that all listed datasets are of moderate size and our Python scripts managed to compute an effective subspace in 3-5 minutes on a PC with GTX Titan X (Pascal), Intel Core i7-7700K (4.20 GHz), 64 GB RAM.
Experiments with the shadow/black removal. We made experiments with Yale B dataset [43], which is a popular benchmark for testing robust versions of PCA.
That dataset contains images of 28 human subjects under 9 poses and 64 illumination conditions.
Test images used in the experiments are cropped and re-sized to 168x192 images, making the dimensionality of every image 32256. Thus, each human subject corresponds to a collection of 32256-dimensional vectors that lie on some low-dimensional subspace L of R32256. We search for this subspace, assuming that its dimension is either 1 or 5, using PCA and the Algorithm 2 with the kernel K(x,y)=(xâ y)eânâĽxâyâĽ2â (which we simply call Gauss). Our experiments
showed that behaviour of PCA and Gauss are quite similar if the dimension of L is 5, though Gauss removes more shadows and preserves more details of an original image if the dimension of L is 1 (see Figures 2 and 3). A processing of each human subject by Gauss takes seconds on Google Colab.
Experiments with the background modeling. For these experiments we use the dataset for testing background estimation algorithms SBMnet [44]. The dataset contains frames of short videos and the frame of a background for each video (so called the ground truth). Spatial resolutions of the videos vary from 240x240 to 800x600. Thus, a collection of frames of every video is a set of high-dimensional vectors (with a dimension up to 480000) that, again, lie on a low dimensional subspace L. We assume that the dimension of L is 5.
We recover L using PCA and the Algorithm 2 for kernels K(x,y)=âi=14â(nxâ yâ)i, K(x,y)=(xâ y)eânaâĽxâyâĽ2â, K(x,y)=(xâ y)eânâaâĽxâyâĽâ and K(x,y)=(xâ y)(1+naâĽxâyâĽ2â)â2n+1â (which we simply call Kurtosis, Gauss, Laplace and Poisson respectively). Recall that, according to corollary 1, this algorithm is 2-approximating for Kurtosis. Subsequently, we compute the median of the vectors, projected onto L, and define the latter to be the recovered background image (see Figure 4). Measures of consistency with the ground truth backgrounds are then calculated using Python scripts downloaded from [45]. Six measures are used: AGE (average of the gray-level absolute difference between a ground truth image and a computed background image), pEPs (percentage of pixels in a computed background whose value differs from the value of the corresponding pixel in a ground truth by more than a threshold), pCEPS (percentage of pixels whose 4-connected neighbors are also error pixels), MSSSIM (estimate of the perceived visual distortion), PSNR (Peak-Signal-to-Noise-Ratio, or 10log10â(MSE(Lâ1)2â) where L is the maximum number of grey levels and MSE is the mean squared error between a ground truth and a computed background images), CQM (Color image Quality Measure). Codes that compute listed metrics can be found in [45]. As shown on Table 6, experiments again demonstrated very similar behavior of PCA, Kurtosis, Gauss, Laplace and Poisson with very close accuracies
of the background reconstruction. Best measures of consistency with the ground truth images were achieved for Gauss (a=5.0,25.0) and Laplace (a=5.0). For a comparison with other methods, we also give accuracies of other methods based on low-rank approximation and an accuracy of a state-of-the-art method that was specifically tailored for that task [46]. On figure 5 one can see that background images computed by PCA and Gauss are almost identical, though Gauss is less likely than PCA to add local artefacts, such as blurs, noise etc.
The processing of the whole SBMnet dataset using PCA/Kurtosis/Gauss/ Laplace takes approximately the same time â 25 minutes on a cluster equipped with Intel Xeon Platinum 8168 Processors (33M Cache, 2.70 GHz) and 1TB RAM.
The code is available on github to facilitate the reproducibility of our results.
Scalability of algorithms. A major practical limitation of the alternating scheme 1 comes from the fact that it involves an optimization over a set of functions F, which in applications is either a feedforward neural network (as in our specifications of the AS for SDR-ORF, MMD-PCA, HM-MMD-PCA) or a generative neural network (WD-PCA). A speed of optimization is also strongly dependant on the objectiveâs landscape.
Thus, for large scale datasets, with a dimension of vectors âŤ103, and a sophisticated structure of a regression function (SDR-ORF) or a data distribution (MMD-PCA, HM-MMD-PCA, WD-PCA), the alternating scheme is substantially slower in comparison with other popular methods (such as PCA for the UDR, or SIR/KDR for the SDR).
In the special case of MMD-PCA (that includes HM-MMD-PCA), the approximate algorithm 2 can be used as a surrogate of the alternating scheme. It requires the same time as PCA and can be applied to datasets with a dimension of vectors âź106â107. Also, the Algorithm 2 can be used for an initialization of the alternating scheme.
9 Conclusions
We develop a new optimization framework for LDR problems.
The alternating scheme for the optimization task demonstrates both the computational efficiency and the applicability to real-world data. The algorithm performs quite stably when we vary most of the hyperparameters, though it crucially depends on two parameters, the bandwidth of the âsmoothingâ kernel M, Ď, and the penalty parameter Îť. We believe that the MMD-PCA/HM-MMD-PCA/WD-PCA methods for UDR could be used as an alternative to PCA in study fields in which data demonstrate âheavy-tailedâ and ânon-Gaussianâ behavior, such as financial applications or computer vision. Also, our formulation of SDR-ORF is free from any assumptions on the distribution of input-output pairs, which makes it an alternative to other methods of efficient subspace estimation.
A more detailed report on these topics is a subject of future research.
Appendix A Proofs for section 3
A.1 Proof of Theorem 1: given for completeness
Proof.
The inclusion Gkâ˛ââGkâââ follows from a well-known fact that S(Rk) is dense in Sâ˛(Rk). I.e. for any fâSâ˛(Rk) one can always find a sequence {fiâ}âS(Rk) such that Tfiââââf. Therefore, for any (fâδnâk)UââGkâ˛â there is a sequence {(Tfiâââδnâk)Uâ}âGkâ such that (Tfiâââδnâk)Uâââ(fâδnâk)Uâ. Thus, Gkâ˛ââGkâââ.
Since GkââGkâ˛â, to prove Gkâ˛â=Gkâââ it is enough to show that Gkâ˛â is sequentially closed.
We need a simple fact from a theory of distributions.
Lemma 2**.**
If TiâââT and ĎiââĎ, then â¨Tiâ,ĎiââŠââ¨T,ĎâŠ.
Proof of Lemma.
Schwartz space S(Rn) is a FrĂŠchet space, therefore the Banach-Steinhaus theorem applies to Sâ˛(Rn). Since TiâââT, we have supiââŁâ¨Tiâ,ĎâŠâŁ<â for any ĎâS(Rn). From the Banach-Steinhaus theorem, applied to a set {Tiâ}1ââ, we obtain for any Ďľ>0, there is a neighbourhood U of 0âS(Rn) such that âŁâ¨Tiâ,ĎâŠâŁ<Ďľ whenever ĎâU. Thus, âŁâ¨Tiâ,ĎiââĎâŠâŁ<Ďľ for a large enough i. From that we conclude that â¨Tiâ,ĎiââŠââ¨T,ĎâŠ.
â
For any TâSâ˛(Rn) and ĎâS(Rnâk), let us define TĎâSâ˛(Rk) as â¨TĎ,ĎâŠ=â¨T,ĎâĎâŠ.
Suppose that {fiâ}1âââSâ˛(Rk), {Uiâ}1ââ are such that (fiââδnâk)Uiââââf. We need to prove that fâGkâ˛â. Since a set of orthogonal matrices is compact, then one can always find a subsequence {Uniââ} such that UniâââU. Since (fniâââδnâk)Uniâââââf and Ď(Uniââx)âĎ(Ux) (for any fixed ĎâS(Rn)), using lemma 2 we obtain:
[TABLE]
Thus, we have fniâââδnâkââfUTâ.
From the last we see that fniââââfUTĎâ where Ď is such that Ď(0)=1.
Therefore, fUTâ=fUTĎââδnâk and f=(fUTĎââδnâk)UââGkâ˛â.
â
Let us prove that from T=(fâδnâk)Uâ, fâSâ˛(Rk), UTU=Inâ it follows that dimspanRâ{x1âT,x2âT,âŻ,xnâT}â¤k.
It is easy to see that xiâ[fâδnâk]=0 if i>k. If U=[u1â,âŻ,unâ]T, then for i>k we have 0=(xiâ[fâδnâk])Uâ=uiTâx(fâδnâk)Uâ=uiTâxT.
Thus, we have nâk orthogonal vectors, uk+1â,âŻ,unâ, such that
[TABLE]
Using standard linear algebra we obtain there are at most kⲠdistributions xi1ââT,âŻ,xikâ˛ââT,kâ˛â¤k that form a basis of spanRâ{xiâT}1nâ.
â
To prove the second part of theorem we need the following lemma.
Lemma 3**.**
If TâSâ˛(Rn) is such that yiâT=0 for any i>k, then TâGkâ˛â.
Proof of lemma.
Recall from functional analysis, for fâSâ˛(Rn), the tempered distribution âxiââfâ is defined by the condition â¨âxiââfâ,ĎâŠ=ââ¨f,âxiââĎââŠ.
Once the Fourier transform is applied, our lemmaâs dual version is equivalent to the following formulation: if âxiââfâ=0,i>k, then fâFkâââ. Let us prove it in this formulation.
Recall that a set of infinitely differentiable functions with a compact support is denoted by Ccââ(R).
Suppose ĎâS(Rn) and pâCcââ(R) are chosen in such a way that âŤââââp(yiâ)dyiâ=1, supppâ[A,B]. Let us define:
[TABLE]
It is easy to see that for any ÎąâNnâ1,Îąâ˛âN,βâNnâ1,βâ˛âN we have (at least one derivative over xiâ is present):
[TABLE]
The terms xâiÎąâxiÎąâ˛ââxâiβââxiβâ˛ââβ,βâ˛Ď(x)â and xiÎąâ˛ââxiβâ˛ââβâ˛p(xiâ)â are bounded by the definition of S(Rn),Ccââ(R). The boundedness of âŤââââxâiÎąââxâiβââβĎ(xâiâ,yiâ)âdyiâ is a consequence of the inequality
âŁxâiÎąââxâiβââβĎ(xâiâ,yiâ)ââŁâ¤1+yi2âCâ (which holds because ĎâS(Rn)).
Analogously (when not a single derivative over xiâ is present):
[TABLE]
The second term is 0 when xiââ¤A. It is also bounded when xiâ>A because âŁxâiÎąââxâiβââβĎ(xâiâ,yiâ)ââŁâ¤(1+yi2â)Îąâ˛+1Câ˛â. Therefore,
[TABLE]
The latter is bounded because limxiââ+âââŁxiââŁÎąâ˛âŤxiâââ(1+yi2â)Îąâ˛+1Câ˛âdyiâ=0.
The first term is 0 when xiââĽB and for xiâ<B it satisfies:
[TABLE]
The latter is also bounded, since limxiââââââŁxiââŁÎąâ˛âŤââxiââ(1+yi2â)Îąâ˛+1Câ˛âdyiâ=0.
Thus, xÎąâxβâβr(x)â is bounded and râS(Rn). Therefore, âxiââfâ=0 implies:
[TABLE]
Since this sequence of arguments works for any i>k, we can apply them sequentially to initial ĎâS(Rn) w.r.t. xk+1â,...,xnâ. Thus, for any pk+1â,...,pnââCcâ(R) such that âŤââââpiâ(yiâ)dyiâ=1 we obtain:
[TABLE]
Moreover, since Ccââ(R) is dense in S(R), we can assume that pk+1â,...,pnââS(R).
For the inverse Fourier transform T=Fâ1[f] the latter condition becomes equivalent to:
[TABLE]
for any pk+1â˛â,...,pnâ˛ââS(R) such that piâ˛â(0)=1. Let us define piâ˛â(xiâ)=eâxi2â. It is easy to check that T=gâδnâk where gâSâ˛(Rk),â¨g,ĎâŠ=â¨T,eââŁxk+1:nââŁ2Ď(x1:kâ)⊠for ĎâS(Rk). Thus, TâGkâ˛â and lemma is proved.
â
If dimspanRâ{x1âT,x2âT,âŻ,xnâT}â¤k, then
[TABLE]
Thus, there exist at least nâk orthonormal vectors vk+1â,âŻ,vnâ, such that [x1âT,âŻ,xnâT]viâ=0. Therefore, [x1âT,âŻ,xnâT]viâ=(viTâx)T=0.
Let us complete vk+1â,âŻ,vnâ to form an orthonormal basis of Rn: v1â,âŻ,vnâ. Let us define a matrix V=[v1â,âŻ,vnâ]. It is easy to see that:
[TABLE]
Since for i>k we have (viTâx)T=0, then xiâTVâ=0. Using lemma 3 we obtain TVââGkâ˛â. Therefore, (TVâ)VTâ=TâGkâ˛â. Theorem proved.
â
Appendix B Structure of WD-PCA
Recall that (Rn,âĽâ âĽ) is a
Banach space and pâĽ1.
Now, let us consider an optimization problem: for a given XâRnĂN solve
[TABLE]
where âĽâ âĽpâ is a norm on RnĂN defined by âĽ[s1â,âŻ,sNâ]âĽ=def(âi=1NââĽsiââĽp)1/p.
The following simple theorem shows that the two tasks are connected so that the solution of one directly leads to anotherâs solution.
Theorem 12**.**
Given data points {x1â,âŻ,xNâ}, let X=[x1â,âŻ,xNâ]âRnĂN. Then,
[TABLE]
Moreover, minνâPkââWpâ(Îźdataâ,ν) is attained on νâ, where νâ is a uniform distribution over {yiâ}i=1Nâ and [y1â,âŻ,yNâ]âargminYâRnĂN,rank(Y)â¤kââĽXâYâĽpâ.
Proof.
Let us prove first that infÎźâPkââWpâ(Îźdataâ,Îź)â¤N1ââĽXâYââĽpâ where
[TABLE]
Let Ď be a uniform distribution over {(xiâ,yiâ)}i=1Nâ and Îźâ be a uniform distribution over {yiâ}i=1Nâ. Since ĎâÎ (Îźdataâ,Îźâ), we obtain Wpâ(Îźdataâ,Îźâ)â¤(N1ââi=1NââĽxiââyiââĽp)1/p=Np1ââĽXâYââĽpâ. The support of Îźâ is k-dimensional, because rank(Yâ)â¤k. Thus, we have ÎźââPkâ and
infÎźâPkââWpâ(Îźdataâ,Îź)â¤Wpâ(Îźdataâ,Îźâ)â¤Np1ââĽXâYââĽpâ. Now, if we prove the inverse inequality, i.e. infÎźâPkââWpâ(Îźdataâ,Îź)âĽNp1ââĽXâYââĽpâ, this will imply that infÎźâPkââWpâ(Îźdataâ,Îź)=Np1ââĽXâYââĽpâ and therefore, infÎźâPkââWpâ(Îźdataâ,Îź)=Wpâ(Îźdataâ,Îźâ). This will in the end give us
ÎźââarginfÎźâPkââWpâ(Îźdataâ,Îź).
Let {Îźtâ}1ââ be such that
ÎźtââPkâ and Wpâ(Îźdataâ,Îźtâ)âinfÎźâPkââWpâ(Îźdataâ,Îź)â0. Let Ltâ denote a k-dimensional support of Îźtâ and Ptâ is a projection operator onto Ltâ.
Let Îźtââ be a uniform distribution over {Ptâx1â,âŻ,PtâxNâ}, i.e. Îźtââ(A)=N1ââi=1Nâ[PtâxiââA]. It is easy to see that Wpâ(Îźtââ,Îźdataâ)â¤Wpâ(Îźtâ,Îźdataâ), because Îźtââ and Îźtâ share the same k-dimensional support Ltâ, but the âtransportation of a massâ concentrated in point xiâ of the empirical distribution Îźempâ can be most optimally done by just moving it to Ptâxiâ (i.e. to the closest point on Ltâ). Thus, we have infÎźâPkââWpâ(Îźdataâ,Îź)â¤Wpâ(Îźdataâ,Îźtââ)â¤Wpâ(Îźdataâ,Îźtâ), and therefore, Wpâ(Îźdataâ,Îźtââ)âinfÎźâPkââWpâ(Îźdataâ,Îź)â0.
Since a set of projection operators is compact, one can always extract a subsequence {Ptsââ}s=1ââ, such that PtsâââP. It is easy to see that ÎźtsââââÎźââ (i.e. Wpâ(Îźtsâââ,Îźââ)â0) where Îźââ is a uniform distribution over {Px1â,âŻ,PxNâ}. For that distribution we have
[TABLE]
Thus, the infinum is attained on Îźââ.
It is easy to see that Wpâ(Îźdataâ,Îźââ)=Np1ââĽXâPXâĽpâ. Since rank(PX)â¤k we obtain Wpâ(Îźdataâ,Îźââ)âĽNp1âminYâRnĂN,rank(Y)â¤kââĽXâYâĽpâ=âĽXâYââĽpâ. This completes the proof.
â
Note that in the case of l1â norm and p=1, i.e. âĽxâĽ=âiââŁxiââŁ, the task 66 corresponds to the well-studied robust PCA problem [35].
If, instead of the l1â-norm, we use the l2â-norm and pâĽ1, this leads to another task:
[TABLE]
where âĽ[s1â,âŻ,sNâ]âĽp,2â=(âi=1NââĽsiââĽ2pâ)1/p. This task has many applications in mathematics and is known as the subspace approximation problem [39] .
Appendix C Proper kernels and proof of Theorem 4
C.1 Proof of Theorem 4
Let us first show that â¨fâŁMâŁg⊠is defined for any f=(Taââδnâk)UââGkâ and g=(Tbââδnâk)VââGkâ where a,bâL1â(Rk).
We have
[TABLE]
Let us denote aĎľâ=aâGĎľkâ and bĎľâ=bâGĎľkâ.
It is easy to see that
[TABLE]
From a well-known property of the Weierstrass transform we have
[TABLE]
From this we obtain that
[TABLE]
Thus, â¨fĎľââŁMâŁgĎľâ⊠is properly defined and
[TABLE]
where
[TABLE]
Let Ukâ,VkââRnĂn be matrices that comprise the first k rows of U,V correspondingly and nâk zero rows below. Also, let L denote the Lipschitz constant for M such that âŁM(x,y)âM(xâ˛,yâ˛)âŁâ¤L(âŁxâxâ˛âŁ+âŁyâyâ˛âŁ).
For the function MĎľâ(x1:kâ,y1:kâ) we have:
[TABLE]
Thus, there exists bounded M~(x1:kâ,y1:kâ)=M(UkTâx,VkTây) such that
[TABLE]
Further we assume that Ďľ>0 is small enough, so that MĎľâ(x1:kâ,y1:kâ)â¤C=2maxâŁM(x,y)âŁ.
Now we have:
[TABLE]
It is well-known (e.g. see Theorem 2.25 from [49]) that
âĽaĎľââaâĽLpââ, âĽbĎľââbâĽLpâââ0, âĽaĎľââĽL1âââ¤âĽaâĽL1ââ and âĽMĎľââM~âĽLââââ0. Thus, limĎľâ0ââ¨fĎľââŁMâŁgĎľâ⊠exists and â¨fâŁMâŁg⊠is defined.
Let us now prove that rankMfââ¤k.
The function fâGkâ˛â is such that f=(Tgââδnâk)Uâ where {xiâg}i=1kââL1â(Rk) and U=[w1â,âŻ,wnââ] is an orthogonal matrix. By construction,
[TABLE]
Let us now denote V=[u1â,âŻ,unââ]âRkĂn a submatrix of U in which only first k rows of U are present.
Then, the latter integral is equal to
[TABLE]
where
[TABLE]
is the Gram matrix of the collection {xiâg(x1:kâ)}i=1kââL1â(Rk).
For any f=(Tlââδnâk)UââGkâ and Ď>0, let us define fĎâ as
[TABLE]
We have TfĎââââ(Tlââδnâk)Uâ as Ďâ+0.
Lemma 4**.**
For any fâGkâ, limĎâ+0ââ¨xiâfĎââŁMâŁxjâfĎââŠ=0, for any (i,j)â/{1,...,k}2, and supĎâ[0,1]ââ¨xiâfĎââŁMâŁxjâfĎââŠ<â, for any (i,j)â{1,...,k}2.
Proof.
W.l.o.g. we can assume that f=Tlââδnâk,lâS(Rk).
If i>k,jâ¤k we have
[TABLE]
where P(x)=âŤRnâ2ĎĎ2ânâk1âyjâM(x,y)eâ2Ď2âŁyk+1:nââŁ2âlĎâ(y1:kâ)dy. Using the HĂślder inequality we obtain
[TABLE]
Since âŁM(x,y)âŁâ¤Îł for some Îł, we have
[TABLE]
Thus,
[TABLE]
Using âĽlĎââĽL1â(Rk)âââĽlâĽL1â(Rk)ââĎâ+00, âĽyjâlĎââĽL1â(Rk)âââĽyjâlâĽL1â(Rk)ââĎâ+00, we see the boundedness of âĽlĎââĽL1â(Rk)âÎłâĽyjâlĎââĽL1â(Rk)â and proceed
[TABLE]
It is easy to see that âĽ2ĎĎ2ânâk1âxiâeâ2Ď2âŁxk+1:nââŁ2ââĽL1â(Rnâk)ââ0 as Ďâ0, therefore â¨xiâfĎââŁMâŁxjâfĎââŠâ0.
Similarly, we can prove that â¨xiâfĎââŁMâŁxjâfĎââŠâ0 if i,j>k.
The entries of the main kĂk minor [â¨xiâfĎââŁMâŁxjâfĎââŠ]1â¤i,jâ¤kâ are bounded, due to
[TABLE]
Again, using âĽlĎââĽL1â(Rk)âââĽlâĽL1â(Rk)ââĎâ+00, âĽyjâlĎââĽL1â(Rk)âââĽyjâlâĽL1â(Rk)ââĎâ+00, we obtain the boundedness of RHS.
â
Corollary 2**.**
For any fâGkâ, limĎâ0âR(fĎâ)=0.
Proof.
W.l.o.g. we can assume that f=Tlââδnâk,lâS(Rk).
By lemma, all entries of MfĎââ except those of the main kĂk minor approach [math] as Ďâ0. This means that
limĎâ+0âQ(fĎâ)=0,
where Q(fĎâ)=âi=k+1nââ¨xiâfĎââŁMâŁxiâfĎââŠ. Let v1â,âŻ,vnâ be unit eigenvectors of MfĎââ corresponding to the eigenvalues Îť1ââĽâŻâĽÎťnâ, P=âi=k+1nâeiâeiTâ, then
[TABLE]
Since R(fĎâ)â¤Q(fĎâ), we obtain limĎâ0âR(fĎâ)=0.
â
Suppose that a sequence {fiâ}s=1âââS(Rn) regularly solves (7) and TâiââLimâfiâ. W.l.o.g. we can assume that TfiââââT and Tr(Mfiââ) is bounded and I(fiâ)+ÎťiâR(fiâ)â¤fâS(Rn)infâI(f)+ÎťiâR(f)+Ďľiâ,Ďľiââ0.
Below we use continuity of I and corollary 2:
[TABLE]
from which we conclude that ÎťiâR(fiâ)â¤fâGkâinfâI(f)+Ďľiâ and, therefore, R(fiâ)âiââ0.
For each l, let us define Plâ as the projection operator to a subspace spanned by first principal components of the matrix Mflâââ, i.e.
[TABLE]
where v1lâ,...,vklâ are orthonormal eigenvectors that correspond to k largest eigenvalues of Mflâââ. From the Eckart-Young-Mirsky theorem we see that R(flâ)=âĽMflââââPlâMflââââĽF2â. Since a set of all projection operators {PâRnĂnâŁP2=P,PT=P} is a compact subset of Rn2, one can always find a projection operator P=âi=1kâviâviTâ and a growing subsequence {lsâ} such that âĽPlsâââPâĽFââ0 as sââ. Thus, for the subsequence {flsââ} we have
[TABLE]
and using the boundedness of Tr(Mfsââ) we obtain âĽMflsâââââPMflsâââââĽFââ0.
Since âĽMflsâââââPMflsâââââĽFââ0, let us complete v1â,...,vkâ to an orthonormal basis v1â,...,vnâ and make the change of variables yiâ=viTâx. Let us denote V=[v1â,...,vnââ] and let VT=[w1â,...,wnââ]. Then, after that change of variables any function f(x) corresponds to fâ˛(y)=f(Vy) and the kernel M corresponds to Mâ˛(y,yâ˛)=M(Vy,Vyâ˛). After we apply that change of variables in the integral expression of â¨xiâfâŁMâŁxjâfâŠ, we obtain
[TABLE]
I.e. Mfâ=VMfâ˛â˛âVT, or Mfâ˛â˛â=VTMfâV. Note that P=VInkâVT where Inkâ is a diagonal matrix whose main kĂk minor is the identity matrix, and all other entries are zeros.
Using the fact that the Frobenius norm of orthogonally similar matrices are equal and the identity VTMflsââââV=VTMflsâââVâ, we obtain
[TABLE]
Thus, the property âĽMflsâââââPMflsâââââĽFââ0 implies
[TABLE]
Moreover, for i=j we have Reâ¨yiâflsââ˛ââŁMâ˛âŁyjâflsââ˛ââŠ=â¨yiâflsââ˛ââŁMâ˛âŁyjâflsââ˛ââŠ.
It is easy to see that after the change of variables we still have flsââ˛âââTVâ.
Since flsââ˛ââS(Rn), we have yiâflsââ˛ââS(Rn) and, therefore, yiâflsââ˛ââL2â(Rn). Let us treat now MⲠas an operator OMâ˛â:L2â(Rn)âL2â(Rn),OMâ˛â[f](x)=âŤRnâMâ˛(x,y)f(y)dy.
Let us take any function ĎâL2â(Rn) such that Ď=OMâ˛â[Ď]âS(Rn).
Since OMâ˛â is a strictly positive self-adjoint operator, by the Cauchy-Schwarz inequality, we obtain
[TABLE]
Therefore, for any ĎâRange[OMâ˛â]âŠS(Rn) and i>k we have limsââââ¨yiâflsââ˛â,ĎâŠ=limsââââ¨flsââ˛â,yiâĎâŠ=0. Since flsââ˛âââTVâ we obtain â¨TVâ,yiâĎâŠ=â¨yiâTVâ,ĎâŠ=0 for any ĎâRange[OMâ˛â]âŠS(Rn). But the denseness of Range[OMâ˛â]âŠS(Rn) in S(Rn) implies that yiâTVâ=0.
Using lemma 3 and (TVâ)VTâ=T we obtain TâGkâ˛â.
Thus, we proved that TfiâââTâGkâ˛â.
Since I(fiâ)â¤I(fiâ)+ÎťiâR(fiâ)â¤fâGkâ˛âinfâI(f)+Ďľiâ and I is continuous, we finally get that I(T)â¤fâGkâ˛âinfâI(f), i.e. TâArgfâGkâ˛âminâI(f).
â
D.0.2 Proof of Theorem 6
Proof.
Suppose fââArgfâGkâ˛âminâI(f)âBkâ,
i.e. fââBkâ and I(fâ)=fâGkâ˛âminâI(f). Since fââBkâ, then there exists a sequence {si}âGkâ such that Tsiâââfâ and isupâTrMsiâ<â.
Let us define sĎiââS(Rn) as
TsĎiââ=TsiââGĎnâ.
Since limĎâ0âR(sĎiâ)=0 (lemma 4), there exists Ďiâ>0, such that R(sĎiâ)<i1â whenever 0<Ďâ¤Ďiâ. Also, by definition TrMsiâ=limĎâ0âTrMsĎiââ. Therefore, there exists Ďiâ˛â>0, such that TrMsĎiââ<TrMsiâ+1 whenever 0<Ďâ¤Ďiâ˛â.
If we set Ďiââ=min{Ďiâ,Ďiâ˛â,i1â}, then a sequence {sĎiââiâ}âS(Rn) satisfies
[TABLE]
and (using lemma 2)
TsĎiââiââââfâ.
Due to the continuity of I we have limiâââI(sĎiââiâ)=I(fâ).
Now we set fiâ=sĎiââiâ, Îťiâ=R(fiâ)â1â and we obtain the needed sequence:
[TABLE]
where TrMfiââ is bounded. It remains to check that our sequence regularly solves (7), i.e. limiâââfâS(Rn)infâI(f)+ÎťiâR(f)=I(fâ) (this will imply limiâââI(fiâ)+ÎťiâR(fiâ)âfâS(Rn)infâI(f)+ÎťiâR(f)=0). The inequality in one direction is obvious,
[TABLE]
Let us prove the inverse inequality.
Since rsol(I(f),R(f))î =â , there exists {f~âiâ}âS(Rn) such that
[TABLE]
and a=limiâ+ââTf~âiââ. From theorem 5 we obtain aâArgfâGkâ˛âminâI(f).
One can always find a subset {Îť~diââ}â{Îť~iâ} such that Îť~diââ<Îťiâ, Îť~diââââ and obtain
[TABLE]
Therefore,
[TABLE]
This proves that {fiâ} regularly solves (7) and limiâââfiâ=fâ i.e. fâârsol(I(f),R(f)).
â
Appendix E The alternating scheme in the dual space for M(x,y)=Îś(xây)
When M(x,y)=Îś(xây), the alternating scheme 1 allows for a reformulation in the dual space. By this we mean that in Scheme 1 we substitute Ď^âtâ for the original Ďtâ. If the primal Scheme 1 deals with operators SĎâ,SĎtâ1ââ, the dual version deals with vectors of functions Îś^âââxâĎ^ââ,Îś^âââxâĎ^âtâ1ââ.
The substitution is based on the following simple fact:
Theorem 13**.**
If M(x,y)=Îś(xây),Îś,Îś^ââC(Rn) and âxÎś^â(x)>0, then there exist constants c1â and c2â such that âĽSĎââPtâ1âSĎtâ1âââĽâ2â=c1ââĽâĽâxâĎ^âââPtâ1ââxâĎ^âtâ1âââĽ2ââĽL2,Îś^ââ(Rn)2â and
â¨xiâfâŁMâŁxjâfâŠ=c2ââ¨âxiââf^ââ,âxjââf^âââŠL2,Îś^ââ(Rn)â
Proof.
Let f:RnâC be such that âĽxiâfâĽL2â(Rn)â<â.
[TABLE]
Since Sfâ[Ď]iâ=Reâ¨(Sfâ)iâ,ĎâŠâReâ¨(Sfâ)iââ,Ď^ââŠ, we obtain
[TABLE]
where Îş is a constant.
Let us now introduce a vector of functions Vfâ=[(Sfâ)1â,âŻ,(Sfâ)nââ]TâL2nâ(Rn).
Using 105 we obtain (Sfâ)iââ=κΜ^âââxiââf^ââ, and therefore Vfâ=κΜ^âââxâf^ââ.
Thus, the expression âĽSĎââPtâ1âSĎtâ1âââĽâ2â in the alternating scheme can be rewritten as
[TABLE]
The matrix Mfâ can also be calculated from f^â using the following identity:
[TABLE]
â
Let us introduce a function I~ such that I~(f)=I(f^â).
Then, we see that all steps in Scheme 1 can be performed with Ď^âtâ rather than with Ďtâ, using the algorithm 3.
Informally, the dual algorithm works as follows: at each iteration t we compute a function Ď^âtâ adapting it to data (the term I~(Ď^â)) and adapting its gradient field to the rank reduced gradient field of the previous Ď^âtâ1â. For a sufficiently large T, it will converge and Ď^âTââĎ^âTâ1â. Then, the second term in the last step will be approximately equal to ÎťâĽâĽâxâĎ^âTâââPTâ1ââxâĎ^âTâââĽ2ââĽL2,Îś^ââ(Rn)2â, enforcing âxâĎ^âTâââPTâ1ââxâĎ^âTââ for random xâźâĽÎś^ââĽL1ââÎś^ââ.
Thus, gradients âxâĎ^âTââ lie in a k-dimensional subspace colPTâ1â. This last property is a characteristic property of functions from Fkâ.
Absolutely analogously to the Algorithm 3, one can construct a dual algorithm that deals with inverse Fourier transforms of functions, i.e. with Fâ1[Ď], Fâ1[Ďtâ], Mtâ=[Reâ¨âxiââFâ1[Ďtâ]â,âxjââFâ1[Ďtâ]ââŠL2,Fâ1[Îś]â(Rn)ââ] etc. This version of the dual alternating scheme will be used for designing numerical algorithms for the Gaussian MMD-PCA and HM-MMD-PCA.
Let X=[x1â,âŻ,xNâ]. Note that Hfâ=XSXT where S=[P(xiââ xjâ)]i,jâ[N]â. For any UâO(n) and we have
[TABLE]
Let U=[u1â,âŻ,unâ] where {uiâ}i=1nâ are eigenvectors such that Hfâuiâ=Îťiâuiâ. Then, the rotated distribution fUâ is such that HfUââ is diagonal. Note that infνâPkâââĽfUââTνââĽKâ=infνâPkâââĽfâTνââĽKâ.
Therefore, w.l.o.g. we can assume that principal components of Hfâ are e1â,âŻ,enâ and Hfâeiâ=Îťiâeiâ,iâ[n], where {eiâ}i=1nâ is a canonical basis in Rn and Îť1ââĽÎť2ââĽâŻâĽÎťnâ are eigenvalues of Hfâ. From the latter we conclude that â¨xiâfâŁHâŁxjâfâŠ=Îťiâδijâ. Using P(xâ y)=âj=0lâ1âcjââÎąâ(NâŞ{0})n:âŁÎąâŁ=jâÎą!j!âxÎąyÎą, we obtain
[TABLE]
For an input distribution f(x)=N1ââi=1Nâδn(xâxiâ), let us denote fâ˛(x)=N1ââi=1Nâδk(x1:kââ(xiâ)1:kâ)âδnâk(xk+1:nâ), where x=[x1:kâ,xk+1:nâ] and z1:kââRk equals the first k components of zâRn. By construction,
[TABLE]
for i1â,âŻ,isââ[k] and
[TABLE]
whenever ijââ[n]â[k] for at least one jâ[s]. Therefore,
[TABLE]
The second sum F2â equals âi=k+1nâÎťiâ.
Let us compare the first sum, F1â, with the second, F2â.
F1â is a sum of positive factors of (ExâźfâxÎą)2 where Îą1:kâî =0,Îąk+1:nâî =0. Let eiââ(NâŞ{0})n be an ith canonical unit vector and Îąiâ denote an ith component of Îą. The coefficient in front of (ExâźfâxÎą)2 in F1â equals AÎąâ=câŁÎąâŁâ1â(âŁÎąâŁâ1)!âiâ[k]:Îąiâ>0â(Îąâeiâ)!1â=Îą!câŁÎąâŁâ1â(âŁÎąâŁâ1)!ââŁÎą1:kâ⣠and the coefficient in front of (ExâźfâxÎą)2 in F2â equals BÎąâ=câŁÎąâŁâ1â(âŁÎąâŁâ1)!âiâ[n]â[k]:Îąiâ>0â(Îąâeiâ)!1â=Îą!câŁÎąâŁâ1â(âŁÎąâŁâ1)!ââŁÎąk+1:nââŁ. Since âŁÎąk+1:nââŁâĽ1 and âŁÎą1:kââŁâ¤lâ1, we conclude that AÎąââ¤(lâ1)BÎąâ. Therefore, F1ââ¤(lâ1)F2â.
Thus, overall we have
[TABLE]
From âĽTÎźââTνââĽK2â=dMMDâ(Îź,ν)2 the statement of theorem directly follows.
â
In our proof of Theorem 11. we will need the following classical theorem.
Theorem 14** (The Gaussian PoincarĂŠ inequality).**
Let us assume w.l.o.g. that principal components of Hfâ are e1â,âŻ,enâ and Hfâeiâ=Îťiâeiâ,iâ[n], where {eiâ}i=1nâ is a canonical basis in Rn and Îť1ââĽÎť2ââĽâŻâĽÎťnâ are eigenvalues of Hfâ. From the latter we conclude that â¨xiâfâŁHâŁxjâfâŠ=Îťiâδijâ.
Note that H(x,y)=Fâ1[GĎâ](xây).
Let f^â=F[f],
[TABLE]
and fâ˛=Fâ1[fâ˛^â] where x=[x1:kâ,xk+1:nâ]. From the isometry of the Fourier transform, we have
[TABLE]
The latter expression decomposes into two terms. The first is
[TABLE]
The second is
[TABLE]
The latter can be bounded using the Gaussian PoincarĂŠ inequality by
[TABLE]
After changing an order of summations one can bound the internal sum using integration by parts, i.e.
[TABLE]
Then, using the CauchyâSchwarz inequality we bound the latter by
[TABLE]
The first term equals (Cnâ1âÎťjâ)1/2 and the second term, after making the inverse Fourier transform, is bounded by
[TABLE]
where Tiâ=Fâ1[Ď2xi2âGĎââ]=eâĎ2âĽxâĽ2/2(1âĎ2xi2â). Since Tiâ(xây)=H(x,y)(1âĎ2xi2ââĎ2yi2â+2Ď2xiâyiâ), we have â¨xiâxjâfâŁTiâ(xây)âŁyiâyjâfâŠ=â¨xiâxjâfâŁHâŁxiâxjâfâŠâ2Ď2â¨xi3âxjâfâŁHâŁxiâxjâfâŠ+2Ď2â¨xi2âxjâfâŁHâŁxi2âxjâfâŠ. After noting that
[TABLE]
we finally obtain
[TABLE]
where M=O(âŤRnââĽxâĽ2âŁf(x)âŁdx+2âĎâŤRnââĽxâĽ4âŁf(x)âŁdxâŤRnââĽxâĽ2âŁf(x)âŁdxâ+2âĎâŤRnââĽxâĽ3âŁf(x)âŁdx).
By construction, fâ˛âGkâ˛â and it can be approached by elements of Gkâ w.r.t. norm âĽâ âĽKâ. The statement of theorem directly follows from this observation.
â
Appendix G A numerical alternating scheme for the Gaussian MMD-PCA
G.1 Structure of Fâ1[Pkâ]
From theorems 1 and 2, Fâ1[Pkâ]âFkâââ.
In fact, Bochnerâs theorem [50] gives us that the inverse Fourier transform of any positive finite Borel measure is a continuous positive definite function. That is, if fâFâ1[P], then for any distinct y1â,âŻ,ysââRn the matrix
[f(yiââyjâ)]i,j=1,nââ is positive semidefinite. Since Îź(Rn)=1, we additionally have f(0)=1.
Let PDF denote the set of all continuous positive definite functions on Rn and
[TABLE]
Thus, the following characterization of Fâ1[Pkâ] becomes evident.
Theorem 15**.**
Fâ1[Pkâ]=Mkâ.
G.2 The dual form of the Gaussian MMD-PCA
Recall that k(x)=Ghnâ(x). Let us define another Gaussian kernel Îł(x)=eâ2h2âŁxâŁ2â=Fâ1[k].
Let pdataâ(x) denote the characteristic function of the random vector XdataââźÎźdataâ. By definition, pdataâ(x)=E[eiXdataTâx]=N1ââi=1NâeixiTâx.
Thus, pdataâ=Fâ1[Îźdataâ] and Îźdataâ=F[pdataâ].
Using the isometry property of the inverse Fourier transform for L2â(Rn) and the convolution theorem, we see that
[TABLE]
Thus, from Theorem 15 we obtain that the task 17 is equivalent to
[TABLE]
where L2,Îł2â(Rn)=defL2,νâ(Rn) with dν=Îł2dx.
G.3 Algorithms for the Gaussian MMD-PCA
Let Î kâ:Gkââ{1,+â} and Mkâ:Fkââ{1,+â} be simple penalty functions:
From the result of the previous section we see that if I(Ď)=I~(Ď^â), then
[TABLE]
Thus, the Algorithm 4 is an adaptation of Algorithm 3 to MMD-PCA.
If the function pdataâ is real-valued, then only real-valued functions can appear in the Algorithm 4. This assumption can be satisfied by adding reflections of initial points to the dataset (after it was centered).
At step 1, we search over q given in the following parameterized form:
[TABLE]
where Îąiâ>0 and âi=1nnâÎąiâ=1. In our implementation, we set [Îąiâ]i=1,nnââ=softmax([uiâ]i=1,nnââ) and uiââs are unconstrained. The number of neurons in a single layer neural network with a cosine activation function, nn, is a hyperparameter. Let us denote parameters {Ďiâ,uiâ}i=1nnâ by θ. It is easy to see the function qθâ is positive definite. Moreover, using Theorem 2 from [51], it can be shown that a set of all such functions, i.e. the convex hull of {cos(ĎTx)âŁĎâRn}, is dense in a set of real-valued functions from Mkâ. Though this parameterization is quite natural, finding architectures with more expressive power in a space of real-valued positive definite functions is an open problem.
Now, to minimize
[TABLE]
with stochastic gradient descent methods (in our case, the Adam optimizer) we need to have an unbiased estimator of
[TABLE]
where zâźf denotes that the random vector z is sampled according to the probability density function âŤRnâf(x)dxf(x)â.
Thus, a natural estimator of the gradient is
[TABLE]
where {ziâ}i=1mââźiidÎł2 and \{\text{\boldmath\xi}_{i}\}_{i=1}^{m}\sim^{iid}{\hat{\zeta}}.
The last important issue with the practical numerical algorithm is the calculation of Mtâ at step 2. By construction,
[TABLE]
In practice we sample Ď1â,âŻ,ĎlââźÎś^â and estimate Mtâ as
[TABLE]
The details of the numerical algorithm 5 are given below. In all our experiments with MMD-PCA we set Îś^â=Îł2.
Appendix H A numerical alternating scheme for HM-MMD-PCA
H.1 The dual form of HM-MMD-PCA
Due to a well-known relationship between moments of the probability measure Îź and its characteristic function p, i.e. ismi1ââŻisââ=âxi1âââŻâxisâââsp(0)â, the task (20) is equivalent to
[TABLE]
Note that the maximum mean discrepancy distance for the Gaussian kernel and the distance based on higher moments are substantially different. Indeed, even if we set h as a large value (which makes h1ââ0), the MMD distance, unlike the HM distance, neglects higher order derivatives of the characteristic functions in the neigbourhood of the origin. Moreover, from the dual form (135) it is clear that dHMâ(Îźdataâ,ν) is a degenerate case of a weighted Sobolev norm between characteristic functions of Îźdataâ and ν.
H.2 Algorithms for HM-MMD-PCA
Analogously to the case of MMD-PCA we see that the task (20) is equivalent to:
[TABLE]
and
[TABLE]
Thus, the Algorithm 6 is an adaptation of Algorithm 3 to HM-MMD-PCA.
Again, as in a numerical algorithm for MMD-PCA, at step 1, we search over q given in the form (129). The objective of step 1 can be represented as
[TABLE]
where U(1,n) is the discrete uniform distribution over {1,âŻ,n}.
To apply the stochastic gradient descent methods we need to have an unbiased estimator of âθâÎŚ(θ) which is equal to
[TABLE]
Thus, a natural estimator of the gradient is:
[TABLE]
where {a[s,i,j]}s=1,4â,i=1,m1ââ,j=1,sâââźiidU(1,n) and \{\text{\boldmath\xi}_{i}\}_{i=1}^{m_{2}}\sim^{iid}{\hat{\zeta}}. Overall, we obtain the following Algorithm 7.
Appendix I A numerical alternating scheme for WD-PCA
Let us consider the case p=1 and denote W(Îź,ν)=W1â(Îź,ν). By Theorem 12, the task 66 is equivalent to minÎźâPkââW(Îź,Îźdataâ), or to the following task:
[TABLE]
where I(TÎźâ)=W(Îź,Îźdataâ) if ÎźâPkâ and I(Ď)=â, if otherwise.
The alternating scheme 1 is designed to solve the penalty form of the problem, i.e.
[TABLE]
which is equivalent to
[TABLE]
where Spâ(Rn)âS(Rn) is a set of Schwartz functions that can serve as pdf: Ď(x)âĽ0, âŤRnâĎ(x)dx=1.
A numerical version of the alternating scheme requires additional specifications on a) how to minimize over Ď at step 1, and b) how to estimate MĎtââ.
I.1 How to minimize over Ď?
In the case of WD-PCA, the minimization step of the alternating scheme makes the following:
[TABLE]
where Sfâ=OMââ[xf(x)].
For a numerical implementation of that step we need to choose some family of functions that is dense in Spâ(Rn) (or, rich enough to approach the solution Îźâ). Following the tradition of GAN research let us assume that the family is given in the following form333If HâS(Rn) is not satisfied, then we can choose HĎľâ={ĎθââGĎľnââŁÎ¸âÎ} for a very small Ďľ.:
[TABLE]
where {gθââŁÎ¸âÎ} is a parameterized family of smooth functions (usually, a neural network) and p(z) is some fixed distribution (usually, the Gaussian distribution). Following [52], we make the assumption 1. In a numerical algorithm we need an access to a procedure that samples according to Ďθâ(x), not the function itself.
In practice, we choose a family of functions L={fwââŁwâW} and internal maximization is made over wâW with an additional penalty term that penalizes a violation of the Lipschitz condition: âx:âĽfxââĽâ¤1.
A family of minimax algorithms for the minimization of W(Ďθâ,Îźempâ) was developed in a series of papers [52, 53, 54].
The standard minimax scheme that gained popularity in GAN literature iterates two steps: a) niterâ times make a gradient ascent over wâW, b) make a gradient descent over θ.
The task 149 can be viewed as a Wasserstein GAN with an additional regularization term ÎťT(θ) where T(θ)=âĽSĎθâââPtâ1âSĎθtâ1ââââĽ2.
To adapt these algorithms to the minimization of our function, we only need to have an unbiased estimator of the gradient
âθâTâ. This estimator is needed for the generator to make its gradient descent step.
The discriminatorâs part of the algorithm (in which we maximize over Lipschitz functions fwâ) can be set in a standard fashion â we choose [55]âs version, in which the term max{0,âĽâxâfwââ(Ξx+(1âΞ)gθâ(z))âĽâ1}2 enforces Lipschitz condition (see step (*) of the Algorithm 8).
I.2 How to estimate âθâTâ and MĎθtâââ?
Another important aspect of the numerical algorithm is the complexity of estimating the matrix MĎθtâââ at step (**). The following theorem shows that we only need to sample zâźp a sufficient number of times to estimate âθâTâ and MĎθtâââ.
Theorem 16**.**
If Ďθâ is pdf of the random vector gθâ(z), zâźp(z), then
[TABLE]
where
[TABLE]
and RHS is well-defined.
To prove the theorem we need the following lemma first.
If Ez,zâ˛âźpââθâÎ(θ,z,zâ˛)â is well-defined (the proof of sufficiency of that condition is similar to the proof of Theorem 3 from [52]), then, using Leibniz integral rule, we obtain
[TABLE]
The fact that
[TABLE]
is obvious from the definition MĎθââ=Ex,yâźĎθââxyTM(x,y).
â
I.2.1 Definition of H
Specifically, for robust PCA/outlier pursuit applications, we define Ďθâ(x) as a probability density function of the random vector a+b, where a, b are independent and a is the i-th column of matrix θ1ââRnĂN (where iâźU(1,N) is sampled uniformly from {1,âŻ,N}),
b=gθ2ââ(c), câźN(0,Inâ) and gθ2ââ:RnâRn is a neural network with weights θ2â. Thus, θ=(θ1â,θ2â).
It can be checked that H, defined in this way, satisfies the Assumption 1. We specifically introduce the random vector a here because, according to Theorem 12, the ultimate solution of the problem corresponds to θ1â=Y and b=0. This guarantees that the solution is approachable from set H.
Appendix J A numerical alternating scheme for SDR-ORF
For a binary classification case, given a labeled dataset {(xiâ,yiâ)}i=1Nâ, xiââRn,yiââC, C={0,1} we formulate the sufficient dimension reduction problem as the minimization task
[TABLE]
where L(c,y)=âclog(y)â(1âc)log(1ây).
We apply the alternating scheme in the dual space (Algorithm 3) to this task. We set M(x,y)=Îś(xây), where Îś^â is a strictly positive probability density function. A numerical version of the scheme is given below (Algorithm 9).
At every iteration t=1,âŻ,T of the Algorithm 3 we solve the task (in our case I~=J)
[TABLE]
In a numerical version of the algorithm we assume that Ď^â is given as a neural network fθâ, i.e. our task becomes
[TABLE]
The gradient of the function \Phi(\theta)=J(f_{\theta})+\tilde{\lambda}{\mathbb{E}}_{\text{\boldmath\xi}\sim\hat{\zeta}}\|\frac{\partial f_{\theta}}{\partial{\mathbf{x}}}(\text{\boldmath\xi})-P_{t-1}\frac{\partial f_{\theta_{t-1}}}{\partial{\mathbf{x}}}(\text{\boldmath\xi})\|^{2} equals
[TABLE]
That is why âθâL (given to Adam optimizer in the gradient descent loop) in the Algorithm 9 is an unbiased estimator of âθâÎŚ(θ)â. Thus, in the âwhile loopâ we find optimal Ď^âtâ=fθtââ.
According to Algorithm 3, the next goal is to estimate
[TABLE]
It is easy to see that
[TABLE]
From the last we see that the matrix Mtâ can be estimated by sampling \text{\boldmath\chi}\sim\hat{\zeta} a sufficient number of times (the parameter l in our algorithm). All the rest is identical to Algorithm 3.
The regression version of the algorithm can be obtained by setting L(c,câ˛)=(câcâ˛)2.
Implementations for different databases can be found at github.
Bibliography55
The reference list from the paper itself. Each links out to its DOI / PubMed record.
1[1] William B. Johnson, Joram Lindenstrauss, and Gideon Schechtman. Extensions of lipschitz maps into banach spaces. Israel Journal of Mathematics , 54(2):129â138, Jun 1986.
2[2] John P. Cunningham and Zoubin Ghahramani. Linear dimensionality reduction: Survey, insights, and generalizations. Journal of Machine Learning Research , 16(89):2859â2900, 2015.
3[3] P.-A. Absil, R. Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds . Princeton University Press, 2009.
4[4] Laurent Schwartz. ThĂŠorie des distributions et transformation de fourier. Annales de lâuniversitĂŠ de Grenoble , 23:7â24, 1947.
5[5] S. Soboleff. MĂŠthode nouvelle Ă resoudre le problème de Cauchy pour les ĂŠquations linĂŠaires hyperboliques normales. Rec. Math. Moscou, n. Ser. , 1:39â71, 1936.
6[6] Qiong Wang, Junbin Gao, and Hong Li. Grassmannian manifold optimization assisted sparse spectral clustering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 3145â3153, 2017.
7[7] Jiayao Zhang, Guangxu Zhu, Robert W. Heath, and Kaibin Huang. Grassmannian learning: Embedding geometry awareness in shallow and deep learning, 2018.
8[8] Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando De Freitas. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Int. Res. , 55(1):361â387, January 2016.