Tensor-Train Parameterization for Ultra Dimensionality Reduction

Mingyuan Bai; S.T. Boris Choy; Xin Song; Junbin Gao

arXiv:1908.04924·cs.LG·August 15, 2019


TL;DR

This paper introduces TTPUDR, a robust tensor-train based dimensionality reduction method that outperforms existing techniques on high-dimensional classification tasks by effectively capturing spatial relations.

Contribution

It proposes a novel tensor-train parameterization for ultra dimensionality reduction, replacing the squared Frobenius norm objective of traditional LPP with a Frobenius norm objective for improved robustness.

Findings

1. TTPUDR outperforms previous methods in classification accuracy.

2. The tensor-train approach effectively captures spatial relations in high-dimensional data.

3. The model demonstrates robustness against outliers.

Tables

Table I: Comparison of evaluation criteria (OA: overall accuracy, AA: average accuracy, KC: Kappa coefficient) under TTPUDR, LPP and PCA on the Indiana dataset.

Criterion	PCA	LPP	TTPUDR
OA	0.7907	0.7810	0.7101
AA	0.7983	0.8191	0.7427
KC	0.7613	0.7497	0.6690

Table II: Comparison of evaluation criteria (OA, AA, KC) under TTPUDR, LPP and PCA on the Extended Yale B dataset.

Criterion	PCA	LPP	TTPUDR
OA	0.4461	0.4378	0.7557
AA	0.4937	0.4491	0.7731
KC	0.4312	0.4460	0.7491

Equations

$$\mathscr{Z}=\mathscr{X}\times_{\tilde{p}}^{\tilde{q}}\mathscr{Y}\tag{1}$$

$$\mathscr{Y}(i_{1},i_{2},\cdots,i_{n})=\mathscr{U}_{1}(:,i_{1},:)\,\mathscr{U}_{2}(:,i_{2},:)\cdots\mathscr{U}_{n-1}(:,i_{n-1},:)\,\mathscr{U}_{n}(:,i_{n},:)\tag{2}$$

$$\min_{\mathbf{A}}\sum_{i,j}\|\mathbf{A}^{\top}\mathbf{x}_{i}-\mathbf{A}^{\top}\mathbf{x}_{j}\|_{2}^{2}\,s_{ij}\quad\text{s.t.}\quad\mathbf{A}^{\top}\mathbf{X}\mathbf{D}\mathbf{X}^{\top}\mathbf{A}=\mathbf{I}\tag{3}$$

$$s_{ij}=\begin{cases}e^{-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{F}^{2}}{t}},&\text{if }\mathbf{x}_{i}\in\mathcal{N}_{k}(\mathbf{x}_{j})\text{ or }\mathbf{x}_{j}\in\mathcal{N}_{k}(\mathbf{x}_{i}),\\0,&\text{otherwise}\end{cases}$$

$$\mathbf{X}\mathbf{L}\mathbf{X}^{\top}\mathbf{a}=\lambda\mathbf{X}\mathbf{D}\mathbf{X}^{\top}\mathbf{a}\tag{4}$$

$$\mathbf{t}_{i}=\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{i})$$

$$\min_{\mathscr{U}_{1},\mathscr{U}_{2},\cdots,\mathscr{U}_{n}}\frac{1}{2}\sum_{i,j=1}^{N}\|\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{i})-\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{j})\|_{F}\,s_{ij}\quad\text{s.t.}\quad\mathbf{L}^{\top}(\mathscr{U}_{k})\mathbf{L}(\mathscr{U}_{k})=\mathbf{I}_{R_{k}},\ \forall k=1,\cdots,n\tag{5}$$

$$\widetilde{s}_{ij}=\frac{s_{ij}}{\|\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{i})-\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{j})\|_{F}}\tag{6}$$

$$\min_{\mathscr{U}_{1},\mathscr{U}_{2},\cdots,\mathscr{U}_{n}}\frac{1}{2}\sum_{i,j=1}^{N}\|\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{i})-\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{j})\|_{F}^{2}\,\widetilde{s}_{ij}\quad\text{s.t.}\quad\mathbf{L}^{\top}(\mathscr{U}_{k})\mathbf{L}(\mathscr{U}_{k})=\mathbf{I}_{R_{k}},\ \forall k=1,\cdots,n\tag{7}$$

$$\mathscr{T}_{1}(k)=\mathscr{U}_{1}\times_{3}^{1}\cdots\times_{3}^{1}\mathscr{U}_{k-1}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{k-1}\times R_{k-1}}$$

$$\mathscr{T}_{n}(k)=\mathscr{U}_{k+1}\times_{3}^{1}\cdots\times_{3}^{1}\mathscr{U}_{n}\in\mathbb{R}^{R_{k}\times I_{k+1}\times\cdots\times I_{n}\times R_{n}}$$

$$\mathscr{Y}_{k}=\left(\mathscr{X}\times_{1,2,\ldots,k}^{1,2,\ldots,k}\mathscr{T}_{1}(k)\right)\times_{2,3,\ldots,n+1-k}^{2,3,\ldots,n+1-k}\mathscr{T}_{n}(k)$$

$$\mathscr{Y}_{1}=\mathscr{X}\times_{2,\ldots,n}^{2,\ldots,n}\mathscr{T}_{n}(1)\in\mathbb{R}^{R_{1}\times R_{n}\times I_{1}\times N}$$

$$\mathscr{Y}_{n}=\mathscr{X}\times_{1,\ldots,n-1}^{1,\ldots,n-1}\mathscr{T}_{1}(n)\in\mathbb{R}^{R_{n-1}\times I_{n}\times N}$$

$$\min_{\mathscr{U}_{1}}\mathbf{V}(\mathscr{U}_{1})^{\top}\mathbf{H}_{1}\mathbf{V}(\mathscr{U}_{1})\quad\text{s.t.}\quad\mathbf{L}^{\top}(\mathscr{U}_{1})\mathbf{L}(\mathscr{U}_{1})=\mathbf{I}_{R_{1}}\tag{10}$$

$$\min_{\mathscr{U}_{k}}\mathbf{V}(\mathscr{U}_{k})^{\top}\mathbf{H}_{k}\mathbf{V}(\mathscr{U}_{k})\quad\text{s.t.}\quad\mathbf{L}^{\top}(\mathscr{U}_{k})\mathbf{L}(\mathscr{U}_{k})=\mathbf{I}_{R_{k}}\tag{11}$$

$$\min_{\mathscr{U}_{n}}\operatorname{trace}\left(\mathbf{L}^{\top}(\mathscr{U}_{n})\mathbf{H}_{n}\mathbf{L}(\mathscr{U}_{n})\right)\tag{12}$$

OA, AA, KC (definitions of the three evaluation criteria; the formulas are not recoverable from this extraction)

Taxonomy

Topics: Tensor decomposition and applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research

Full text

Tensor-Train Parameterization for Ultra Dimensionality Reduction

1st Mingyuan Bai
Discipline of Business Analytics
The University of Sydney Business School
The University of Sydney
Camperdown, NSW, Australia

2nd S.T. Boris Choy
Discipline of Business Analytics
The University of Sydney Business School
The University of Sydney
Camperdown, NSW, Australia

3rd Xin Song
Discipline of Business Analytics
The University of Sydney Business School
The University of Sydney
Camperdown, NSW, Australia
School of Computer Science
China University of Geosciences
Wuhan 430074, P. R. China

4th Junbin Gao
Discipline of Business Analytics
The University of Sydney Business School
The University of Sydney
Camperdown, NSW, Australia

Abstract

Locality preserving projections (LPP) is a classical dimensionality reduction method based on data graph information. However, LPP is still sensitive to extreme outliers. Because LPP is designed for vectorial data, it may discard structural information when it is applied to multidimensional data. Besides, it assumes the dimension of the data to be smaller than the number of instances, which is not suitable for high-dimensional data. For high-dimensional data analysis, the tensor-train decomposition has been shown to capture spatial relations efficiently and effectively. Thus, we propose a tensor-train parameterization for ultra dimensionality reduction (TTPUDR) in which the traditional LPP mapping is tensorized in terms of tensor-trains and the LPP objective is replaced with the Frobenius norm to increase the robustness of the model. A manifold optimization technique is used to solve the new model. The performance of TTPUDR is assessed on classification problems, where TTPUDR significantly outperforms earlier methods and several state-of-the-art methods.

Index Terms:

tensor, high-dimensional data, dimensionality reduction, locality preserving projections, robustness

I Introduction

Ultra high-dimensional data have attracted great attention from both academia and industry and are common in computer vision [1], recommender systems [2], signal processing [3] and neuroscience [4]. In many cases, high-dimensional data are converted from so-called multi-dimensional data, commonly referred to as tensors or multi-way arrays. A great amount of research scrutinizes the information in tensors. To avoid the curse of dimensionality in data-driven learning, dimensionality reduction that takes the tensorial structure into account has attracted great interest in the literature [5, 6, 7].

Many methods explore the information in tensors through tensor decomposition. A group of them presume and maintain spatial structures in tensors. Three classical methods are the CANDECOMP/PARAFAC (CP) decomposition [8], the Tucker decomposition [9] and the tensor-train (TT) decomposition [10]. The TT decomposition offers the most compact representation by decomposing an $n$-order tensor into a chain of $n$ 3-order core tensors. Compared with the others, it has lower storage complexity with acceptable accuracy. Given this compactness, the TT decomposition can avoid the curse of dimensionality and is thus more appropriate for the analysis of higher-mode tensors or ultra-dimensional vectors.

Although tensor decomposition methods, especially the TT decomposition, are relatively efficient at forming a tensor subspace with sufficient spatial relational information for high-mode tensors, the computational and storage cost of including redundant information in the data is also an issue. A number of dimensionality reduction and feature extraction methods have been proposed and implemented in the past decades. Two of the most powerful, renowned and classical ones are principal component analysis (PCA) [11] and locality preserving projections (LPP) [12]. Yet PCA is significantly sensitive to outliers and focuses more on global information. LPP, on the contrary, is concerned with the local information of the data and is less sensitive to outliers than PCA, as it minimizes the squared Frobenius norm of the distance between the data in the lower-dimensional space. Nevertheless, LPP is still evidently affected by extreme outliers. Therefore, a more robust dimensionality reduction or feature extraction method should be implemented. It is well known that the $\ell_{1}$-norm is robust to outliers, including extreme ones, and it has been applied in both PCA and LPP. However, the $\ell_{1}$-norm is not differentiable everywhere. Furthermore, minimizing an $\ell_{1}$-norm objective function with respect to a matrix variable in substance minimizes each element or column vector of this matrix separately. Thus, the spatial relation is not considered sufficiently. An approximation to the $\ell_{1}$-norm is the Frobenius norm. When minimizing the Frobenius norm objective function, all the components are treated as a whole group, and the spatial relations between the components are thus adequately preserved and analyzed.

Most of the existing dimensionality reduction methods are applied to high/multi-dimensional data by vectorizing them. This vectorization enlarges the parameter space of the algorithms and neglects the spatial relational information existing in multi-way data. This motivates dimensionality reduction methods that embed a tensor subspace. There are already a small number of attempts to embed the tensor subspace into low-dimensional spaces. For example, Tucker LPP (TLPP) [6] embeds the tensor subspace based on the Tucker decomposition into the low-dimensional space under the LPP criterion. The local relation is sufficiently captured, but the accuracy deteriorates owing to the sensitivity to extreme outliers, and the computational cost increases exponentially.

In this paper, we propose a dimensionality reduction method with the TT subspace embedded, based on the Frobenius norm to measure the distance. We name our method tensor-train parameterization for ultra dimensionality reduction (TTPUDR). It enables the spatial relational information in the tensor to be efficiently and effectively processed and scrutinized, especially when the tensor has a large number of modes or dimensions. Even for extreme outliers, the results in terms of accuracy and storage efficiency still appear to be satisfactory. In particular, the storage efficiency is higher than that of existing dimensionality reduction methods such as PCA, LPP and TLPP. The main contributions of the paper are the following:

1. The proposed TTPUDR is the first method intended to fill the research gap mentioned above. The embedded TT subspace can preserve spatial relations in multi/high-dimensional data and achieve lower storage complexity than the Tucker-based subspace in [6].

2. We propose to use the Frobenius norm (F-norm) in the tensor-train LPP (TTLPP) objective function to greatly reduce the sensitivity to outliers, especially extreme ones, and to consider the spatial relations sufficiently.

3. An efficient algorithm is proposed so that TTPUDR remains executable for ultra-dimensional data. This is a significant improvement over the approximated pseudo PCA implemented in the state-of-the-art dimensionality reduction method, tensor train neighborhood preserving embedding (TTNPE) [7].

4. A number of numerical experiments have been conducted on several real-world datasets. The performance of TTPUDR on these datasets is consistent with the stated contributions and advantages.

II Related Work

As mentioned above, a great number of tensor decomposition methods investigate spatial relations in multi-dimensional data, i.e., tensors [8, 9, 10, 13, 14]. The tensor-train (TT) decomposition is the most efficient and effective of the three classical methods above.

To preserve spatial information within tensors in dimensionality reduction, [6] introduces the Tucker LPP (TLPP), which is LPP based on the Tucker decomposition for analyzing high-dimensional data. Its storage complexity increases exponentially with the number of modes.

The other existing dimensionality reduction method that embeds the TT subspace is the tensor train neighborhood preserving embedding (TTNPE) [7]. TTNPE avoids the exponential growth in complexity as the number of modes increases. However, its robustness to extreme outliers remains a concern. Therefore, there is a need for a dimensionality reduction method for tensors with a large number of modes or dimensions that is built on the TT subspace and is less sensitive to extreme outliers. Our method, TTPUDR, is developed to address all of these aspects.

II-A Preliminaries

Before introducing the TT decomposition and LPP, we specify the basic definitions, notations and tensor operations. In this paper, we do not distinguish between the dimensions of a tensor and its modes. A classic vector is a tensor of mode 1, i.e., a 1-order tensor. Similarly, a matrix is a tensor of mode 2, i.e., a 2-order tensor, and a 3-order tensor can be viewed as a data cube with three modes.

Following convention, we denote scalars by lower-case letters, such as $a$; vectors by bold lower-case letters, for instance, $\mathbf{x}$; and matrices by bold capital letters, for example, $\mathbf{S}$. They are all examples of tensors. In general, we use calligraphic capital letters to denote tensors, e.g., $\mathscr{X}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{n}}$ is an $n$-order tensor of dimension $I_{i}$ at mode $i$.

Tensor contraction is defined as the multiplication of tensors along their compatible modes. Let $\mathscr{X}\in\mathbb{R}^{I_{1}\times I_{2}\times I_{3}\times\cdots\times I_{n}}$ and $\mathscr{Y}\in\mathbb{R}^{J_{1}\times J_{2}\times J_{3}\times\cdots\times J_{m}}$. The tensor contraction is defined as

$$\mathscr{Z}=\mathscr{X}\times_{\tilde{p}}^{\tilde{q}}\mathscr{Y}\tag{1}$$

where $\tilde{p}\subseteq p=\{1,\cdots,n\}$ and $\tilde{q}\subseteq q=\{1,\cdots,m\}$ are subsets satisfying $\tilde{p}=\{k|I_{k}=J_{k}\}$ and $\tilde{q}=\{k|I_{k}=J_{k}\}$, respectively. The tensor contraction merges two tensors along the modes of equal size, and $\mathscr{Z}\in\mathbb{R}^{\times_{k\in\tilde{p}^{c}}I_{k}\times_{k\in\tilde{q}^{c}}J_{k}}$.
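As an illustration of the contraction defined above, the following minimal NumPy sketch contracts two small tensors along modes of equal size. The shapes and the choice of contracted modes are hypothetical and only serve the example; `np.tensordot` takes explicit (0-based) mode lists, which is slightly more general than the index-matching convention in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))   # modes of size 4, 5, 6
Y = rng.standard_normal((5, 6, 3))   # modes of size 5, 6, 3

# Contract modes 2 and 3 of X with modes 1 and 2 of Y (0-based axes below).
# The remaining modes (size 4 from X, size 3 from Y) form the result.
Z = np.tensordot(X, Y, axes=([1, 2], [0, 1]))

# The same contraction written as an explicit index sum.
Z_einsum = np.einsum("abc,bcd->ad", X, Y)
```

The result keeps only the uncontracted modes of both operands, matching the statement that the size of $\mathscr{Z}$ is determined by the complements of $\tilde{p}$ and $\tilde{q}$.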

We denote the left unfolding operation [7] of $\mathscr{X}\in\mathbb{R}^{I_{1}\times I_{2}\times I_{3}\times\cdots\times I_{n}\times R_{n}}$ as the matrix $\mathbf{L}(\mathscr{X})\in\mathbb{R}^{I_{1}I_{2}I_{3}\cdots I_{n}\times R_{n}}$, where the last mode of the tensor indexes the columns of the left unfolding matrix and the remaining modes index the rows. Similarly, the right unfolding operation is denoted by $\mathbf{R}(\mathscr{X})\in\mathbb{R}^{I_{1}\times I_{2}\cdots I_{n}R_{n}}$. The vectorization of a tensor is denoted by $\mathbf{V}(\mathscr{X})\in\mathbb{R}^{I_{1}I_{2}\cdots I_{n}R_{n}}$. The F-norm of a tensor can be defined as the $\ell_{2}$-norm of its vectorization, i.e., $\|\mathscr{X}\|_{F}=\|\mathbf{V}(\mathscr{X})\|_{2}=\sqrt{\sum_{i_{1}=1}^{I_{1}}\sum_{i_{2}=1}^{I_{2}}\cdots\sum_{i_{n}=1}^{I_{n}}\sum_{r_{n}=1}^{R_{n}}x_{i_{1},i_{2},\cdots,i_{n},r_{n}}^{2}}$, which considers all the elements $x_{i_{1},i_{2},\cdots,i_{n},r_{n}}$, $i_{1}=1,\cdots,I_{1},\cdots,i_{n}=1,\cdots,I_{n},r_{n}=1,\cdots,R_{n}$, as an entire group and preserves the general spatial relations between elements. In contrast, the $\ell_{1}$-norm of a tensor is computed as $\|\mathscr{X}\|_{1}=\|\mathbf{V}(\mathscr{X})\|_{1}=\sum_{i_{1}=1}^{I_{1}}\sum_{i_{2}=1}^{I_{2}}\cdots\sum_{i_{n}=1}^{I_{n}}\sum_{r_{n}=1}^{R_{n}}|x_{i_{1},i_{2},\cdots,i_{n},r_{n}}|$, which treats each element separately and may cause a loss of spatial information.
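The unfolding and norm definitions above can be sketched in a few lines of NumPy. The function names are hypothetical, and the row-major element ordering used by `reshape` is an assumption (the text does not fix an ordering); the two norms do not depend on that choice.

```python
import numpy as np

def left_unfold(T):
    """L(T): the last mode indexes columns, all other modes are merged into rows."""
    return T.reshape(-1, T.shape[-1])

def right_unfold(T):
    """R(T): the first mode indexes rows, all other modes are merged into columns."""
    return T.reshape(T.shape[0], -1)

def vec(T):
    """V(T): vectorization of the tensor."""
    return T.reshape(-1)

T = np.arange(24, dtype=float).reshape(2, 3, 4)

fro = np.linalg.norm(vec(T), 2)   # F-norm as the l2-norm of the vectorization
l1 = np.abs(vec(T)).sum()         # l1-norm as the sum of absolute entries
```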

II-B Tensor-Train Decomposition

The tensor-train (TT) decomposition is designed for large-scale data analysis [10]. It can achieve a simpler implementation than the tree-type decomposition algorithms [15] which are developed to reduce the storage complexity and avoid the local minima.

The TT decomposition assumes a special structure of a tensor subspace where an $n$ -order tensor is expressed as the contraction of a series of $n$ 3-order tensors. Specifically speaking, any element of an $n$ -order tensor $\mathscr{Y}\in\mathbb{R}^{I_{1}\times I_{2}\times I_{3}\times\cdots\times I_{n}}$ is formed as follows,

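$$\mathscr{Y}(i_{1},i_{2},\cdots,i_{n})=\mathscr{U}_{1}(:,i_{1},:)\,\mathscr{U}_{2}(:,i_{2},:)\cdots\mathscr{U}_{n-1}(:,i_{n-1},:)\,\mathscr{U}_{n}(:,i_{n},:)\tag{2}$$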

where $\mathscr{U}_{1}\in\mathbb{R}^{1\times I_{1}\times R_{1}}$ , $\mathscr{U}_{k}\in\mathbb{R}^{R_{k-1}\times I_{k}\times R_{k}}$ ( $1<k<n$ ), and $\mathscr{U}_{n}\in\mathbb{R}^{R_{n-1}\times I_{n}\times 1}$ . $R_{k}$ ( $k=1,2,\cdots,n-1$ ) are the tensor ranks. Let $R=\max\{R_{1},R_{2},\cdots,R_{n-1}\}$ and $I=\max\{I_{1},I_{2},\cdots,I_{n}\}$ . Thus, the storage complexity is $\mathcal{O}(nIR^{2})$ for the TT decomposition.
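To make the TT format concrete, here is a small NumPy sketch that evaluates one tensor element as a product of core slices and compares it with the fully contracted tensor. The mode sizes and ranks are hypothetical, and the helper names are introduced only for this illustration.

```python
import numpy as np

def tt_element(cores, index):
    """Evaluate Y(i_1, ..., i_n) as the product of the slices U_k(:, i_k, :)."""
    out = np.ones((1, 1))
    for U, i in zip(cores, index):
        out = out @ U[:, i, :]
    return out[0, 0]

def tt_full(cores):
    """Contract all cores into the full tensor (only feasible for small sizes)."""
    full = cores[0]
    for U in cores[1:]:
        # axes=1 contracts the last mode of `full` with the first mode of U
        full = np.tensordot(full, U, axes=1)
    return full[0, ..., 0]  # drop the boundary ranks R_0 = R_n = 1

rng = np.random.default_rng(0)
sizes, ranks = (3, 4, 5), (1, 2, 3, 1)   # hypothetical I_k and R_k
cores = [rng.standard_normal((ranks[k], sizes[k], ranks[k + 1]))
         for k in range(len(sizes))]

Y = tt_full(cores)
idx = (2, 1, 4)
n_params = sum(U.size for U in cores)    # stored numbers in the TT format
```

In this toy case the cores hold 45 numbers while the full tensor has 60 entries; the gap widens quickly as the number of modes grows, in line with the $\mathcal{O}(nIR^{2})$ storage complexity.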

For most applications, to achieve computational efficiency and reduce information redundancy, researchers often restrict the tensor ranks to be smaller than the size of the corresponding tensor mode, i.e., $R_{k}<I_{k}$ for $k=1,2,\cdots,n-1$ [7].

II-C Locality Preserving Projections

Locality preserving projections (LPP) [12] aims to explore and preserve the local information of data in the projected lower-dimensional space, while the conventional principal component analysis (PCA) [11] favours maintaining the global information in data.

Given a set of vectorial training data $\{\mathbf{x}_{i}\}^{N}_{i=1}\subset\mathbb{R}^{P}$ and an affinity matrix of locality similarity $\mathbf{S}=[s_{ij}]$, LPP seeks a linear projection $\mathbf{A}$ from $\mathbb{R}^{P}$ to $\mathbb{R}^{p}$ by solving the following optimization problem, whose objective function is the locality preserving criterion.

$$\min_{\mathbf{A}}\sum_{i,j}\|\mathbf{A}^{\top}\mathbf{x}_{i}-\mathbf{A}^{\top}\mathbf{x}_{j}\|_{2}^{2}\,s_{ij}\quad\text{s.t.}\quad\mathbf{A}^{\top}\mathbf{X}\mathbf{D}\mathbf{X}^{\top}\mathbf{A}=\mathbf{I}\tag{3}$$

The widely used affinity $\mathbf{S}=[s_{ij}]$ is based on the graph of the neighborhood information in the data as follows [12].

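$$s_{ij}=\begin{cases}e^{-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{F}^{2}}{t}},&\text{if }\mathbf{x}_{i}\in\mathcal{N}_{k}(\mathbf{x}_{j})\text{ or }\mathbf{x}_{j}\in\mathcal{N}_{k}(\mathbf{x}_{i}),\\0,&\text{otherwise,}\end{cases}$$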

where $t\in\mathbb{R}_{+}$ is a positive parameter and $\mathcal{N}_{k}(\mathbf{x})$ denotes the $k$ -nearest neighborhood of $\mathbf{x}$ .
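The neighborhood-based affinity can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name is hypothetical, samples are stored as columns, and a sample is assumed not to count as its own neighbor (this has no effect on the LPP objective because the corresponding distance is zero).

```python
import numpy as np

def heat_kernel_affinity(X, k, t):
    """X is P x N (one sample per column). Returns the N x N affinity S."""
    N = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    # Pairwise squared distances, clipped at zero against round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    # Indices of the k nearest neighbours of each sample, excluding itself.
    order = np.argsort(d2 + np.diag(np.full(N, np.inf)), axis=1)[:, :k]
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), k), order.ravel()] = True
    mask = mask | mask.T  # x_i in N_k(x_j) OR x_j in N_k(x_i)
    return np.where(mask, np.exp(-d2 / t), 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))
S = heat_kernel_affinity(X, k=3, t=5.0)
```

The "or" condition makes $\mathbf{S}$ symmetric, which the generalized eigenvalue formulation of LPP relies on.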

Denote $\mathbf{X}=[\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N-1},\mathbf{x}_{N}]$. The LPP problem (3) can be converted to the following generalized eigenvalue problem for the eigenvalues $\lambda$ and eigenvectors $\mathbf{a}$.

$$\mathbf{X}\mathbf{L}\mathbf{X}^{\top}\mathbf{a}=\lambda\mathbf{X}\mathbf{D}\mathbf{X}^{\top}\mathbf{a}\tag{4}$$

where $\mathbf{L}=\mathbf{D}-\mathbf{S}$ and $\mathbf{D}$ is a diagonal matrix consisting of the row sums of $\mathbf{S}$. The columns of the final mapping $\mathbf{A}$ are the generalized eigenvectors $\mathbf{a}$ in Equation (4) corresponding to the $p$ smallest eigenvalues $\lambda$.
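A minimal sketch of solving the generalized eigenvalue problem (4) with NumPy alone is given below. It reduces the problem to a standard symmetric one through a Cholesky factor of $\mathbf{X}\mathbf{D}\mathbf{X}^{\top}$, which is one common approach and not necessarily how the cited implementations proceed. The function name, the toy data and the dense heat-kernel affinity are assumptions for illustration. The sketch requires $\mathbf{X}\mathbf{D}\mathbf{X}^{\top}$ to be positive definite, which is exactly the situation where the data dimension does not exceed the number of samples.

```python
import numpy as np

def lpp(X, S, p):
    """Solve X L X^T a = lambda X D X^T a and keep the p smallest eigenpairs."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    A_mat = X @ L @ X.T
    B_mat = X @ D @ X.T               # must be positive definite (needs P <= N)
    C = np.linalg.cholesky(B_mat)     # B_mat = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ A_mat @ C_inv.T       # standard symmetric eigenproblem
    M = 0.5 * (M + M.T)               # symmetrize against round-off
    w, V = np.linalg.eigh(M)          # eigenvalues in ascending order
    A = C_inv.T @ V[:, :p]            # map back: a = C^{-T} y
    return A, w[:p]

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 30))      # P = 4 features, N = 30 samples
sq = np.sum(X ** 2, axis=0)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
S = np.exp(-d2 / 4.0)                 # dense affinity, for illustration only
np.fill_diagonal(S, 0.0)

A, lam = lpp(X, S, p=2)
```

By construction the returned $\mathbf{A}$ satisfies the constraint $\mathbf{A}^{\top}\mathbf{X}\mathbf{D}\mathbf{X}^{\top}\mathbf{A}=\mathbf{I}$ of Problem (3).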

LPP is a classical dimensionality reduction method and has been applied in many real cases, for example, computer vision [16]. It captures the local information among the data points and is less sensitive to outliers than PCA. However, we observe the following shortcomings of LPP:

1. LPP is designed for vectorial data. When it is applied to multi-dimensional data, i.e., tensors, spatial information may be lost. The existing tensor locality preserving projections, i.e., the Tucker LPP (TLPP) [6], embeds the tensor space with a high storage complexity of $\mathcal{O}(nIR+R^{n})$.

2. Theoretically, LPP cannot work when the data dimension is greater than the number of samples. Although this can be avoided by a trick in which one first projects the data onto its PCA subspace and then implements LPP in this subspace (see http://www.cad.zju.edu.cn/home/dengcai/Data/code/LPP.m), this does not work well for ultra-dimensional data with a fairly large dataset, as the singular value decomposition (SVD) becomes a bottleneck.

The TT decomposition with a smaller storage complexity at $\mathcal{O}(nIR^{2})$ has been recently applied in the tensor train neighborhood preserving embedding (TTNPE) [7, 17]. Nevertheless, the actual algorithm in TTNPE is only implemented as a TT approximation to the pseudo PCA. To the best of our knowledge, there is no existing dimensionality reduction method which can directly process the tensor data with less storage complexity, i.e., using the TT decomposition in algorithms.

III Methodology

In this section, we propose the tensor-train parameterization for ultra dimensionality reduction (TTPUDR) to fill the research gap described in Section I. The learning procedure is presented in detail and summarized in pseudo code.

Consider a tensor-train (TT) $\widetilde{\mathscr{U}}=\mathscr{U}_{1}\times^{1}_{3}\mathscr{U}_{2}\times^{1}_{3}\cdots\times^{1}_{3}\mathscr{U}_{n}$ where $\mathscr{U}_{k}\in\mathbb{R}^{R_{k-1}\times I_{k}\times R_{k}}$ , $R_{0}=1$ and $R_{n}=I_{1}I_{2}\cdots I_{n}$ . For a given set of tensor data $\{\mathscr{X}_{i}\}^{N}_{i=1}\subset\mathbb{R}^{I_{1}\times I_{2}\cdots\times I_{n}}$ , we project $\mathscr{X}_{i}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{n}}$ to the vector $\mathbf{t}_{i}\in\mathbb{R}^{R_{n}}$ by a TT parameterized mapping defined as,

$$\mathbf{t}_{i}=\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{i}),$$

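where $R_{n}$ now is the number of components or the dimension of $\mathscr{X}_{i}$. Denote by $\mathbf{S}=[s_{ij}]$ the similarity based on the graph of the neighborhood of tensor data, which may be defined as in LPP [12], introduced in Section II. To increase the model robustness towards extreme data outliers and preserve the spatial relations, we design TTPUDR by modifying the LPP formulation into the following optimization problem, which uses the Frobenius norm as the objective function instead of the squared Frobenius norm or the $\ell_{1}$-norm,

$$\min_{\mathscr{U}_{1},\mathscr{U}_{2},\cdots,\mathscr{U}_{n}}\frac{1}{2}\sum_{i,j=1}^{N}\|\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{i})-\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{j})\|_{F}\,s_{ij}\quad\text{s.t.}\quad\mathbf{L}^{\top}(\mathscr{U}_{k})\mathbf{L}(\mathscr{U}_{k})=\mathbf{I}_{R_{k}},\ \forall k=1,\cdots,n.\tag{5}$$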

The TT decomposition based parameterization for the mapping tensor can preserve or learn the spatial relation in tensor data $\mathscr{X}_{i}$ . However, using the F-norm in Problem (5) makes it more difficult to solve the problem of TTPUDR.
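The TT-parameterized mapping and the F-norm objective can be sketched as follows. This is a small NumPy illustration under stated assumptions: the helper names are hypothetical, the row-major vectorization is an assumed convention, the factor $\frac{1}{2}$ follows the squared-norm form of the problem, and forming the full mapping matrix is only practical for tiny examples (avoiding it is the point of the algorithm described later).

```python
import numpy as np

def tt_mapping_matrix(cores):
    """E = L(U_1 x U_2 x ... x U_n), of size (I_1 ... I_n) x R_n, with R_0 = 1."""
    full = cores[0]
    for U in cores[1:]:
        full = np.tensordot(full, U, axes=1)
    return full.reshape(-1, full.shape[-1])

def project(cores, X):
    """X has shape N x I_1 x ... x I_n. Returns T (N x R_n) whose rows are t_i."""
    E = tt_mapping_matrix(cores)
    return X.reshape(X.shape[0], -1) @ E

def fnorm_objective(T, S):
    """0.5 * sum_ij ||t_i - t_j||_F * s_ij (non-squared norm)."""
    diff = T[:, None, :] - T[None, :, :]
    return 0.5 * np.sum(np.linalg.norm(diff, axis=2) * S)

rng = np.random.default_rng(0)
U1 = rng.standard_normal((1, 2, 2))   # R_0 x I_1 x R_1
U2 = rng.standard_normal((2, 3, 4))   # R_1 x I_2 x R_2, so R_n = 4
X = rng.standard_normal((5, 2, 3))    # N = 5 samples of size 2 x 3
S = np.ones((5, 5)) - np.eye(5)       # toy affinity

E = tt_mapping_matrix([U1, U2])
T = project([U1, U2], X)
obj = fnorm_objective(T, S)
```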

We propose to solve the problem by splitting and iteration. For this purpose, we define

$$\widetilde{s}_{ij}=\frac{s_{ij}}{\|\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{i})-\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{j})\|_{F}}\tag{6}$$

which is a function of the tensor cores $\mathscr{U}_{1},\mathscr{U}_{2},\cdots,\mathscr{U}_{n}$. Then we rewrite Problem (5) in terms of the squared F-norm as follows

$$\min_{\mathscr{U}_{1},\mathscr{U}_{2},\cdots,\mathscr{U}_{n}}\frac{1}{2}\sum_{i,j=1}^{N}\|\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{i})-\mathbf{L}^{\top}(\widetilde{\mathscr{U}})\mathbf{V}(\mathscr{X}_{j})\|_{F}^{2}\,\widetilde{s}_{ij}\quad\text{s.t.}\quad\mathbf{L}^{\top}(\mathscr{U}_{k})\mathbf{L}(\mathscr{U}_{k})=\mathbf{I}_{R_{k}},\ \forall k=1,\cdots,n.\tag{7}$$

Problem (7) looks like an LPP problem. However, it is not, because the modified affinity $\widetilde{s}_{ij}$ is a function of the parameters $\{\mathscr{U}_{1},\mathscr{U}_{2},\cdots,\mathscr{U}_{n}\}$. We solve it in the following way. Suppose Problem (7) is being solved by an iterative optimization algorithm. We use the current parameter values to calculate $\widetilde{s}_{ij}$ according to Equation (6) and then fix all $\widetilde{s}_{ij}$ to solve Problem (7). This alternating procedure continues until convergence.

To efficiently solve Problem (7) with $\widetilde{s}_{ij}$ fixed, we follow an alternating procedure that solves for each tensor core $\mathscr{U}_{k}$ while the rest are fixed. Overall, we solve for the TT parameters, i.e., the tensor cores, and update the neighborhood graph $\tilde{\mathbf{S}}$ alternately. This learning procedure terminates when the solution converges.
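The reweighting step in Equation (6) can be illustrated with a short NumPy sketch that also checks numerically that the squared-norm objective with the modified affinity equals the non-squared objective with the original affinity. The function name is hypothetical, and the small constant `eps` guarding against division by zero is an assumption added here; the text does not say how coincident projections are handled.

```python
import numpy as np

def reweight(T, S, eps=1e-12):
    """Return the modified affinity S_tilde = S / ||t_i - t_j||_F and the distances.

    T is N x R_n with rows t_i. `eps` avoids division by zero for coincident
    points (an assumption of this sketch, not taken from the paper).
    """
    dist = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=2)
    return S / np.maximum(dist, eps), dist

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 3))       # current projections t_i
S = rng.random((6, 6))
S = 0.5 * (S + S.T)                   # symmetric toy affinity
np.fill_diagonal(S, 0.0)

S_tilde, dist = reweight(T, S)
squared_form = 0.5 * np.sum(dist ** 2 * S_tilde)   # objective of Problem (7)
plain_form = 0.5 * np.sum(dist * S)                # objective of Problem (5)
```

In an outer loop, one would recompute `S_tilde` from the current cores, solve the squared-norm problem with `S_tilde` held fixed, and repeat until the objective stops changing.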

In optimizing each tensor core $\mathscr{U}_{k}$, we find that the strategy in [17] involves manipulating a matrix $\mathbf{Z}\in\mathbb{R}^{I_{1}I_{2}\cdots I_{n}\times I_{1}I_{2}\cdots I_{n}}$, which is prohibitive when the data are ultra-dimensional or high-order tensors. By exploiting the commutative property of the tensor contraction operation, we propose a new strategy which largely speeds up the calculation.

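To describe the new algorithm, we define

$$\mathscr{T}_{1}(k)=\mathscr{U}_{1}\times_{3}^{1}\cdots\times_{3}^{1}\mathscr{U}_{k-1}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{k-1}\times R_{k-1}},$$

$$\mathscr{T}_{n}(k)=\mathscr{U}_{k+1}\times_{3}^{1}\cdots\times_{3}^{1}\mathscr{U}_{n}\in\mathbb{R}^{R_{k}\times I_{k+1}\times\cdots\times I_{n}\times R_{n}},$$

where $1\leq k\leq n$ but $\mathscr{T}_{1}(1)$ and $\mathscr{T}_{n}(n)$ are not defined. Let $\mathscr{X}$ be the $(n+1)$-order data tensor whose mode-$(n+1)$ stacks along the data samples, i.e., $\mathscr{X}\in\mathbb{R}^{I_{1}\times I_{2}\times\cdots\times I_{n}\times N}$. Then define the partially transformed tensor, for $1<k<n$ of size $R_{k-1}\times R_{k}\times R_{n}\times I_{k}\times N$,

$$\mathscr{Y}_{k}=\left(\mathscr{X}\times_{1,2,\ldots,k}^{1,2,\ldots,k}\mathscr{T}_{1}(k)\right)\times_{2,3,\ldots,n+1-k}^{2,3,\ldots,n+1-k}\mathscr{T}_{n}(k),$$

and, for $k=1$,

$$\mathscr{Y}_{1}=\mathscr{X}\times_{2,\ldots,n}^{2,\ldots,n}\mathscr{T}_{n}(1)\in\mathbb{R}^{R_{1}\times R_{n}\times I_{1}\times N},$$

and, for $k=n$,

$$\mathscr{Y}_{n}=\mathscr{X}\times_{1,\ldots,n-1}^{1,\ldots,n-1}\mathscr{T}_{1}(n)\in\mathbb{R}^{R_{n-1}\times I_{n}\times N}.$$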

Finally, the optimization problem (7) for TTPUDR is transformed to the following subproblems, respectively:

Solving for $\mathscr{U}_{1}$ : For each $1\leq r_{n}\leq R_{n}$ , take the slice $\mathscr{Y}_{1}(:,r_{n},:,:)$ and reshape it as a matrix $\mathbf{Y}_{1}(r_{n})$ of size $(R_{1}I_{1})\times N$ , and form the matrix $\mathbf{H}_{1}=\sum^{R_{n}}_{r_{n}=1}\mathbf{Y}_{1}(r_{n})\widetilde{\mathbf{L}}\mathbf{Y}_{1}(r_{n})^{\top}$ of size $(R_{1}I_{1})\times(R_{1}I_{1})$ . Then $\mathscr{U}_{1}$ is solved by

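$$\min_{\mathscr{U}_{1}}\mathbf{V}(\mathscr{U}_{1})^{\top}\mathbf{H}_{1}\mathbf{V}(\mathscr{U}_{1}),\quad\text{s.t.}\quad\mathbf{L}^{\top}(\mathscr{U}_{1})\mathbf{L}(\mathscr{U}_{1})=\mathbf{I}_{R_{1}}.\tag{10}$$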

Solving for $\mathscr{U}_{k}$ ( $1<k<n$ ): For each $1\leq r_{n}\leq R_{n}$ , take the slice $\mathscr{Y}_{k}(:,:,r_{n},:,:)$ and reshape it as a matrix $\mathbf{Y}_{k}(r_{n})$ of size $(R_{k-1}I_{k}R_{k})\times N$ , and form the matrix $\mathbf{H}_{k}=\sum^{R_{n}}_{r_{n}=1}\mathbf{Y}_{k}(r_{n})\widetilde{\mathbf{L}}\mathbf{Y}_{k}(r_{n})^{\top}$ of size $(R_{k-1}I_{k}R_{k})\times(R_{k-1}I_{k}R_{k})$ . Then $\mathscr{U}_{k}$ is solved by

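$$\min_{\mathscr{U}_{k}}\mathbf{V}(\mathscr{U}_{k})^{\top}\mathbf{H}_{k}\mathbf{V}(\mathscr{U}_{k}),\quad\text{s.t.}\quad\mathbf{L}^{\top}(\mathscr{U}_{k})\mathbf{L}(\mathscr{U}_{k})=\mathbf{I}_{R_{k}}.\tag{11}$$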

Solving for $\mathscr{U}_{n}$ : Reshape $\mathscr{Y}_{n}$ to the matrix $\mathbf{Y}_{n}$ of size $(R_{n-1}I_{n})\times N$ , and form the matrix $\mathbf{H}_{n}=\mathbf{Y}_{n}\widetilde{\mathbf{L}}\mathbf{Y}^{\top}_{n}$ . Then solve $\mathscr{U}_{n}$ satisfying $\mathbf{L}^{\top}(\mathscr{U}_{n})\mathbf{L}(\mathscr{U}_{n})=\mathbf{I}_{R_{n}}$ by

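$$\min_{\mathscr{U}_{n}}\operatorname{trace}\left(\mathbf{L}^{\top}(\mathscr{U}_{n})\mathbf{H}_{n}\mathbf{L}(\mathscr{U}_{n})\right).\tag{12}$$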

Each problem in (10) – (12) is an optimization problem over Stiefel manifolds of small dimensions. They can be efficiently solved by a manifold optimization package such as ManOpt (http://www.manopt.org).

To sum up, the pseudo code for the entire learning process of TTPUDR is presented in Algorithm 1. Note that no theoretical proof of the convergence of TTPUDR is available yet, but the algorithm converges empirically, as shown in the experiments in Section IV.

Remark 1: We have added the orthogonality constraints $\mathbf{L}^{\top}(\mathscr{U}_{k})\mathbf{L}(\mathscr{U}_{k})=\mathbf{I}_{R_{k}}$ in Problems (10) - (12). These constraints ensure that the dimensionality reduction mapping $\mathbf{E}=\mathbf{L}(\mathscr{U}_{1}\times_{3}^{1}\mathscr{U}_{2}\times_{3}^{1}\cdots\times_{3}^{1}\mathscr{U}_{n})$ consists of orthogonal columns, by Lemma 2 in [7]. To ease the optimization on the Stiefel manifold in Problems (10) and (11), we can replace the orthogonality condition by $\mathbf{V}^{\top}(\mathscr{U}_{k})\mathbf{V}(\mathscr{U}_{k})=1$ ($1\leq k<n$), resulting in an eigenvalue problem. However, the overall orthogonality will then be lost.

Remark 2: Problem (12) is quite different from Problems (10) and (11). Problem (12) is equivalent to the eigenvalue problem of $\mathbf{H}_{n}$ .
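The eigenvalue view of Problem (12) can be sketched as follows: under the orthogonality constraint, the trace is minimized by the eigenvectors of $\mathbf{H}_{n}$ associated with its smallest eigenvalues. In this NumPy illustration the function name and the toy sizes are hypothetical, and $\widetilde{\mathbf{L}}$ is assumed to be the graph Laplacian $\widetilde{\mathbf{D}}-\widetilde{\mathbf{S}}$ of the modified affinity, by analogy with LPP.

```python
import numpy as np

def solve_last_core(H, r):
    """Minimize trace(L^T H L) subject to L^T L = I_r; L has shape (m, r)."""
    w, V = np.linalg.eigh(0.5 * (H + H.T))   # eigenvalues in ascending order
    return V[:, :r], w[:r].sum()

rng = np.random.default_rng(0)
R_prev, I_n, r, N = 2, 3, 2, 20              # hypothetical R_{n-1}, I_n, R_n, N
m = R_prev * I_n
Y = rng.standard_normal((m, N))              # reshaped partially transformed data
S = rng.random((N, N))
S = 0.5 * (S + S.T)                          # toy (modified) affinity
np.fill_diagonal(S, 0.0)
Lap = np.diag(S.sum(axis=1)) - S             # assumed Laplacian of the affinity
H = Y @ Lap @ Y.T

L, val = solve_last_core(H, r)
U_n = L.reshape(R_prev, I_n, r)              # fold back into a 3-order core

# The trace equals the pairwise form 0.5 * sum_ij s_ij ||L^T y_i - L^T y_j||^2.
Z = L.T @ Y
pairwise = 0.5 * np.sum(S * np.sum((Z[:, :, None] - Z[:, None, :]) ** 2, axis=0))
```

The last two lines check the link between the trace form and the pairwise-distance form of the objective, which holds for any symmetric affinity.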

Remark 3: The algorithm can be used for dimensionality reduction of ultra-dimensional vectorial data. For example, suppose that the dimension of the vector data is $D=I_{1}\times I_{2}\times\cdots\times I_{n}$; then we can seek the dimensionality reduction mapping in terms of the TT parameterization. This makes dimensionality reduction possible for ultra-dimensional data.
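As a rough illustration of Remark 3, the sketch below reshapes a vector into a tensor and compares the number of parameters in a TT-parameterized mapping with that of a dense projection matrix. The factorization $1024=4\times 8\times 8\times 4$ and the ranks are hypothetical choices made only for this example.

```python
import numpy as np

shape = (4, 8, 8, 4)                 # hypothetical factorization of D = 1024
ranks = (1, 3, 3, 3, 10)             # hypothetical R_0, ..., R_n with R_n = 10

x = np.arange(np.prod(shape), dtype=float)   # a vector of dimension D
X = x.reshape(shape)                          # the same data viewed as a tensor

# Parameters of the TT cores U_k of size R_{k-1} x I_k x R_k.
tt_params = sum(ranks[k] * shape[k] * ranks[k + 1] for k in range(len(shape)))
# Parameters of a dense D x R_n projection matrix.
dense_params = x.size * ranks[-1]
```

With these toy numbers the TT mapping needs 276 parameters against 10240 for the dense matrix.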

IV Experiments

To validate the proposed TTPUDR method, experiments on facial recognition and remote sensing are presented in this section. The results are compared with classical and related methods, i.e., PCA [11] and LPP [12]. All the experiments are conducted on a Windows 10 system with 128GB of memory and an Intel Core i7 6950X processor (25M cache, up to 3.50 GHz), using Matlab 2018a.

IV-A Data Description

The performance of the TTPUDR method is studied through numerical experiments on two high-dimensional datasets from two publicly available databases: the Extended Yale B [18] and the Northwest Indiana's Indian Pines, collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in 1992 [19]. The first two experiments are conducted on the original datasets from these two databases, whereas the third experiment investigates the robustness of TTPUDR to extreme outliers. Therefore, we add $10\%$ block noise to the Extended Yale B dataset.

The Extended Yale B dataset contains facial images of 38 individuals. Each individual has 9 poses and 64 near-frontal face images, resulting in a total of $21888$ images. Each image has been resized to $32\times 32$ pixels. After rearranging the data and removing the missing values, the final number of images is $2414$.

The Northwest Indiana's Indian Pines (Indiana) dataset is collected at the Indian Pines test site in North-western Indiana and contains $145\times 145$ pixels and $224$ spectral reflectance bands in the wavelength range $0.4$ to $2.5\times 10^{-6}$ meters. As for the Extended Yale B dataset, we choose $200$ spectral reflectance bands and $10366$ pixel locations by eliminating the missing values and the water absorption bands.

For the noised Extended Yale B dataset, we add block noise to $10\%$ and $20\%$ of the images, respectively. The noise values are set to either the minimum or the maximum value of the Extended Yale B dataset, i.e., either [math] or $255$, whereas the general pixel values range from $9$ to $115$. The noise is added to the images as $4\times 4$ blocks, in the manner of salt-and-pepper noise. The block locations are both predefined and random. This dataset is designed to examine the robustness of TTPUDR to extreme outliers.

To investigate the capability of capturing the spatial structure information, we use the first two of the three datasets above. In these two datasets (no noise added), $60\%$ of the data are used as the training set and $40\%$ as the test set. Then, to test the robustness and further scrutinize the ability of TTPUDR to process ultra high-dimensional data, we use only the third, noised dataset, where $60\%$ and $20\%$ of the data are used as the training set and $40\%$ and $80\%$ of the data as the test set, respectively. In the case of the noised dataset, the extreme outlier noise is added to $10\%$ and $20\%$ of the training data accordingly.

IV-B Benchmark and Comparison Criteria

The experiments are designed to evaluate the capability of TTPUDR to analyze structured high-dimensional data and its robustness to extreme outliers. We compare its performance with existing methods, namely PCA and LPP, in the cases where they are applicable. Note that we are unable to compare with TTNPE, since its publicly available program cannot be executed owing to its extreme computational complexity. The same issue exists for TLPP. For both PCA and LPP, we use the implementation in https://lvdmaaten.github.io/drtoolbox/.

For the classification performance, we use the data after dimensionality reduction as the new features of each object and fit a classifier. The 1-nearest neighbor (1NN) classifier is used in our experiments. The evaluation criteria are the overall accuracy (OA), the average accuracy (AA) and the Kappa coefficient (KC) for the number of reduced dimensions from 2 to 30, i.e., $R_{n}=2,\ldots,30$. Specifically, these criteria are computed as

[TABLE]

where $C$ is the total number of classes and $T$ is the number of test data points.
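For readers who want to reproduce the three criteria, the sketch below computes them from a confusion matrix. It uses the standard textbook definitions of overall accuracy, average per-class accuracy and Cohen's kappa, which are assumed (not confirmed) to coincide with the formulas used here; the helper names and the toy labels are illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """M[i, j] = number of test points of class i predicted as class j."""
    m = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

def oa_aa_kappa(m):
    """Return (OA, AA, kappa); assumes every class occurs in the test set."""
    m = m.astype(float)
    total = m.sum()                           # T: number of test points
    oa = np.trace(m) / total                  # overall accuracy
    per_class = np.diag(m) / m.sum(axis=1)    # accuracy within each class
    aa = per_class.mean()                     # average accuracy over C classes
    # Chance agreement estimated from the row and column marginals.
    pe = (m.sum(axis=1) * m.sum(axis=0)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy example with C = 3 classes and T = 10 test points.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]
m = confusion_matrix(y_true, y_pred, num_classes=3)
oa, aa, kappa = oa_aa_kappa(m)
```

In this toy case 8 of the 10 points are classified correctly, so OA is 0.8, while AA weights the three classes equally regardless of their sizes.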

For robustness to outliers, the evaluation criteria are the accuracy itself and the convergence speed of the accuracy under outlier proportions of $10\%$ and $20\%$. Furthermore, the convergence analysis is conducted on the four cases mentioned above, but only the case with the fastest convergence speed for TTPUDR is disclosed, and the three methods are compared across all the iterations at the corresponding number of features for TTPUDR.

IV-C Results and Findings

As mentioned above, the experiments on the Indiana dataset and the Extended Yale B dataset examine how well TTPUDR captures the spatial information in high-dimensional data. We also use the noised Extended Yale B dataset to examine the robustness of TTPUDR. In the first set of experiments, the dimension of the training data is smaller than the number of samples. The second set of experiments, on the noised Extended Yale B dataset, further evaluates this ability of TTPUDR on ultra high-dimensional data together with its robustness to extreme outliers.

IV-C1 Parameter Compression Capability

In the experiments on capturing spatial information, the dimension of the data is smaller than the number of samples in the training set. In other words, the assumption of LPP on the dimension size and the number of samples is not violated. Each method is executed on each dataset for 150 iterations, i.e., 10 shuffles of random samples with 15 iterations for each sample. First, the results for the Indiana dataset are presented in Table I.

For a fair comparison, the number of neighbors and the parameter $t$ are set to $4$ and $0.02$, respectively, to construct the locality similarity affinity matrix $\mathbf{S}$ for both LPP and TTPUDR. The sizes of the tensor cores in TTPUDR are $1\times 4\times 3$, $3\times 5\times 4$ and $4\times 10\times R_{n}$, with the number of features $R_{n}$ ranging from 2 to 30. The total number of model parameters, $72+40R_{n}$, therefore ranges from 152 to 1272, versus $200R_{n}$, i.e., 400 to 6000, for PCA and LPP. Here we only present the case $R_{n}=24$, which is randomly selected from 2 to 30. The values in each cell of the table are the means over the 10 random shuffles. As this dataset has a larger number of samples and a smaller number of dimensions, the proposed TTPUDR is less competitive than PCA and LPP. On average, the OA, AA and KC values under TTPUDR are 10% smaller than those under PCA and LPP.
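The parameter counts quoted for the tensor-train cores can be checked with a few lines of arithmetic. The helper names below are illustrative, and the dense count assumes that PCA and LPP each store a full $200\times R_{n}$ projection matrix.

```python
def tt_param_count(core_shapes):
    """Total number of entries over all tensor-train cores."""
    return sum(r_prev * n * r_next for r_prev, n, r_next in core_shapes)

def indiana_cores(r_n):
    """Core sizes stated for the Indiana dataset (4 * 5 * 10 = 200 inputs)."""
    return [(1, 4, 3), (3, 5, 4), (4, 10, r_n)]

def dense_param_count(input_dim, r_n):
    """Entries of a dense input_dim x r_n projection matrix."""
    return input_dim * r_n

for r_n in (2, 24, 30):
    print(r_n, tt_param_count(indiana_cores(r_n)), dense_param_count(200, r_n))
```

For $R_{n}=2$ and $R_{n}=30$ the tensor-train counts come out as 152 and 1272, and only the last core grows with $R_{n}$, which is why the count increases by 40 per added feature rather than by 200.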

The results of a similar experiment on the Extended Yale B dataset are reported in Table II.

To compare TTPUDR with LPP fairly, the number of neighbors and the heat kernel width parameter $t$ are set to $4$ and $0.5$, respectively, to construct the locality similarity affinity matrix $\mathbf{S}$ for both LPP and TTPUDR. The sizes of the tensor cores in TTPUDR are $1\times 4\times 4$, $4\times 8\times 7$, $7\times 4\times 4$ and $4\times 8\times R_{n}$, with the number of features $R_{n}$ ranging from 2 to 30. The total number of model parameters ranges from 416 to 1312, versus 2048 to 30720 for PCA and LPP. We present the case $R_{n}=28$, i.e., 28 reduced dimensions, which is chosen at random. The numbers in Table II are the best result of each method for each criterion, and the presented OA, AA and KC are the means across iterations. In this case TTPUDR performs better than both PCA and LPP: on average, the values under TTPUDR are at least 66% larger than those under LPP and PCA. This result is not surprising, as this dataset has a smaller sample size and a larger dimension than the Indiana dataset, which aligns with the ultra-dimensionality setting that TTPUDR targets.
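To make the roles of the neighbor count and the heat kernel width $t$ concrete, the following sketch builds a symmetric $k$-nearest-neighbor affinity matrix with weights $\exp(-\|x_i-x_j\|^2/t)$. This is the usual LPP construction and is assumed, not confirmed, to be the exact form used here; the function name and the random data are illustrative.

```python
import numpy as np

def heat_kernel_affinity(x, k=4, t=0.5):
    """Symmetric k-nearest-neighbour affinity with heat-kernel weights.

    x: array of shape (N, d), one sample per row.
    """
    sq_norms = (x ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]
    mask = np.zeros_like(d2, dtype=bool)
    rows = np.repeat(np.arange(x.shape[0]), k)
    mask[rows, nn.ravel()] = True
    mask |= mask.T                    # keep an edge if either end selects it
    s = np.zeros_like(d2)
    s[mask] = np.exp(-d2[mask] / t)   # heat-kernel weight on retained edges
    return s

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))          # synthetic samples, not the face data
s = heat_kernel_affinity(x, k=4, t=0.5)
```

A smaller $t$ makes the weights decay faster with distance, so the same neighbor graph carries sharper locality information.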

This set of experiments demonstrates that TTPUDR uses far fewer model parameters to achieve comparable performance on the classification tasks.
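The parameter saving comes from storing the mapping as a chain of small cores instead of one dense matrix. As a minimal sketch, the code below contracts randomly initialized cores with the Extended Yale B sizes into the equivalent dense $1024\times R_{n}$ matrix and applies it to a vectorized image. The function name, the random cores and the row-major vectorization order are assumptions for illustration; trained cores would replace the random ones.

```python
import numpy as np

def tt_to_matrix(cores):
    """Contract TT cores of shape (r_{k-1}, n_k, r_k) into a dense matrix.

    The first rank must be 1; the last rank is the reduced dimension R_n.
    Returns an array of shape (n_1 * ... * n_K, R_n).
    """
    result = cores[0]                              # shape (1, n_1, r_1)
    for core in cores[1:]:
        # Merge the running tensor with the next core over the shared rank.
        result = np.tensordot(result, core, axes=([-1], [0]))
    result = result[0]                             # drop the leading rank-1 axis
    return result.reshape(-1, result.shape[-1])

r_n = 28
shapes = [(1, 4, 4), (4, 8, 7), (7, 4, 4), (4, 8, r_n)]
rng = np.random.default_rng(0)
cores = [rng.normal(size=s) for s in shapes]       # random stand-in cores

u = tt_to_matrix(cores)            # dense 1024 x 28 mapping
x = rng.normal(size=1024)          # stand-in for a vectorised 32 x 32 image
y = u.T @ x                        # 28-dimensional reduced representation
n_tt = sum(c.size for c in cores)  # parameters actually stored
```

Here the cores hold 1248 numbers while the dense matrix they represent has 28672 entries. In practice the dense matrix need not be formed at all, since the cores can be contracted with the reshaped input one mode at a time.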

IV-C2 Robustness

Following the parameter compression capability, we examine the robustness of TTPUDR with the noised Extended Yale B dataset. The results are reported in Figure 1. For simplicity, we present only OA for TTPUDR, LPP and PCA across dimensions, i.e., numbers of features from 2 to 30, since all three methods perform better on this evaluation criterion than on the other criteria.

Figures 1a and 1b show the performance of TTPUDR, LPP and PCA when $60\%$ of the data are used for training, with $10\%$ and $20\%$ of extreme outlier noise, respectively. From these figures, it is evident that TTPUDR significantly outperforms LPP and PCA on the overall accuracy. With $10\%$ noise, TTPUDR generally achieves better performance at a lower reduced dimensionality, although this pace slows down slightly with $20\%$ extreme outlier noise. Therefore, we can conclude that TTPUDR is capable of capturing sufficient information in ultra high-dimensional data effectively and efficiently at a lower dimensionality. At both noise levels, TTPUDR performs better than both LPP and PCA. This shows that TTPUDR has significantly higher robustness to extreme outliers, owing to its adoption of the Frobenius-norm (F-norm) LPP objective.

In Figures 1c and 1d, we show the results for the case of using $20\%$ of the data for training, resulting in 482 samples of 1024 dimensions. Since the number of dimensions is larger than the number of samples, the assumption of LPP is violated. Thus, LPP cannot be executed and no result is available for it. However, TTPUDR can still operate and produces a more satisfactory OA than the other benchmark method, PCA. To sum up, TTPUDR is able to process and analyze the spatial structural information in ultra high-dimensional data effectively, even with a very small amount of training data. In terms of robustness, TTPUDR also performs better than the other executable method.
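Why LPP breaks down in this regime can be illustrated numerically. In the standard LPP formulation, which is assumed here, the projection solves a generalized eigenproblem whose right-hand matrix $\mathbf{X}\mathbf{D}\mathbf{X}^{\top}$ must be invertible, yet its rank cannot exceed the number of samples. The sketch below uses small synthetic sizes in place of the 1024 dimensions and 482 samples, and the function name is illustrative.

```python
import numpy as np

def lpp_constraint_rank(x, degrees):
    """Rank of X D X^T for data X of shape (d, n) and a degree vector of length n."""
    xdxt = (x * degrees) @ x.T       # equals X @ diag(degrees) @ X.T
    return np.linalg.matrix_rank(xdxt)

rng = np.random.default_rng(0)
degrees = rng.uniform(0.5, 1.5, size=12)   # positive stand-in degrees

x_wide = rng.normal(size=(30, 12))   # d = 30 > n = 12, as in the 20% split
x_tall = rng.normal(size=(8, 12))    # d = 8 < n = 12, as in the 60% split

rank_wide = lpp_constraint_rank(x_wide, degrees)   # bounded by n, so singular
rank_tall = lpp_constraint_rank(x_tall, degrees)   # full rank for generic data
```

When the rank falls below the dimension, the $d\times d$ matrix is singular and the generalized eigenproblem is not well posed without extra regularization or a preliminary reduction step.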

V Conclusions

This paper proposes a tensor-train parameterization for ultra dimensionality reduction. The dimensionality reduction mapping is tensorized to learn and preserve spatial information in multi-dimensional data and to increase the robustness of the model to extreme outliers. The method has been successfully illustrated on two real datasets. Its performance is comparable with that of the existing methods while using fewer parameters. It also outperforms other competitive models in the high-dimension, small-sample case and when a large proportion of the data contains extreme noise. In future research, we intend to extend it to a structure that can also capture and analyze the sequential relations in time series tensor data.
