Fast approximation of orthogonal matrices and application to PCA
Cristian Rusu, Lorenzo Rosasco

TL;DR
This paper introduces a fast approximation method for orthogonal matrices using extended Givens transformations, enabling efficient PCA computations with a good balance of accuracy and speed.
Contribution
It proposes a novel approximation technique for orthogonal matrices based on extended Givens transformations and an efficient greedy algorithm, improving computational speed for spectral methods.
Findings
The method achieves a good trade-off between accuracy and computational speed.
It effectively approximates orthogonal matrices for spectral methods like PCA.
Experimental results demonstrate improved efficiency in PCA applications.
Abstract
We study the problem of approximating orthogonal matrices so that their application is numerically fast and yet accurate. We find an approximation by solving an optimization problem over a set of structured matrices, that we call extended orthogonal Givens transformations, including Givens rotations as a special case. We propose an efficient greedy algorithm to solve such a problem and show that it strikes a balance between approximation accuracy and speed of computation. The approach is relevant to spectral methods and we illustrate its application to PCA.
| DATASET | FULL PCA | PROPOSED ALGORITHM | |||
|---|---|---|---|---|---|
| accuracy | accuracy | speedup(FLOPS) | speedup(TIME) | selection | |
| PENDIGITS | 95 0.7 | 91 2.1 | 1.6 | 1.1 | 1 |
| ISOLET | 92 0.4 | 90 1.0 | 12 | 10.1 | 1 |
| USPS | 95 1.2 | 94 0.8 | 7.7 | 4.7 | 0.64 |
| UCI | 90 1.9 | 87 1.5 | 2.5 | 1.6 | 0.72 |
| 20NEWS | 80 3.1 | 77 2.1 | 3.1 | 2.5 | 0.3 |
| EMNIST digits | 97 2.5 | 95 1.8 | 13 | 11.3 | 0.37 |
| MNIST 8m | 96 2.0 | 94 0.9 | 15 | 13.7 | 0.28 |
| DATASET: | ISOLET | USPS | EMNIST digits | MNIST 8m | |
|---|---|---|---|---|---|
| C language (TIME) | 10.1 | 4.7 | 11.3 | 13.7 | |
| BLAS (TIME) | 4.8 | 2.5 | 5.7 | 6.8 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMatrix Theory and Algorithms · Advanced Optimization Algorithms Research · Statistical and numerical algorithms
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Principal Components Analysis
Fast approximation of orthogonal matrices and application to PCA
Cristian Rusu
Faculty of Automatic Control and Computers, University Politehnica Bucharest
Lorenzo Rosasco
LCSL, Universitá di Genova
Massachusetts Institute of Technology and Istituto Italiano di Tecnologia
Fast approximation of orthogonal matrices
and application to PCA
Cristian Rusu
Faculty of Automatic Control and Computers, University Politehnica Bucharest
Lorenzo Rosasco
LCSL, Universitá di Genova
Massachusetts Institute of Technology and Istituto Italiano di Tecnologia
Abstract
We study the problem of approximating orthogonal matrices so that their application is numerically fast and yet accurate. We find an approximation by solving an optimization problem over a set of structured matrices, that we call extended orthogonal Givens transformations, including Givens rotations as a special case. We propose an efficient greedy algorithm to solve such a problem and show that it strikes a balance between approximation accuracy and speed of computation. The approach is relevant to spectral methods and we illustrate its application to PCA.
1 Introduction
Orthonormal transformations play a key role in most matrix decomposition techniques and spectral methods [4]. As such, manipulating them in an efficient manner is essential to many practical applications. While general matrix-vector multiplications with orthogonal matrices take space and time, it is natural to ask whether faster approximate computations (say ) can be achieved while retaining enough accuracy.
Approximating an orthonormal matrix with just a few building blocks is hard in general. The standard decomposition technique meant to reduce complexity is a low-rank approximation. Unfortunately, for an orthonormal matrix, which is perfectly conditioned, this approach is meaningless.
In this work, we are inspired by the fact that several orthonormal/unitary transformations that exhibit low numerical complexity are known. The typical example is the discrete Fourier transform with its efficient implementation as the fast Fourier transform [47] together with other Fourier-related algorithms: fast Walsh-Hadamard transforms [17], fast cosine transforms [33], and fast Hartley transforms [9]. Other approaches include fast wavelet transforms [5], banded orthonormal matrices [40, 44] and fast Slepian transforms [27]. Decomposition of orthogonal matrices into Householder reflectors or Givens rotations [20][Chapter 5.1] are known already. These basic building blocks have been further extended, for example we have fast Givens rotations [36] and two generalizations of the Givens rotations [6] and [35]. In theoretical physics, unitary decompositions parametrize symmetry groups [45], and they are compactly parametrized using -matrices [42] or symmetric positive definite matrices [3]. To the best of our knowledge, none of these factorizations focus on reducing the computational complexity of using the orthogonal/unitary transformations but rather they model properties of physical systems.
Our idea is to approximately factor any orthonormal matrix into a product of a fixed number of sparse matrices such that their application (to a vector) has linearithmic complexity. In this paper, we derive structured approximations to orthonormal matrices that can be found efficiently and applied remarkably fast. We pose the search for an efficient approximation as an optimization problem over a product of structured matrices, the extended orthogonal Givens transformations. These structures extend Givens rotations to also include reflectors with no computational drawback and suggest a decomposition of the main optimization problem into sub-problems that are easy to understand and solved via a greedy approach. The theoretical properties of the obtained solution are characterized in terms of approximation bounds while the empirical properties are studied extensively.
We illustrate our approach considering dimensionality reduction with principal component analysis (PCA). Here the goal is not to propose a new fast algorithm to compute the principal directions, plenty of efficient algorithms for this task [15][20][Chapter 10] [22]. Rather, we aim at constructing fast dimensionality reduction operators. While the calculation of the principal components is a one-off computation, a numerically efficient projection operation is critical since it is required multiple times in downstream applications. The problem of deriving fast projections has also been previously studied. Possible approaches include: fast wavelet transforms [49], sparse PCA [48, 52], structured transformations such as circulant matrices [24], Kronecker products [21], Givens rotations [12, 32, 34] or structured random projections [1, 18]. Compared to these works, we propose a new way to factorize any orthogonal matrix, including PCA directions, into simple orthogonal structures that we call extended orthogonal Givens transformations and which naturally lead to optimization problems that have closed-form solutions and are therefore efficiently computed.
We note that our approach provides new perspectives on the structure of the orthogonal group and how to coarsely approximate it, which might have an impact on other open research questions.
The paper is organized as follows: Section 2 describes the basic building blocks and algorithm that we propose, Section 3 gives the theoretical guarantees for our contributions, Section 4 details the application of our method to PCA projections and Section 5 shows the numerical experiments.
2 The proposed algorithm
Given a orthonormal , the matrix-vector multiplication takes operations. We want to build such that and takes operations. Parametrizations of orthonormal matrices [2] are known, but to be best of our knowledge, the problem of accurately approximating as product of only a few transformations is open. Given a diagonal , in the spirit of previous work minimizing Frobenius norm approximations [30, 31], we consider the problem
[TABLE]
where is and is a diagonal matrix. Choosing while and to be the identity, we simply approximate . We will also use do denote the first columns of . The above general formulation allows to also consider cases where different directions might have different importance. Then, is a set of orthogonal matrices – defined next – that can be applied fast and allow to efficiently but approximately solve (1).
2.1 The basic building blocks
Classic matrix building blocks that are numerically efficiency include circulant/Toeplitz matrices or Kronecker products. These choices are inefficient as they depend on free parameters but their matrix-vector product cost is or even , i.e., they do not scale linearly with the number of parameters they have. Consider the sparse orthogonal matrices
[TABLE]
where the non-zero part (denoted by and ) on rows and columns and . The transformation in (2), with the first option in , is a Givens (or Jacobi) rotation. With the second option, we have a very sparse Householder reflector. These transformations were first used by Rusu and Thompson [37] to learn numerically efficient sparsifying dictionaries for sparse coding. The s have the following advantages: i) they are orthogonal; ii) they are sparse and therefore fast to manipulate: matrix-vector multiplications take only operations; iii) there are two degrees of freedom to learn: (or ) and the binary choice; and iv) allowing both sub-matrices in enriches the structure and as we will see, leads to an easier (closed-form solutions) optimization problem.
We propose to consider matrices that are products of transformations from (2), that is
[TABLE]
Matrix-vector multiplication with takes operations – when is this is significantly better than , while the coding complexity of each is approximately : bits to encode the choice of the two indices, a constant factor for the pair and 1 bit for the choice between the rotation and reflector. The coding complexity of scales linearly with .
We note that Givens rotations have been used extensively to build numerically efficient transformations [11, 19, 30, 31, 32]. However, reflector was not used before. This may be because in linear algebra (e.g. in QR factorization) and in optimization [39] considering also the reflector has no additional benefit: the rotation alone introduces each zero in the QR factorization and the reflector does not have an exponential mapping on the orthogonal manifold, respectively. As we will show, considering both the rotation and the reflector has the advantage of providing a closed-form solution to our problem.
2.2 The proposed greedy algorithm
We propose to solve the optimization problem in (1) with a greedy approach: we keep and all variables fixed except for a single from and minimize the objective function. When optimizing w.r.t. it is convenient to write
[TABLE]
The next result characterizes the Givens transformation minimizing the above norm. We drop the dependence on for ease of notation.
Theorem 1** (Locally optimal ).**
Let and be two matrices. Further, let and then we have
[TABLE]
Let be the SVD of , with
[TABLE]
Then, the extended orthogonal Givens transformation that minimizes is given by
The above theorem derives a locally optimal way to construct an approximation . We iteratively apply the result to find, for each component in (3), the extended orthogonal Givens transformation that best minimizes the objective function (1). The full procedure is in Algorithm 1 and can be viewed in two different ways: i) a coordinate minimization algorithm; or ii) a hierarchical decomposition where each stage is extremely sparse. The proposed algorithm is guaranteed to converge, in the sense of the objective function (1), to a stationary point. Indeed, no step in the algorithm can increase the objective function, since the sub-problems are minimized exactly: we choose the best indices and then perform the best transformation. We note three remarks on the properties of Algorithm 1.
Remark 1** (Complexity of Algorithm 1).**
The computational complexity of the iterative part of Algorithm 1 is and the initialization is dominated by the computation of all the scores which takes . Note that, the s are computed from scratch only once in the initialization phase. After that, at each step we need to recompute the (redo the singular value decompositions) only for the indices currently used (the Givens transformations act on two coordinates at a time). All other scores are update by the same quantity: in (6), the are the same except when or belong to the set . This observation substantially reduces the running time.
Remark 2** (Complexity of applying ).**
When the computational complexity of operations is an upper bound. Since we keep only components, we need be careful not to perform operations whose result is thrown away by the mask . Consider for example a transformation applied to a vector of size projected to a dimensional space. The three operations that take place on the component are unnecessary. Then, after computing , a pass is made through each of the transforms to decide which of two coordinates the computations are necessary for the final result. As we will show, this further improves the numerical efficiency of our method.
Remark 3** (On the choice of indices).**
Algorithm 1 greedily chooses at each step the indices according to (6). Other factors might be considered: i) choosing indices based on previous choices so that only a select group of indices are used throughout the algorithm, or ii) make multiple choices at each step in order to speed up the algorithm.
3 Analysis of the proposed algorithm
We consider , i.e., and therefore . We model the as a random orthonormal matrix with Haar measure [25] updated so that the diagonal is positive. We perform this update because multiplication by a diagonal matrix with entries has no computational cost but it brings closer to . The goal of this section is to establish upper bounds for the distance between and , as a function of and . We first comment on the inherent difficulty of the problem.
Remark 4** (The approximation gap).**
Since the orthogonal group has size , by the pigeonhole principle a random orthogonal matrix as (3) and only degrees of freedom cannot be exactly approximated with less than operations. For our purposes, think either or . Our goal is to show that the fast structures we propose can perform well in practice and have theoretical bounds that guarantee worse case or average accuracy.
Next, we show two approximations bounds depending on the number of Givens transformations (2).
Theorem 2** (A special bound).**
Given a random orthonormal , for large , its approximation from (1) with transformations from (2) obeys
[TABLE]
Theorem 3** (A general bound).**
Given a random orthonormal , for large , its approximation from (1) with transformations from (2) is bounded by
[TABLE]
Theorem 2 shows that, on average, the performance might degrade with increasing . As stated in Remark 4, this is not surprising since the orthonormal group is much larger than the structure we are trying to approximate it with. The next result provides a bound for other values of . In Theorem 3, taking for some positive constant we have that . This means that whenever we will roughly need Givens transformations from (2) to improve the term. Since the proof of the theorem uses only rotations (and furthermore, in a particular order of indices ) we expect our algorithm to perform much better than the bound indicates as it allows for a richer structure (2) and uses greedy steps that maximally improve the accuracy at each step.
The previous theorems consider the Frobenius norm. In the Jacobi iterative process for diagonalizing a symmetric matrix with Givens rotations [20][Chapter 8.4] the progress of the procedure (convergence) is measured using the off-diagonal “norm” .
Theorem 4** (Convergence in the off-diagonal norm).**
Given a orthonormal and a single Givens transformation , assuming we have
[TABLE]
This result shows that, unlike with the Jacobi iterations, monotonic convergence in this quantity is not guaranteed and depends on the relative differences between the diagonal and the off-diagonal entries of . Our method convergence monotonically to a stationary point when we measure the progress in the Frobenius norm.
Remark 5** (The effect of a single ).**
Given a orthonormal and a Givens transformation from (2) we have that: i) is closer to the identity matrix in the sense that makes a positive contribution to the diagonal elements, i.e., ; and ii) if is random with Haar measure and positive diagonal for large .
The above remark suggests a metric to study the convergence of the proposed method: each Givens transformation adds the score to the diagonal entries of the current approximation (and therefore ensures that converges to the identity – the only diagonal orthonormal matrix). By choosing the maximum we are taking the largest step in this direction.
Remark 6** (Evolution of with ).**
Given a fixed , consider the toy construction , i.e., a sub-matrix of a orthonormal matrix where diagonal elements are equal and off-diagonals are two independent truncated standard normal random variables in the interval (since rows and columns of are normalized). Then, for large , by direct calculation we have that the expected score , i.e., as a function of , obeys
[TABLE]
*i.e., the expected decreases on average quadratically with the increase in the diagonal elements.
The remark is intuitive: as increases is more accurate and becomes diagonally dominant, i.e., as , and we do expect to reach lower scores , i.e., few Givens transformations from (2) provide a rough estimation while very good approximations require .*
3.1 Other ways to measure the approximation error
Throughout this paper we use the Frobenius norm to measure and study the approximation error we propose. In this section we discuss this choice and explore a few alternatives.
In the linear algebra literature, a natural way to measure approximation error is through the operator norm. This is especially true in the randomized linear algebra field (where matrix concentration inequalities which bound the operator norm play a central role). Moreover, the review manuscript [46] deals explicitly with some potential issues that might arise from using the Frobenius, instead of the operator, norm in matrix approximations: the discussion in Chapter 6 of [46] entitled “Warning: Frobenius–Norm Bounds”. The text highlights situations where the matrix to be approximated is low rank and corrupted by noise and/or scaling issues are present. While that discussion holds true, in our case we deal with a given perfectly conditioned matrix and its approximation displays the same scaling ( has normalized columns just as ). Consider the following remark.
Remark 7. [Simplifying the Frobenius norm objective function] Consider for simplicity and in (1) and see that , where is the angle between column vectors and (the columns of and , respectively), and are the diagonal elements of .
Therefore, the proposed optimization objective function minimizes the weighted sum of the cosines of the angles between the original columns and their approximations. As such, low-rank or scaling issues cannot arise. The only potential problem is that different columns might have very different approximation errors (i.e., there could exist such that while there could be some for which ). This issue can be mitigated by increasing in (3) or choosing carefully the indices where the proposed transformations operate.
Assume that we consider the operator norm as the approximation error, i.e., we want to minimize in (1). We have the following two results.
Remark 8. [The spectrum of the error matrix] Consider and in (1) then for any orthonormal and the error matrix is normal and has all its eigenvalues on a circle of radius one centered at in the complex plane. As such, we have that .
Theorem 5. [A bound on the operator norm of the error matrix] Consider and in (1) and that for all , then the operator norm of the error matrix obeys where . The bound in Remark 8 is met when .
Remark 8 describes the full spectrum of the error matrix. It is interesting to notice that the upper bound coincides with the expectation result from [13]: as if both and are chosen uniformly at random with Haar measure then almost surely . Theorem 5 describes an upper bound on the operator norm of the error matrix. The bound is tight only for very high values of and therefore is meant to give a qualitative measure of the approximation. As in Remark 7, the key quantity is but now the bound depends on the worst approximation: while the Frobenius norm objective function minimizes the sum of the pairwise distances between the columns of and , when using the operator norm the approximation accuracy depends on the largest distance between the same pairwise columns. The assumption that is not restrictive at all as we expect to be diagonally dominant with positive elements on the diagonal (see Remark 5 and the discussion after Remark 6). We note that in this context the Frobenius norm approach could be used to minimize a proxy for the operator norm. As in Remark 7 we have the choice of we could update these values with each iteration of the proposed algorithm such that where . This choice will encourage the algorithm to improve upon the worst pairwise column approximation (i.e., the highest ). This approach would be in the spirit of an iteratively reweighted least squares algorithm such as [14].
Finally, the last measure of accuracy we consider is the angle between two subspaces as described by [7]. This value can be non-zero only when , a case which is interesting for PCA projections and which we detail in the next section. Following [29], given and , whose columns span two -dimensional subspaces of size , we compute the angles between the two subspaces as
[TABLE]
where are the singular values of and where . We take the principal angle to be the largest angle above, i.e., .
Finally, we would like to note that all the approximation errors we have previously discussed measure (in different ways) how well approaches the identity matrix.
4 Application: fast PCA projections
Consider a training set of -dimensional points and the matrix . Given , PCA provides the optimal -dimensional projection that minimally distorts, on average, the data points. The projection is given by the eigenvectors of the largest eigenvalues of or, equivalently, the left singular vectors of the largest singular values of , that is
[TABLE]
Given the above decompositions we can approximate by , i.e., we keep but we modify the principal components and their singular values, such that we minimize the error given by
[TABLE]
where is , the diagonal matrix is , is of size , is and is zero except for its main diagonal. In this paper, we work with , as opposed to , to keep the relationship with linear, rather than quadratic. In the context of applying our approach to PCA, we use our decomposition on the principal components which we assume are already calculated together with the associated singular values which we may use as weights in (1).
Note that in (13), , which has size , is not necessary and that the two-step procedure is equivalent to computing the projections directly from . Also note that based on (12), we could factor where and are approximations in and , respectively. The difficulty here is the dependency of on which would require a large running time.
With fixed, for we have several strategies: i) set it to the identity, i.e., flatten the spectrum; ii) keep it to the original singular values ; or iii) continuously update it to minimize the Frobenius norm, i.e., get the new “singular values” that are optimal with the approximation . The first approach favors the accurate reconstruction of all components while the other approach favors mostly the few leading components only (depending on the decay rate of the corresponding singular values).
4.1 Comparison with the symmetric diagonalization by Givens rotations approach
Because of the locally optimal way the Givens transformations are chosen (see Theorem 1), our proposed factorization algorithm is computationally slower than the Jacobi diagonalization process which chooses the Givens rotations on indices corresponding to the largest off-diagonal entry of the covariance matrix. Furthermore, the Jacobi decomposition uses each rotation to zero the largest absolute value off-diagonal entry and because of this sub-optimal choice needs Givens rotations [10] to complete de diagonalization (more than the needed to fully represent the orthonormal group).
In all other aspects, our approach provides advantages over the Jacobi approach: i) we define a clear objective function that we locally optimize exactly; ii) it is known that the Jacobi process converges slowly when the number of rotations is low [28], which is exactly the practically relevant scenario we have, i.e., ; iii) with the same computational complexity, i.e., terms in the factorization, our proposed approach is always more accurate since we include as a special case the Givens rotations.
4.2 Comparison with structured matrix factorization
Our approach requires the explicit availability of the orthonormal principal directions . Previous methods that factor using only Givens rotations are not applied directly on an orthogonal matrix. These methods rely on receiving as input a symmetric object (e.g., ) and then using a variant of Jacobi iterations for matrix diagonalization [31] or multiresolution factorizations [30] to find the good rotations that approximate the orthonormal eigenspace. Applying Givens transformations directly to on the left, i.e., , cannot lead to the computation of the PCA projections but only to the polar decomposition. On the other hand, when applying Givens rotations on both sides of the covariance matrix, i.e., , then the right eigenspace cancels out in the product (12) and we are able to directly recover (but we need explicitly). Finally, note that the diagonalization process approximates the full eigenspace and cannot separate from the start the principal components because they are solving the following problem
[TABLE]
This formulation is useful to approximate the whole symmetric matrix (or ), but not necessarily the principal eigenspace . To get these we would need to complete the diagonalization process, find the largest entries on the diagonal of and then work backward to identify the rotations that contributed diagonalizing those largest elements. This procedure would be prohibitively expensive. Previous work, e.g. [30, 32], deals with approximating rather than computing PCA. In the same line of work, the one-sided Jacobi algorithm for SVD [16] can also be applied.
5 Experimental results
Given the rather pessimistic guarantees, we tackle problems: how well does Algorithm 1 recover random orthogonal matrices and principal components such that we benefit from the computational gains but do not significantly impact the approximation/classification accuracy. Source code available.111https://github.com/cristian-rusu-research/fast-orthonormal-approximation
5.1 Synthetic experiments
For fixed we generate random orthonormal matrices from the Haar measure [25]. Figure 2 shows the representation error for the proposed method. The plot shows that allowing for the Givens transformations in (2) brings a 17% relative benefit as compared to using only the Givens rotations while, for the same , the computational complexity is the same. The circulant approximation performs worst because it has the lowest number of degrees of freedom, only (computationally, it is comparable with the Givens and proposed approaches for ). Lastly, we can observe that the bound is very pessimistic, especially for these values of . In Figure 2 (left) we show for fixed number of transformations the progress that the proposed algorithm makes with each iteration. It is interesting to observe that the initialization steps (first steps) significantly decrease the approximation error while the other step make only moderate improvements. This indicates that convergence is slow and might take a large number of iterations with little progress made by the latter steps. Also in Figure 2 (right) we show the number of stages in each transformation. A stage is a set of extended orthogonal Givens transformations that can be applied in parallel, i.e., they do not share any indices among them. This is important from an implementation perspective as parallel processing can be exploited to speedup the transformations.
5.2 MNIST digits and fashion
We now turn to a classification problem. We use the MNIST digits and fashion datasets. The points have size (we trimmed the bordering whitespace) and we have training and test points. In all cases, we use the k-nearest neighbors (k-NN) algorithm with , and we are looking to correctly classify the test points. Before k-NN we apply PCA and our proposed method. Results are shown in Figure 4. We deploy two variants of the proposed method: approximate the principal components as if they had equal importance and approximate the principal components while simultaneously also updating estimates of the singular values.
For comparison, we also show the sparse JL [26]. In this case, the target dimension is while the random transformation of size only has three non-zero entries per column. More non-zeros did not have any significant effect on the classification accuracy while increasing (doubling, in this case) the target dimension increases the classification accuracy by 10%. The results reported in the plots are averages for 100 realizations and the standard deviation is below 1%. Of course, the significant advantage of the JL approach over PCA is that no training is needed. The disadvantage is that if we choose greedily the target projection dimension, i.e., low , the accuracy degrades significantly for JL. Results are identical for PCA when is of .
Since we are dealing with image data, we also project using the discrete cosine transform (DCT). For the digits dataset, the performance is poor but, surprisingly, for the fashion dataset, this approach is competitive given the large speedup it reports (we used the sparse fast Fourier transform [23] as we want only the largest components). We have also performed the projection by fast wavelet transforms with ‘haar’ and several of the Daubechies ‘dbx’ filters but the results were always similar to that of the DCT. For clarity of exposition, we did not add these results to the figures.
Our proposed methods report a clear trade-off between the classification accuracy and the numerical complexity of the projections. If we insist on an accuracy level close (within 1–2%) of the full PCA then the speedup is only about x3. Reasonable accuracy is obtained for a speedup of x4–x5 after which the results degrade quickly. For better performance seems impossible via randomization. In this figure, the speedup is measured in terms of the number of operations (FLOPS).
Finally, in Figure 7 (left) we compare our proposed methods against the sparse PCA on MNIST digits. sPCA performs exceptionally well in terms of the classification accuracy given the computational budget (a similar result is replicated for MNIST fashion). On other datasets where the principal components capture some global features (not local like in our example) we expect this performance to degrade. The training time of sPCA exceeds by 60% the running time of PCA plus that of our method. We used the implementation of Wang et al. [48].
Because of ideas in Remark 1, we perform only the finally useful calculations and further reduce the computational cost on average by one third (these are accounted for already in the numbers in the plots).
5.3 Experiments on other datasets
The 20-newsgroups dataset consists of articles from 20 newsgroups (approximately 1000 per class). The data set was tokenized using the rainbow package (www.cs.cmu.edu/ mccallum/bow/rainbow). Each article is represented by a word-count vector for the common words in the vocabulary. For this dataset we have , , and as shown in Figure 7 (right) in this case we outperform sparse PCA.
We also apply our algorithm to several other popular dataset from the literature: PENDIGITS (www.ics.uci.edu/$\sim$mlearn/MLRepository.html) with 10 classes and , , , ; ISOLET (archive.ics.uci.edu/ml/datasets/isolet) with 26 classes and , , , ; USPS (github.com/darshanbagul/USPS_Digit_Classification) with 10 classes and , , , ; UCI (ftp.ics.uci.edu/pub/machine-learning-databases/optdigits) with 10 classes and , , , ; EMNIST digits (nist.gov/itl/iad/image-group/emnist-dataset) with 10 classes and , , , ; MNIST 8m (leon.bottou.org/papers/loosli-canu-bottou-2006) with 10 classes and , , , . All the results are shown in Table 1. Here we provide two measures for the speedup: the FLOPS (number of arithmetic operations, additions and multiplications) and the actual running time (in seconds). For the time speedup, we first computed and then implemented the matrix-vector multiplication and compared it to the generic matrix-vector multiplication with , both compiled in C using the gcc compiler and –O3 flag. Once computed by the proposed algorithm, the transformation could also be hardcoded and a further speedup improvement could be achieved. The running time speedups are slightly below the FLOPS speedups due to overhead in modern CPUs (fetching data, register loading etc.). Aside the computational benefits of the proposed transformations we note that while complex instruction set computing machines are closing the speedup gap in terms of running time the price is higher energy consumption and more complex execution pipelines/circuitry.
In Figure 4, for the USPS dataset, we provide insights into the behavior of the proposed algorithm with each iteration and the number of stages in each transformation that we construct. Results are similar to the ones shown for the random orthogonal approximation. We observe again that the initialization step is very efficient in reducing the approximation error (especially when compared to the iterative process that follows) and that the number of stages that have to be applied sequentially can further improve the running time when implementing the proposed transformations.
Finally, one of the most famous applications of PCA is in the field of computer vision for the problem of human face recognition. The eigenfaces [41] approach was used successfully for face recognition and classification tasks. Here, we want to reproduce the famous eigenfaces by using the proposed methods. The original eigenfaces and their approximations (sparse eigenfaces [51]), with different and therefore different levels of detail, are shown in Figure 7.
5.4 Results on other approximation errors
In Section 3.1 we have explored several other ways to measure the approximation accuracy of the proposed algorithm. We note that in the proposed method we keep the Frobenius norm objective function but we also measure the operator norm, the angle between subspaces distance and the average/minimum/maximum correlations. All results for three datasets are shown in Figure 5, the choices of parameters are the same as in Section 5.3. The first observation is that the approximation errors increase with larger (as already clear from Remark 4 and Theorem 3), i.e, best results are obtained for UCI () and the worst for MNIST (). Second, note that improving the Frobenius norm error we also improve all the other error measures. Thirdly, note the similar behavior between the operator norm and subspace distance errors. It is interesting to observe that the two plots seem to suggest experimentally that which hold over a large array of number of Givens transformations . Lastly, the two right-most plots highlight the connection between the operator norm optimization and the maximization of the lowest coherence between columns and their approximations (as per Theorem 5).
Depending on the application at hand we can choose how to measure the approximation error.
5.5 Details of the implementation
The source code attached to this paper is written for the Matlab environment. Besides the implementation of Algorithm 1, we also provide the code to efficiently perform the matrix-vector multiplication with the proposed and , respectively (also taking into account Remark 3). Still, due to the characteristics of Matlab, our implementation is not faster than the dense matrix-vector multiplication “*”. To provide an appropriate comparison, we implement the matrix-vector multiplication with our structures in the compiled lower-level programming language C. We provide comparisons of this implementation against two scenarios: a vanilla matrix-vector multiplication (using the appropriate ordering of the loops to exploit locality) and the Basic Linear Algebra Subprograms (BLAS) Level 2 routines for matrix-vector multiplication (SGEMV). We use the single thread variant of BLAS as for the dimensions we consider the overhead of the parallel implementation is significant.
Applying the proposed transformations on batch features, i.e., matrix-matrix multiplications, would require careful application of the proposed transformations (2) to fully use the computing architecture and maximize performance.
6 Conclusions
This paper proposes a new matrix factorization algorithm for orthogonal matrices based on a class of structured matrices called extended orthogonal Givens transformations. We show that there is a trade-off between the computational complexity and accuracy of the approximations created by our approach. We apply our method to the approximation of a fixed number of principal components and show that, with a minor decrease in performance, we can reach significant computational benefits.
Future research directions include strengthening the theoretical guarantees since they are way above what we observe experimentally. Further, it would be of interest to improve the complexity of the proposed algorithm either by a parallel implementation or using randomization (e.g. computing a random subset of the scores). As an immediate application, it would be interesting to apply our decomposition to the recently proposed unitary recurrent neural networks [50].
Acknowledgment
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. Part of this work has been carried out at the Machine Learning Genoa (MaLGa) center, Universita di Genova (IT). Lorenzo Rosasco acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development). Cristian Rusu acknowledges support by the Romanian Ministry of Education and Research, CNCS-UEFISCDI, project number PN-III-P1-1.1-TE-2019-1843, within PNCDI III.
Supplementary materials
More on Remark 3. We say that the orthonormal group has degrees of freedom as it was established [2] that any orthonormal matrix can be factored into a product as
[TABLE]
where is a diagonal matrix with entries only in and the are Givens rotations, with angles , i.e., we have and in (2). To generate a random orthonormal we therefore need to generate random (which are with equal probability) and random angles . These angles are mutually independent and it is known that their joint density function is a random variable
[TABLE]
We define the beta random variable (and ) with density
[TABLE]
Because we are interested in the computational complexity of a random orthonormal matrix we focus on the following three special cases: i) do nothing: ; ii) permute coordinates: ; and iii) . The first two cases perform no operations while the last performs only 4 (as compared to 6 for a general rotation). Unfortunately, the joint density in (16) does not seem to have a simlpe closed-form expression. As such we show in Figure 8 numerical results of the probability distribution of over all angles , which we observe numerically that approaches an exponential distribution . Concentration around the three special cases does not occur and therefore a random orthonormal matrix will generally have computational complexity . For example, if we discretized the continuum of then the probability that a random is an approximate permutation matrix and therefore basically exhibits no numerical complexity is for , i.e., the probability that all rotations have .
Proof of Theorem 1. For simplicity of exposition we will drop the sub-index herein and therefore (4) develops to the following
[TABLE]
where the Frobenius norms reduces to the norms of the spectra and we have used the circular permutation property of the trace. It is convenient to denote and the matrix . Given that performs operations only on rows and , the trace is
[TABLE]
To minimize the quantity in (18) we have to maximize (19) which is known as a Procrustes problem [38] whose solution is given by the polar decomposition of detailed in [20][Chapter 9.4.3]. Therefore, we set the optimal transformation to
[TABLE]
where use the SVD of ( are the singular values). With this choice, we have
[TABLE]
We denote the nuclear norm , i.e., the sum of the singular values and and we define
[TABLE]
Intuitively, the results (18), (19) follows after observing that: 1) the can be viewed as a perturbed identity matrix; 2) if is exactly then while if is the optimal orthonormal transformation that minimizes (18) given by the Procrustes solution [38] then we have , where the last term is the nuclear norm of ; 3) therefore, we actually apply the identity transformation on all coordinates, i.e., the term, while for the two chosen coordinates we apply the best (in the sense of reducing the error) orthogonal transformation whose contribution is the nuclear norm term and then correct for the trace term that was wrongly added initially in , by subtracting .
There are quantities but they can be computed efficiently by noting that the singular values of are . Observe that both singular values are of the form which can be written as where and . Written like this we can see that . Depending on the sign of the determinant we have either or , respectively. These give the final formulas for from (5).
Proof of Theorem 2. Assume is even and partition the set into pairs of indices . We therefore have
[TABLE]
As a side note, to maximization of this partitioned quantity is related to the weighted maximum matching algorithm (of maximum-cardinality matchings) on the graph with nodes and with edge weights . With this choice of indices, the objective function becomes
[TABLE]
We use the singular value decomposition of a generic , and develop:
[TABLE]
where we have use the circular property of the trace and where are its diagonal entries which obey . We define the diagonal coherence as
[TABLE]
With (25), we can state that for the transformation that
[TABLE]
where we have used the fact that and have the structure in (2) and therefore
[TABLE]
Finally, given an orthonormal of size , we use (24) and (27) to bound
[TABLE]
where we have used that because the diagonal elements of can be viewed as Gaussian random variables with zero mean and standard deviation (as the columns of are normalized in the norm) [43] and because the norm of a standard Gaussian random vector of size is .
When is odd, we extend the matrix with a zero column/row and “1” on the diagonal and the argument follows in the same way. In (24) we could use the expected value calculated in (37) but we reach a worse, lower, constant in (28) for the term and therefore a worse overall bound.
Proof of Theorem 3. Given the orthonormal , by [20][Theorem 5.2.1], we can construct its QR factorization using a set of Givens rotations [20][Chapter 5.2.5]. After introducing zeros in the first columns of , by left multiplication with Givens rotations, we reach its following partial triangularization
[TABLE]
where the diagonal matrix of size has entries and of size is orthonormal. To introduce the zeros on the column we need Givens rotations and therefore to bring to the structure in (29) we need Givens rotations which we have denoted . We are exploiting the fact that the triangularization of an orthogonal matrix leads to a diagonal. Then we might consider a good approximation to the product . where with and is taken from (29). The goal of the diagonal matrix is to ensure that the product in has a nonnegative diagonal. Then, given transforms we can bound
[TABLE]
If we consider in (2) instead of the rotations then the quantity on the right becomes an upper bound, since Givens rotations are a special case of – we can always initialize the of Algorithm 1 with the defined above and the iterative procedure is guaranteed not to worsen the factorization. Therefore, the result follows after using .
Proof of Theorem 4. First, we introduce the off-diagonal “norm”, i.e., the square-root of the squared sum of the off-diagonal elements of an orthonormal matrix as
[TABLE]
Better approximations of lead to lower , as approaches the identity. If we use this measure, we reach the following result.
We use the fact that is added to the diagonal of , but only to and . The quantity is minimized for when (and therefore ) which leads to
[TABLE]
We now use de explicit formulas for the singular values of a matrix and the fact that
[TABLE]
and expand this expression to get in (32)
[TABLE]
Therefore, to guarantee we need which is equivalent to
[TABLE]
In this paper we assume that is taken randomly from the Haar measure [25] and then modified to have positive diagonal. Therefore, we have for all by construction, otherwise we would just consider the update . Moreover, as the algorithm progresses we continue to have that because each adds a positive amount (the value) to the diagonal elements and thus converging towards in the sense of (36).
In a similar way we can construct a lower bound. Assuming w.l.o.g. that , the quantity is maximized when and therefore which, similarly to (32), leads to
[TABLE]
More on Remark 4. Understanding the properties of these is of crucial importance for our approach. First, notice the effect one generalized Givens transformation has: we have , which together with (18) leads to
[TABLE]
In this sense, the “pushes” towards the identity matrix by “contributing” to the diagonal of , i.e., we are estimating the inverse of which in this case is just the transpose.
Notice that . The minimum is achieved for symmetric positive semidefinite matrices (because in this case the eigenvalues and singular values are the same and therefore the nuclear norm equals the trace) and the maximum for . This immediately leads to a local optimality condition for our approach: there is no to improve the approximation if all are symmetric positive definite. Assume now we are given a random orthogonal matrix. Because the singular values depend on the entries of which we model as Gaussian random variables with zero mean and standard deviation [43], we have by direct calculation that and which leads to . The trace is the sum of two absolute value Gaussian random variables and therefore which leads to
[TABLE]
More on Remark 5. To see how the scores depend on the diagonal entries of our toy model, by direct calculation with the truncated standard Gaussian random variables we have that
[TABLE]
We show in Figure 9 the empirical results (mean and standard deviation of ) on the toy matrix for . The empirical mean follows the approximation in Remark 5 (and is tight for ) while we notice that the variance is high for almost the whole interval. In Figure 10 we show the average (left) and maximum (right) costs achieved for another toy model where the diagonal elements are distinct for .
Finally, notice that when we have if and zero otherwise, and when we have indicating that skew symmetric sub-matrices have higher than symmetric ones, in general.
Proof of Remark 7. We use the fact that columns of and have unit norm, and the fact that the Frobenius norm is entrywise:
[TABLE]
Proof of Remark 8. First, note that given any proper norm which is invariant to orthonormal transformations, we have that and we denote the error matrix . Now, start from the error matrix and use the fact that is orthonormal to notice by straightforward calculation that:
[TABLE]
Denote now a complex-valued eigenvalue of by , then because of the equality above we have that which in turn can be written as or . This is the equation of a circle of radius one centered at in the complex plane. Another way to view this result is to observe that is orthonormal an therefore its spectrum is on the unit circle and then shifts the whole spectrum to the right by one unit.
Moreover, we also have that and therefore which means that is a normal matrix. As such, the singular values of are the absolute values of its eigenvalues and therefore .
Proof of Theorem 5. The proof is based on the Gershgorin circle theorem (detailed in Chapter 7.2 of [20]) applied to the error matrix . If we denote , we then have for the eigenvalue of that
[TABLE]
The last inequality on the right hand side comes from the inequality applied to the vectors of size (the rows of except for their diagonal elements and whose norm is ). Returning to (41), for the term on the left hand side we have by the reverse triangle inequality () that
[TABLE]
Finally combining (41) and (42) we have that
[TABLE]
This inequality holds for and . The expression on the right-hand side is monotonic decreasing for . As such, the largest upper bound happens for . And because, as shown in Remark 8, the error matrix is a normal matrix we get a bound on the operator norm (the singular values of a normal matrix are the absolute values of its eigenvalues).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ailon and Chazelle [2006] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing , pages 557–563, 2006.
- 2Anderson et al. [1987] T. Anderson, I. Olkin, and L. Underhill. Generation of random orthogonal matrices. SIAM Journal on Scientific and Statistical Computing , 8(4):625–629, 1987.
- 3Barvinok [2006] A. Barvinok. Approximating orthogonal matrices by permutation matrices. Pure and Applied Mathematics Quarterly , 2:943–961, 2006.
- 4Belabbas and Wolfe [2009] M.-A. Belabbas and P. J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences , 106(2):369–374, 2009.
- 5Beylkin et al. [1991] G. Beylkin, R. Coifman, and V. Rokhlin. Fast wavelet transforms and numerical algorithms I. Communications on Pure and Applied Mathematics , 44(2):141–183, 1991.
- 6Biloti et al. [2013] R. Biloti, L. C. Matioli, and J. Yuan. A short note on a generalization of the Givens transformation. Computers & Mathematics with Applications , 66(1):56–61, 2013.
- 7Björck and Golub [1973] Å. Björck and G. H. Golub. Numerical methods for computing angles between linear subspaces. Mathematics of Computation , 27(123):579–594, 1973.
- 8Borel [1906] E. Borel. Introduction geometrique a quelques theories physiques . Gauthier-Villars, 1906.
