TL;DR
This paper introduces a fast, stable, and data-efficient domain adaptation method that finds a domain-invariant subspace using low-rank techniques, improving generalization in tasks like text and image classification.
Contribution
It proposes a novel low-rank subspace override method that computes a domain-invariant subspace in closed form, reducing complexity and data requirements compared to existing approaches.
Findings
Achieves competitive performance on text and image classification tasks.
Requires only a single data snapshot for domain adaptation.
Offers a fast and stable alternative to complex existing methods.
Tab. 1: Overview of the datasets.

| Dataset | Subsets | #Samples | #Features | #Classes |
|---|---|---|---|---|
| Caltech | C | 1123 | 800 (4096) | 10 |
| Office | A,W,D | 1123 | 800 (4096) | 10 |
| Newsgroup | Comp,Rec,Sci,Talk | 4857,3967,3946,3250 | 25804 | 2 |
| Reuters | Orgs,People,Places | 1237,1208,1016 | 25804 | 2 |

Tab. 2: Mean accuracy (%) on Newsgroup.

| Dataset | SVM | TCA | JDA | GFK | SA | CORAL | CGCA | SCA | EasyTL | JGSA | MEDA | NSO (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Comp vs Rec | 77.6 | 78.9 | 83.1 | 75.1 | 78.5 | 79.4 | 84.0 | 56.1 | 42.2 | 88.3 | 49.1 | 90.2 |
| Comp vs Sci | 71.1 | 62.0 | 75.5 | 64.1 | 80.2 | 71.8 | 73.2 | 72.4 | 25.2 | 78.4 | 49.2 | 98.4 |
| Comp vs Talk | 84.4 | 75.0 | 87.7 | 83.8 | 91.1 | 90.5 | 87.0 | 89.5 | 41.1 | 91.2 | 54.4 | 96.7 |
| Rec vs Sci | 69.3 | 79.6 | 79.0 | 64.4 | 81.1 | 75.0 | 74.0 | 71.4 | 33.8 | 80.5 | 50.0 | 99.0 |
| Rec vs Talk | 74.5 | 86.6 | 82.0 | 72.9 | 79.7 | 81.6 | 77.3 | 78.3 | 41.6 | 80.8 | 55.0 | 96.4 |
| Sci vs Talk | 70.9 | 77.6 | 70.5 | 64.2 | 76.0 | 74.2 | 69.0 | 72.2 | 41.1 | 77.7 | 53.7 | 96.4 |
| Mean | 74.6 | 76.6 | 79.6 | 70.7 | 81.1 | 78.7 | 77.4 | 73.3 | 37.5 | 82.8 | 51.9 | 96.2 |

Tab. 3: Mean accuracy (%) on Reuters.

| Dataset | SVM | TCA | JDA | GFK | SA | CORAL | CGCA | SCA | EasyTL | JGSA | MEDA | NSO (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Orgs vs People | 78.1 | 79.5 | 76.6 | 75.3 | 99.9 | 77.5 | 78.0 | 77.8 | 39.2 | 76.5 | 48.0 | 99.6 |
| People vs Orgs | 79.2 | 82.7 | 80.0 | 71.6 | 99.9 | 78.2 | 78.6 | 79.8 | 37.9 | 74.2 | 47.3 | 98.5 |
| Orgs vs Place | 69.2 | 72.9 | 70.0 | 60.5 | 97.3 | 70.3 | 70.1 | 69.8 | 28.9 | 72.2 | 43.2 | 98.6 |
| Place vs Orgs | 66.3 | 71.1 | 65.6 | 61.5 | 97.2 | 66.5 | 67.7 | 65.3 | 27.0 | 64.4 | 41.4 | 97.2 |
| People vs Place | 55.7 | 57.4 | 57.0 | 57.5 | 97.4 | 57.8 | 57.0 | 57.3 | 22.4 | 52.6 | 40.9 | 97.4 |
| Place vs People | 57.4 | 48.9 | 60.7 | 56.2 | 97.4 | 56.3 | 54.4 | 58.2 | 18.3 | 55.5 | 38.5 | 97.4 |
| Mean | 67.7 | 68.7 | 68.3 | 63.8 | 98.1 | 67.7 | 67.6 | 68.0 | 28.9 | 65.9 | 43.2 | 98.1 |

Tab. 4: Mean accuracy (%) on Office-Caltech with SURF features.

| Dataset | SVM | TCA | JDA | GFK | SA | CORAL | CGCA | SCA | EasyTL | JGSA | MEDA | NSO (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C vs A | 53.1 | 53.9 | 55.2 | 41.8 | 52.2 | 52.1 | 54.1 | 33.1 | 50.1 | 51.8 | 56.5 | 88.5 |
| C vs W | 41.7 | 42.4 | 46.8 | 40.7 | 18.3 | 38.6 | 43.1 | 24.9 | 49.5 | 46.1 | 53.9 | 81.0 |
| C vs D | 47.8 | 46.5 | 49.7 | 39.5 | 15.9 | 36.3 | 37.6 | 33.1 | 48.4 | 44.6 | 50.3 | 79.0 |
| A vs C | 41.7 | 45.4 | 43.5 | 39.0 | 60.0 | 45.1 | 44.9 | 26.3 | 43.0 | 39.7 | 43.9 | 61.5 |
| A vs W | 31.9 | 37.6 | 44.4 | 36.9 | 29.2 | 44.4 | 43.9 | 27.6 | 40.7 | 46.1 | 53.2 | 81.0 |
| A vs D | 44.6 | 40.1 | 31.2 | 33.1 | 28.0 | 39.5 | 36.3 | 25.5 | 38.9 | 47.8 | 45.9 | 79.0 |
| W vs C | 21.2 | 31.2 | 31.5 | 27.4 | 23.2 | 33.7 | 33.8 | 15.6 | 29.7 | 30.2 | 34.2 | 63.5 |
| W vs A | 27.6 | 34.7 | 31.7 | 31.2 | 29.5 | 35.9 | 37.6 | 21.1 | 35.2 | 40.0 | 42.7 | 95.8 |
| W vs D | 78.3 | 83.4 | 92.4 | 82.8 | 78.3 | 86.6 | 88.5 | 41.4 | 77.1 | 91.1 | 88.5 | 79.0 |
| D vs C | 26.5 | 36.2 | 32.6 | 27.2 | 21.9 | 33.9 | 35.4 | 17.2 | 31.3 | 30.3 | 34.8 | 66.6 |
| D vs A | 26.2 | 37.1 | 36.7 | 30.9 | 26.5 | 37.7 | 38.9 | 17.2 | 31.9 | 38.2 | 40.6 | 93.1 |
| D vs W | 52.5 | 83.1 | 88.5 | 71.9 | 89.8 | 84.7 | 87.1 | 32.5 | 69.5 | 91.5 | 87.5 | 83.1 |
| Mean | 41.1 | 47.6 | 48.7 | 41.9 | 39.4 | 47.4 | 48.4 | 26.3 | 45.4 | 49.8 | 52.7 | 79.3 |
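
Tab. 5: Mean accuracy (%) on Office-Caltech with DeCaf features. Columns SVM through NSO (ours) are traditional methods; columns Alexnet through Deep-CORAL are deep domain adaptation methods.

| Dataset | SVM | TCA | JDA | GFK | SA | CORAL | CGCA | SCA | EasyTL | JGSA | MEDA | NSO (ours) | Alexnet | DDC-MMD | JAN | DAN | Deep-CORAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|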
| C vs A | 90.6 | 90.2 | 92.4 | 85.6 | 92.0 | 91.5 | 90.1 | 48.0 | 90.2 | 92.1 | 93.5 | 88.9 | 92.5 | 92.5 | 93.4 | 92.9 | 92.8 |
| C vs W | 79.0 | 78.3 | 81.7 | 76.6 | 73.2 | 78.6 | 75.9 | 35.3 | 76.9 | 86.4 | 93.6 | 81.3 | 74.8 | 74.9 | 85.0 | 86.6 | 84.3 |
| C vs D | 83.4 | 89.8 | 87.3 | 82.8 | 79.0 | 84.7 | 85.4 | 46.1 | 81.5 | 92.4 | 93.0 | 79.0 | 74.9 | 74.8 | 83.0 | 82.6 | 78.1 |
| A vs C | 81.9 | 81.2 | 82.7 | 76.6 | 83.8 | 83.2 | 81.6 | 43.0 | 81.7 | 85.1 | 87.5 | 61.6 | 85.3 | 84.9 | 84.1 | 84.1 | 80.0 |
| A vs W | 74.2 | 78.0 | 72.9 | 67.8 | 77.3 | 75.9 | 71.2 | 36.5 | 74.2 | 79.0 | 88.1 | 81.2 | 65.1 | 65.2 | 85.5 | 84.5 | 84.3 |
| A vs D | 80.9 | 80.9 | 79.6 | 73.9 | 81.5 | 81.5 | 74.8 | 43.6 | 84.7 | 79.6 | 91.1 | 79.0 | 78.0 | 75.8 | 83.3 | 85.4 | 65.6 |
| W vs C | 63.0 | 69.5 | 74.0 | 61.1 | 76.0 | 67.9 | 73.7 | 27.9 | 66.3 | 84.9 | 88.3 | 63.6 | 70.9 | 69.8 | 78.4 | 78.6 | 60.8 |
| W vs A | 73.8 | 74.6 | 79.7 | 71.2 | 86.1 | 76.0 | 80.5 | 29.8 | 73.6 | 90.3 | 93.1 | 96.2 | 80.0 | 77.6 | 84.5 | 83.4 | 73.6 |
| W vs D | 100.0 | 100.0 | 100.0 | 100.0 | 98.7 | 100.0 | 100.0 | 51.1 | 98.1 | 100.0 | 100.0 | 79.0 | 98.8 | 98.8 | 99.7 | 99.5 | 99.4 |
| D vs C | 52.7 | 68.8 | 80.2 | 61.2 | 75.9 | 68.0 | 75.5 | 24.2 | 69.1 | 85.0 | 87.1 | 66.7 | 77.3 | 77.8 | 79.6 | 78.1 | 66.5 |
| D vs A | 62.5 | 79.7 | 88.9 | 69.5 | 87.3 | 77.2 | 86.9 | 26.2 | 76.3 | 91.9 | 93.2 | 92.8 | 82.8 | 82.3 | 84.4 | 85.1 | 77.4 |
| D vs W | 89.8 | 97.6 | 99.3 | 98.6 | 95.6 | 98.3 | 99.0 | 33.7 | 93.9 | 99.7 | 99.0 | 83.1 | 99.0 | 98.8 | 98.7 | 98.6 | 99.0 |
| Mean | 77.7 | 82.4 | 84.9 | 77.1 | 83.9 | 81.9 | 82.9 | 37.1 | 80.5 | 88.9 | 92.3 | 79.4 | 81.6 | 81.1 | 86.6 | 86.6 | 80.1 |

Tab. 6: Mean runtime in seconds of the subspace domain adaptation methods.

| Dataset | TCA | JDA | GFK | SA | CORAL | CGCA | SCA | JGSA | MEDA | NSO (ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| Newsgroup | 21.4 | 4.8 | 214.4 | 59.7 | 705.8 | 11977.0 | 59.0 | 3637.0 | 3447.0 | 2.64 |
| Reuters | 6.5 | 1.5 | 2.6 | 3.0 | 15.4 | 225.6 | 14.8 | 122.1 | 53.2 | 0.6 |
| CO - Surf | 3.2 | 0.9 | 0.6 | 0.7 | 0.4 | 6.4 | 12.2 | 10.8 | 6.3 | 0.2 |
| CO - Decaf | 1.8 | 0.4 | 1.1 | 1.3 | 10.6 | 99.8 | 10.3 | 79.8 | 45.0 | 0.2 |
| Overall | 8.2 | 1.9 | 54.7 | 16.2 | 183.1 | 3077.2 | 24.1 | 962.4 | 887.9 | 0.9 |
| Dataset | SO | NSOuniform | NSOclasswise | NSOker |
|---|---|---|---|---|
| Reuters | 94.8 | 97.6 | 97.6 | 80.8 |
| Newsgroup | 93.0 | 96.1 | 97.4 | 94.3 |
| CO - Surf | 79.3 | 79.1 | 79.3 | 56.5 |
| CO - Decaf | 79.2 | 79.4 | 79.4 | 76.4 |
| Overall | 86.2 | 88.1 | 88.4 | 77.0 |
University for Applied Sciences Würzburg-Schweinfurt, Sanderheinrichsleitenweg 20, Würzburg, Germany, {christoph.raab,frank-michael.schleif}@fhws.de
Low-Rank Subspace Override for Unsupervised Domain Adaptation
Christoph Raab1
Frank-Michael Schleif1
Abstract
Current supervised learning models cannot generalize well across domain boundaries, which is a known problem in many applications, such as robotics or visual classification. Domain adaptation methods are used to improve these generalization properties. However, these techniques are either restricted to a particular task, such as visual adaptation, require a lot of computational time and data, which are not always available, or have complex parameterization or expensive optimization procedures. In this work, we present an approach that requires only a well-chosen snapshot of data to find a single domain-invariant subspace. The subspace is calculated in closed form and overrides domain structures, which makes it fast and stable in parameterization. By employing low-rank techniques, we emphasize descriptive characteristics of the data. The presented idea is evaluated on various domain adaptation tasks, such as text and image classification, against state-of-the-art domain adaptation approaches and achieves remarkable performance across all tasks.
Keywords:
Transfer Learning · Domain Adaptation · Singular Value Decomposition · Nyström Approximation · Subspace Override
1 Introduction
Supervised learning and, in particular, classification is an essential task in machine learning with a broad range of applications. The obtained models are used to predict the labels of unseen test samples. A basic assumption in supervised learning is that the underlying domain or distribution is not changing between training and test samples. If the domain is changing from one task to a related but different task, one would like to reuse the available learning model. Domain differences are quite common in real-world scenarios and, eventually, lead to substantial performance drops [35].
In image classification, a domain adaptation problem exists when the source and target data come from different cameras, as shown in Fig. 1. The problem arises because cameras have different rendering and focus properties, so the characteristics of training and evaluation images differ. More formally, let $\mathbf{X}_s \in \mathbb{R}^{m \times d}$ be $m$ source data samples in a $d$-dimensional feature space from the source domain distribution $p(\mathbf{X}_s)$ with labels $\mathbf{Y}_s$, and let $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ be $n$ target samples from the target domain distribution $p(\mathbf{X}_t)$ with labels $\mathbf{Y}_t$. Traditional machine learning assumes similar distributions, i.e. $p(\mathbf{X}_s) \sim p(\mathbf{X}_t)$, but domain adaptation assumes different distributions, i.e. $p(\mathbf{X}_s) \neq p(\mathbf{X}_t)$.
Various domain adaptation techniques have already been proposed, following different strategies and improving the prediction performance of underlying classification algorithms in test scenarios [35, 22]. State of the art domain adaptation approaches [38, 34, 19, 7, 17] require a large number of source or target samples, which is indeed a disadvantage of many domain adaptation approaches and is not guaranteed in restricted environments where labeling is expensive [35]. In this work, we show that only a well-chosen subset of samples is necessary to approximate domain structures.
Despite the popularity of kernelized subspace adaptations [34, 16, 38] or manifold embeddings [10, 34, 21, 7] for domain alignment, it was shown in [2, 6] that least-squares approaches, where domain differences are explicitly solved using least squares to find a common subspace, are at least competitive with more complicated settings. Solutions to least-squares problems are intuitive and theoretically justified. However, if both domains do not lie in a common subspace, this technique fails to transfer knowledge effectively [26]. We address this problem and compute a domain-invariant subspace, where both domains are explicitly part of the target subspace, which avoids the mentioned drawback.
The main contribution of this work is to derive a subspace closed-form solution of the least-squares domain adaptation problem by finding a suitable domain invariant projection operator called Subspace Override (SO). The approach constructs a target subspace representation for both domains, which transfers target basis information to source data. We show that a well-chosen snapshot of the data is sufficient to approximate the domain characteristics by approximating the optimal solution of the least-squares problem. For the first time in domain adaptation, a Nyström approximation is used on subspace domain adaptation. The resulting method has a better prediction performance with stable parameterization and is easy to apply. Further, it is the fastest subspace domain adaptation algorithm in terms of computational complexity compared to related approaches, while maintaining its very good performance.
The rest of the paper is organized as follows: We give an overview of related work in Sec. 2. The underlying mathematical concepts are given in Appendix B. The proposed approach is discussed in Sec. 3, followed by an experimental part in Sec. 4, addressing the classification performance, computational time and the stability of the approach. A summary with a discussion of open issues is provided in the conclusion at the end of the paper. Source code, including all experiments and plots, is available at https://github.com/ChristophRaab/nso.
2 Related Work
In general, homogeneous transfer learning [35] or domain adaptation (DA) approaches, distinguish roughly between the following strategies:
Feature adaptation techniques [35] try to find a common latent subspace for source and target domain to reduce distribution differences, such that the underlying structure of the data is preserved in the subspace. A baseline approach for feature adaptation is Transfer Component Analysis (TCA) [21]. TCA finds a suitable subspace transformation, called transfer components, by minimizing the Maximum Mean Discrepancy (MMD) in the Reproducing Kernel Hilbert Space (RKHS). Joint Distribution Adaptation (JDA) [16] also considers MMD but incorporates class-dependent distributions. These works consider a subspace projection based on a combined eigendecomposition for both domains, which fails to include domain-specific attributes in the subspace. The Joint Geometrical Subspace Alignment (JGSA) [38] tackles this issue by searching MMD-based subspaces for the domains individually. However, these methods rely on kernels, are not able to explore the full characteristics of the original feature space, and are computationally intensive. The proposed work operates in the original feature space and uses only a snapshot of the data for computational efficiency.
Least-squares (LS) adaptation is closely related to our work. It aligns both domains by finding a solution to the LS problem and using this solution as a feature transformation matrix. The transformation directly modifies the data or finds a subspace projection based on the eigenvectors of the domains. Subspace Alignment (SA) [6] computes a target subspace representation by direct modification of the correlation matrices of both domains. The Correlation Alignment (CORAL) [28] technique transfers second-order statistics of the target domain into whitened source data and projects source and target data via principal component analysis (PCA) into the subspace. The Landmarks Selection-based Subspace Alignment (LSSA) [1] is a successor of SA and selects only a subset of both domains that lies near the domain borders in order to align these borders in the subspace explicitly. However, LSSA cannot capture the whole domain characteristic, and in supervised classification problems, the landmark sample is prone to omit class information. Our work considers a uniform and a class-wise sampling strategy to capture the whole domain.
Shao et al. [27] argued that least-squares approaches, as above, are unable to adapt effectively, because the source and target data may not lie in a single subspace. In this work, we override the orthogonal basis of the source domain with the target one. With this, we model the source subspace as part of the target subspace, and subspace differences do not exist because both must lie in the same subspace by construction.
The considered domain adaptation methods have approximately a complexity of , where is the highest number of samples concerning target or source. All these algorithms require some unlabeled test data to be available at training time. These transfer-solutions cannot be directly used as predictors, but instead, are wrappers for classification algorithms.
3 Subspace Override
The task of domain adaptation is to align distribution differences with the goal that underlying statistics will be similar afterward. As in prior work [1, 13, 5, 6, 27, 37, 17, 24], we assume that similar matrices will lead to similar distributions. Hence, we strive for aligning the domain data matrices in a suitable subspace and model the source data to be part of the target data, and therefore it must be in the same (single) subspace.
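To draw both domains closer together, represented by their respective samples $\mathbf{X}_s$ and $\mathbf{X}_t$, consider the following optimization problem
$$\min_{\mathbf{M}} \|\mathbf{M}\mathbf{X}_s - \mathbf{X}_t\|_F^2 \quad \text{s.t.} \quad \mathbf{M}^T\mathbf{M} = \mathbf{I}. \tag{1}$$
The goal is to learn $\mathbf{M}$ to adapt $\mathbf{X}_s$ to the target domain. Further, we also make sure that the obtained projection operator $\mathbf{M}$ is an orthogonal basis. This formulation has two flaws.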
First, if the sample sizes of source and target are not the same, i.e. $m \neq n$ for $m$ source and $n$ target samples, the above formula is invalid. We address the problem with a simple data augmentation strategy. If $m < n$, $\mathbf{X}_s$ is enriched by sampling new source data from the estimated Gaussian distribution of $\mathbf{X}_s$ and assigning random source labels until $m = n$. If $m > n$, source samples are randomly removed until the sample sizes are equal. Hence, from now on we assume $m = n$.
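The size-equalization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `equalize_sample_sizes` is hypothetical, and fitting a full-covariance Gaussian and drawing the new labels from the existing source labels are assumptions, since the text only states that a Gaussian is estimated and that random source labels are assigned.

```python
import numpy as np

def equalize_sample_sizes(X_s, y_s, n_target, seed=0):
    """Return source data and labels with exactly n_target rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m = X_s.shape[0]
    if m < n_target:
        # Assumption: estimate a full-covariance Gaussian from the source rows.
        mean = X_s.mean(axis=0)
        cov = np.atleast_2d(np.cov(X_s, rowvar=False))
        extra_X = rng.multivariate_normal(mean, cov, size=n_target - m)
        # Assumption: new rows get labels drawn at random from the source labels.
        extra_y = rng.choice(y_s, size=n_target - m)
        return np.vstack([X_s, extra_X]), np.concatenate([y_s, extra_y])
    if m > n_target:
        # Randomly discard source rows until the sizes match.
        keep = rng.choice(m, size=n_target, replace=False)
        return X_s[keep], y_s[keep]
    return X_s, y_s
```

For high-dimensional data such as the text datasets, estimating a full covariance matrix would be expensive; a diagonal covariance would be one possible substitute.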
Second, (1) prevents effective domain adaptation, because the transformation $\mathbf{M}$ may project the data into different spaces [27]. However, if we model $\mathbf{M}$ to be directly related to the target domain, the projection operator will be domain invariant. To obtain this kind of solution for problem (1), it must be rewritten such that the source data is part of the target subspace.
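Let us consider the relationship between singular value decomposition and eigendecomposition and rewrite the PCA in terms of SVD. Given a rectangular matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, we can rewrite the eigendecomposition to
$$\mathbf{X} = \mathbf{L}\mathbf{S}\mathbf{R}^T, \tag{2}$$
$$\mathbf{K} = \mathbf{X}\mathbf{X}^T = (\mathbf{L}\mathbf{S}\mathbf{R}^T)(\mathbf{R}\mathbf{S}\mathbf{L}^T) = \mathbf{L}\mathbf{S}^2\mathbf{L}^T = \mathbf{L}\boldsymbol{\Lambda}\mathbf{L}^T \tag{3}$$
with $\mathbf{S}$ as singular values and $\mathbf{L}$, $\mathbf{R}$ as left and right singular vectors of $\mathbf{X}$. Further, $\boldsymbol{\Lambda} = \mathbf{S}^2$ are the eigenvalues and $\mathbf{L}$ the eigenvectors of $\mathbf{K}$. A low-rank solution and a reduction of dimensionality is integrated into the new data matrix by sorting $\boldsymbol{\Lambda}$ and $\mathbf{L}$ in descending order with respect to $\boldsymbol{\Lambda}$ and choosing only the $l$ biggest eigenvalues and corresponding eigenvectors
$$\mathbf{X}^l = \mathbf{L}^l\mathbf{S}^l = \mathbf{X}\mathbf{R}^l \tag{4}$$
with $\mathbf{L}^l \in \mathbb{R}^{n \times l}$ and $\mathbf{S}^l \in \mathbb{R}^{l \times l}$ and $\mathbf{R}^l \in \mathbb{R}^{d \times l}$. $\mathbf{X}^l$ is the reduced target matrix, in which only the most relevant data w.r.t. variance is kept. In (3) a linear covariance or kernel is used, but non-linear kernels like the RBF kernel could be integrated as well.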
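The relationship between the SVD of a data matrix and the eigendecomposition of its linear kernel, and the resulting low-rank representation, can be checked numerically. The snippet below is an illustrative sketch on random toy data; the variable names and the choice `l = 2` are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # toy data matrix, rows are samples

# Thin SVD: X = L @ diag(s) @ Rt, singular values sorted in descending order.
L, s, Rt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of the linear kernel X X^T, reordered to descending order.
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]

# Rank-l representation: keep the l largest singular values and vectors.
l = 2
X_l = L[:, :l] * s[:l]               # left singular vectors scaled by singular values
X_l_via_R = X @ Rt[:l].T             # same result via the right singular vectors

print(np.allclose(eigvals[:4], s**2), np.allclose(X_l, X_l_via_R))
```

The nonzero kernel eigenvalues coincide with the squared singular values, and projecting onto the leading right singular vectors gives the same reduced matrix as scaling the leading left singular vectors.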
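With the insights of (3) and (4), we rewrite the optimization problem in (1) to a low-rank subspace version and state the main optimization problem:
$$\min_{\mathbf{M}} \|\mathbf{M}\mathbf{L}_s\mathbf{S}_s - \mathbf{L}_t\mathbf{S}_t\|_F^2 \quad \text{s.t.} \quad \mathbf{M}^T\mathbf{M} = \mathbf{I}, \tag{5}$$
where $\mathbf{L}_s\mathbf{S}_s$ and $\mathbf{L}_t\mathbf{S}_t$ are the low-rank representations of source and target data as in (4).
Based on domain relatedness and standardization techniques, we assume that the singular values are similar, i.e. $\mathbf{S}_s \approx \mathbf{S}_t$, and fix them. Naturally, this assumption does not always hold. See Sec. 3.2 for a discussion. If they are fixed, then the optimal solution to (5) is easily obtained by solving the linear equation $\mathbf{M}\mathbf{L}_s = \mathbf{L}_t$, which yields $\mathbf{M} = \mathbf{L}_t\mathbf{L}_s^T$. By applying $\mathbf{M}$ to (5) the source data becomes
$$\mathbf{X}_s^l = \mathbf{M}\mathbf{L}_s\mathbf{S}_s = \mathbf{L}_t\mathbf{L}_s^T\mathbf{L}_s\mathbf{S}_s \tag{6}$$
$$\mathbf{X}_s^l = \mathbf{L}_t\mathbf{S}_s \tag{7}$$
and is used for training an invariant classifier. The resulting model can be evaluated on $\mathbf{X}_t^l = \mathbf{L}_t\mathbf{S}_t$. This overrides the source basis and prevents the source subspace from being arbitrarily different from the target due to the affiliation to the target space. The solution also fulfills the constraints because $\mathbf{M}$ is an orthogonal matrix due to the orthogonal matrices $\mathbf{L}_t$ and $\mathbf{L}_s$. In particular, (7) projects the source data onto the principal components of the subspace basis of $\mathbf{X}_t$. If the data matrices $\mathbf{X}_s$ and $\mathbf{X}_t$ are standardized, the geometric interpretation is a rotation of source data w.r.t. the angles of the target basis. We call this procedure Subspace Override (SO).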
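A minimal sketch of the Subspace Override step described above: the source data keeps its own singular values but is expressed in the left singular basis of the target data. The function name `subspace_override` and the use of a thin SVD are assumptions made for illustration; the sketch also assumes equal sample sizes, as stated in the text.

```python
import numpy as np

def subspace_override(X_s, X_t, l):
    """Project source and target (equal shapes) into a shared rank-l subspace."""
    if X_s.shape != X_t.shape:
        raise ValueError("source and target must have the same shape")
    # Thin SVDs of both domains; singular values come back in descending order.
    L_s, s_s, _ = np.linalg.svd(X_s, full_matrices=False)
    L_t, s_t, _ = np.linalg.svd(X_t, full_matrices=False)
    # Source: own singular values, but the target's left basis (the override).
    X_s_l = L_t[:, :l] * s_s[:l]
    # Target: represented in its own basis.
    X_t_l = L_t[:, :l] * s_t[:l]
    return X_s_l, X_t_l
```

A classifier would then be trained on the first returned matrix together with the source labels and evaluated on the second one. Note that this full version computes a complete SVD of both domains.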
This procedure requires a complete eigenspectrum and scales to $\mathcal{O}(n^3)$ in the worst case [36]. Further, all available data is required for this approach. Using Nyström techniques, we show that only a subset of the data is required, which simultaneously reduces computational complexity and eliminates the need to examine all singular values.
3.1 Nyström Extension
For clarity, the following notation will overlap with the previous section but keeps things simple. We assume the reader is familiar with Nyström SVD techniques. Otherwise, the reader may consider Appendix B for an introduction to the Nyström approximation.
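In short, the Nyström SVD technique is a low-rank approximation which decomposes a given matrix $\mathbf{K} \in \mathbb{R}^{n \times d}$ into the constitution
$$\mathbf{K} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{bmatrix} \tag{8}$$
with $\mathbf{A} \in \mathbb{R}^{s \times s}$, $\mathbf{B} \in \mathbb{R}^{s \times (d-s)}$, $\mathbf{C} \in \mathbb{R}^{(n-s) \times s}$ and $\mathbf{D} \in \mathbb{R}^{(n-s) \times (d-s)}$. The matrix $\mathbf{A}$ contains the $s$ random samples and is called the landmark matrix. Given $\mathbf{A}$, its singular value decomposition $\mathbf{A} = \mathbf{L}\mathbf{S}\mathbf{R}^T$, and $\mathbf{B}$, $\mathbf{C}$, the full SVD of $\mathbf{K}$ is reconstructable, which is similar to the following approach.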
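The reconstruction of approximate singular vectors from the landmark block can be sketched as below. This follows a common formulation of the Nyström SVD for general rectangular matrices and is an assumption about the exact variant used in the paper; the function name `nystroem_svd` is hypothetical, and the landmark block is simply taken as the top-left `s x s` corner.

```python
import numpy as np

def nystroem_svd(K, s):
    """Approximate SVD factors of K from its top-left s x s landmark block."""
    A, B, C = K[:s, :s], K[:s, s:], K[s:, :s]
    L, sv, Rt = np.linalg.svd(A)
    R = Rt.T
    # Extend the landmark singular vectors to all rows and all columns of K.
    # Dividing by sv scales each column, i.e. right-multiplies by the inverse
    # of the diagonal singular value matrix.
    L_ext = np.vstack([L, C @ R / sv])
    R_ext = np.vstack([R, B.T @ L / sv])
    return L_ext, sv, R_ext

# Usage on a rank-3 toy matrix: with s equal to the rank (and an invertible
# landmark block) the reconstruction recovers K up to rounding error.
rng = np.random.default_rng(0)
K = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6))
L_ext, sv, R_ext = nystroem_svd(K, 3)
K_hat = (L_ext * sv) @ R_ext.T
```

Only an SVD of the small landmark block is computed; the remaining rows and columns enter through matrix products.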
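Consider $\mathbf{X}_s$ and $\mathbf{X}_t$ with the decomposition as in (8). For a Nyström SVD, we sample $s$ rows/columns from both matrices, obtaining the landmark matrices $\mathbf{A}_s = \mathbf{L}_s\mathbf{S}_s\mathbf{R}_s^T$ and $\mathbf{A}_t = \mathbf{L}_t\mathbf{S}_t\mathbf{R}_t^T$. The target data is projected into the subspace as in (4) via the Nyström technique (Appendix B) and keeps only the most relevant data structures via
$$\mathbf{X}_t^l = \tilde{\mathbf{L}}_t\mathbf{S}_t = \begin{bmatrix} \mathbf{L}_t \\ \mathbf{C}_t\mathbf{R}_t\mathbf{S}_t^{-1} \end{bmatrix}\mathbf{S}_t. \tag{9}$$
Analogously, the source data could be approximated by $\mathbf{X}_s^l = \tilde{\mathbf{L}}_s\mathbf{S}_s$. The Nyström technique is also used to approximate the solution to the optimization problem with $\tilde{\mathbf{M}} = \tilde{\mathbf{L}}_t\tilde{\mathbf{L}}_s^T$ and project the source data into the target subspace via
$$\mathbf{X}_s^l = \tilde{\mathbf{M}}\tilde{\mathbf{L}}_s\mathbf{S}_s = \tilde{\mathbf{L}}_t\mathbf{S}_s. \tag{10}$$
Hence, it is sufficient to compute only a Singular Value Decomposition (SVD) of $\mathbf{A}_s$ and $\mathbf{A}_t$ instead of $\mathbf{X}_s$ and $\mathbf{X}_t$, with $s \ll n, d$, which is considerably lower in computational complexity.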
By definition of the Nyström approximation, it is and is an orthogonal basis. Therefore, the subspace projections are orthogonal transformations and fulfill the constrains of (5).
Besides small sample requirements, the major advantage of using the approximated low-rank solution in favor of the optimal solution is that singular values that are closer to zero are set to zero, reducing the noise of the data in the subspace. Therefore the approach focuses on intrinsic data characteristics, which should lead to better classification performance.
Subsequently, this approach is denoted as Nyström Subspace Override (NSO). The matrix $\mathbf{X}_s^l$ is used for training, and $\mathbf{X}_t^l$ is used for testing. However, uniform sampling may not be optimal for Nyström in a classification task [25]. Therefore, we integrate class-wise sampling in the following. The pseudo code is shown in Algorithm 1.
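As a rough illustration of how the pieces of this section fit together, the following sketch combines uniform landmark sampling, the Nyström extension of the target's left singular vectors, and the override of the source basis. It is not the authors' Algorithm 1: the function name `nso_sketch`, the use of one shared set of landmark columns for both domains, and the omission of class-wise sampling and post-processing are all simplifying assumptions.

```python
import numpy as np

def nso_sketch(X_s, X_t, s, seed=0):
    """Uniform-sampling sketch of Nyström Subspace Override (hypothetical helper)."""
    if X_s.shape != X_t.shape:
        raise ValueError("source and target must have the same shape")
    rng = np.random.default_rng(seed)
    n, d = X_t.shape
    rows_t = rng.permutation(n)                 # first s entries are landmark rows
    cols = rng.choice(d, size=s, replace=False)  # assumption: shared landmark columns
    A_t = X_t[np.ix_(rows_t[:s], cols)]         # s x s target landmark block
    C_t = X_t[np.ix_(rows_t[s:], cols)]         # remaining rows, landmark columns
    L_t, s_t, Rt_t = np.linalg.svd(A_t)
    # Nyström extension of the target's left singular vectors to all n rows.
    L_ext = np.vstack([L_t, C_t @ Rt_t.T / s_t])
    L_ext = L_ext[np.argsort(rows_t)]           # restore the original row order
    # From the source, only the landmark singular values are needed.
    rows_s = rng.choice(n, size=s, replace=False)
    s_s = np.linalg.svd(X_s[np.ix_(rows_s, cols)], compute_uv=False)
    # Both domains share the approximated target basis.
    return L_ext * s_s, L_ext * s_t
```

The first returned matrix would be paired with the source labels for training, the second used for testing. Only two `s x s` SVDs are computed, regardless of the number of samples.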
3.2 Sampling Strategy
The standard technique to create Nyström landmark matrices is to sample uniformly or find clusters in the data [30]. In supervised classification with more than two classes, class-wise sampling should be utilized to properly include class-depending attributes of a matrix into the approximation [25]. However, a decomposition as in (21), required for Nyström SVD, is intractable with class-wise sampling, because respective matrices are non-square. Let with and landmark indices with at least one and if , then it is undefined. Therefore, we sample rows class-wise and obtain instead of , making it possible to sample from the whole range of source data. The sampling from test data is done uniformly row-wise, because of missing class information. The resulting singular value decompositions, i. e. and , are utilized for successive Nyström approximations.
However, the possible numerical range of and is naturally not the same, which is easily shown by the Gerschgorin Bound (Theorem B.1 in Appendix B.3). It scales approximated matrices different by and accurate scaling of the singular vectors cannot be guaranteed. Therefore, we apply a post-processing correction and standardize the approximated matrices to transform the data back to mean zero and variance one. The singular vectors also have an approximation error. However, both subspace projections are based on the same transformation matrix, hence making an identical error, and as a result, the error should not affect the classification.
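The post-processing correction mentioned above, standardizing the approximated matrices back to zero mean and unit variance, could look like the following. The helper name `standardize` and the small `eps` guard against constant columns are assumptions made for this sketch.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Rescale every column of X to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # eps avoids division by zero for constant columns (assumption of this sketch).
    return (X - mu) / np.maximum(sigma, eps)

# Column-wise scale differences disappear after standardization.
X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 50.0]])
Z = standardize(X)
Z_scaled = standardize(X * np.array([2.0, 5.0]))
```

Because each column is rescaled independently, a different numerical range per column in the two approximated matrices no longer matters after this step.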
The process of Nyström Subspace Override (NSO) is given in Fig. 2. The first column visualizes the samples of Nyström to create the approximated set of subspace projection operators. The second column shows the data after the subspace projection. The similarity in structure but dissimilarity in scaling, as discussed above, is visible. The last column shows the data after applying post-correction and leading to a high similarity afterward. The pseudo code of NSO is shown in Algorithm 1.
3.3 Properties of Nyström Subspace Override
The computational complexity of Nyström Subspace Override (NSO) is composed of economy-size SVD of landmark matrices and with complexity . The matrix inversion of the diagonal matrix in (9) can be neglected. The remaining matrix multiplications are of complexity and therefore contribute to the overall complexity of NSO, which is with . This makes NSO the fastest subspace domain adaptation solution in terms of computational complexity among the methods compared in Sec. 4.
The out-of-sample extension for unseen target/source samples, e. g. , is analog to (9). Based on (4), a subspace projection via (approximated) right singular vectors is also valid. Hence, a sample can be projected into the subspace via
[TABLE]
and be evaluated by an arbitrary classifier learned in the subspace.
The difference between source and target domain after SO, i. e. approximation error of source by target domain is bounded by
[TABLE]
Where is the -th singular value in descending order of and respectively and . The proof can be found in Appendix A. As in prior LS approaches [6, 28, 1], we want NSO to minimize the difference between the source and target data. In Eq. (12) is shown that NSO has a lower norm to the original data and proves that the matrices are aligned during NSO, making them numerically more similar. Note that similar matrices not necessarily indicate a good classification performance in terms of accuracy by an arbitrary classifier in a domain adaptation setting. The classification performance is evaluated in the following.
4 Experiments
We follow the experimental design typical for domain adaptation algorithms [3, 17, 10, 16, 15, 22, 21, 28, 19, 6, 1, 38]. The tests are conducted on the common datasets Reuters, Newsgroup and Office-Caltech. A crucial characteristic of datasets for domain adaptation is that domains for training and testing are different but related, e.g. sharing the same categories. The NSO approach is evaluated against the common and state-of-the-art domain adaptation methods TCA [21], GFK [10], JDA [16], SA [6], CORAL [28], EasyTL [33], SCA [7], MEDA [34] and JGSA [38]. We extend the object detection study by also evaluating against deep DA networks. We follow [34] and use Alexnet [12] as the baseline for Deep-CORAL [29], JAN [18], DAN [14] and DDC [31]. The networks are always trained on original images. The parameters of each method are determined via grid search for the best performance in terms of accuracy. In the experiments, the Support Vector Machine (SVM) uses the RBF kernel, both as a baseline and as the underlying classifier for the domain adaptation methods. All experiments follow the standard sampling protocol [18] and use all available source and target data. We did 20 test runs and summarized the result as mean accuracy.
4.1 Dataset Description
A summary of all datasets is shown in Tab. 1. All datasets have been standardized to zero mean and unit variance.
Reuters-21578 [3]: A collection of Reuters news-wire articles collected in 1987, represented as TF-IDF features. The three top categories organization (orgs), places and people are used in our experiment.
To create a transfer problem, a classifier is not tested on the same categories as it is trained on, e.g. it is trained on some subcategories of organization and people and tested on others. Six datasets are generated: orgs vs. places, orgs vs. people, people vs. orgs, places vs. orgs, people vs. places and places vs. people. They are two-class problems with the top categories as the positive and negative class and with subcategories as training and testing examples.
20-Newsgroup [15]: The original collection has approximately 20,000 text documents from 20 newsgroups, nearly equally distributed over 20 subcategories. The top four categories are comp, rec, talk and sci, each containing four subcategories. We follow the data sampling scheme introduced by [17] and generate 216 cross-domain datasets based on subcategories, which are summarized as the mean over all test runs as comp vs rec, comp vs talk, comp vs sci, rec vs sci, rec vs talk and sci vs talk.
Caltech-Office (OC) [10]: The first, Caltech (C), is an extensive dataset containing 30,607 images within 257 categories. The Office dataset is a collection of images drawn from three sources: Amazon (A), digital SLR camera (DSLR, D) and webcam (W). They vary regarding camera, lighting situation and size, but ten similar object classes, e.g. computer or printer, are extracted for a classification task. We use SURF [10] and DeCaf [4] features.
4.2 Performance Results
The results are shown separately per dataset: Newsgroup in Tab. 2, Reuters in Tab. 3, OC with SURF features in Tab. 4, and OC with DeCaf features and deep DA methods in Tab. 5. In summary, our NSO algorithm is essentially the best on Reuters and Newsgroup data. The only competitive algorithm is SA on Reuters data, with results similar to ours. SA is also an LS subspace approach. However, SA is outperformed by NSO on Newsgroup. NSO demonstrates its usefulness for the large sparse matrices found in these datasets. On the OC-SURF dataset, NSO outperforms the other methods on most tasks and has the best mean accuracy. Only with OC-DeCaf features is NSO midfield in performance, but it is still competitive. We assume that the DeCaf features form very dense feature matrices in terms of descriptive information, even if the singular values are small. Therefore, the low-rank approximation is counterproductive.
The intriguing part of this evaluation comes with the cross-task evaluation. While SA is very good at Reuters and Newsgroup, it has bad performance on OC datasets. While MEDA and JGSA have poor performance at Reuters and Newsgroup, they are good at OC datasets. Our NSO approach is in three out of four tasks the recommendable choice showing convincing task-independent performance. In Fig. 3, the parameter sensitivity is shown and demonstrates that the parameterization (number of landmarks) of NSO is stable, simple to optimize and supports the Nyström error expectation.
4.3 Time Results
The mean runtimes of the subspace DA methods in seconds are shown in Tab. 6. The deep DA methods are not presented, as their runtime is not comparable to that of the traditional methods. The experiments show that our NSO approach is the fastest algorithm, independent of the task. Compared to the recent MEDA, JGSA and CGCA, the NSO approach needs substantially less time. The related SA approach is also fast, but as theoretically derived, the override of a subspace basis approximated by Nyström leads to a boost in computational performance. In summary, the NSO approach is efficient and should be favored with regard to Green AI.
5 Conclusion
We proposed a low-rank domain approximation algorithm called Nyström Subspace Override. It overrides the source basis with the target basis, which is designed as a domain-invariant subspace projection operator. Due to the affiliation of the operator to the target space, we make sure that both domains lie in the same subspace. It requires only a subset of data from both domains and provides a subspace variant of the domain adaptation-related least-squares problem. The Nyström-based projection, paired with smart class-wise sampling, showed its reliability and robustness in this study. Validated on common domain adaptation tasks and data, it showed a convincing performance. Additionally, NSO has the lowest computational complexity and time consumption compared to the discussed solutions, which makes the approach favorable in the light of Green AI. The next step is a theoretical evaluation of the Nyström approximation error with the proposed decomposition.
Acknowledgment
We are thankful for support in the FuE program Informations- und Kommunikationstechnik of the StMWi, project OBerA, grant number IUK-1709-0011// IUK530/010.
Appendix A Proof of Subspace Override Bound
Theorem A.1.
Given two rectangular matrices with and rank of and . The norm in the subspace induced by normalized subspace projector with is bounded by
[TABLE]
Following [11], the squared Frobenius norm of the difference between two matrices can be bounded by
[TABLE]
where and is the -th singular value of the respective matrix in descending order. However, the subspace matrices and are a special case due to the subspace override of the projector , because
[TABLE]
The important fact in the right part of Eq. (16) and (17) is that we do not rely on the bound of the Frobenius inner product as in the proof for Eq. (14) [11, p. 459], because . Therefore, we can directly compute the Frobenius inner product of the diagonal matrices and , which is simply the sum of the products of the singular values. Consequently, it follows for and ,
[TABLE]
where again and .
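The argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that the inequality taken from [11] is the standard singular-value perturbation bound, i.e. that the squared Frobenius norm of a difference is at least the sum of squared differences of the ordered singular values, and it uses hypothetical matrices `M1` and `M2` that share left and right singular vectors as stand-ins for the matrices after the override. In that shared-basis case the bound is attained with equality, and the Frobenius inner product reduces to the sum of products of the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
q = min(m, n)

# Two unrelated matrices: in general only the inequality holds.
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))
sa = np.linalg.svd(A, compute_uv=False)   # singular values, descending
sb = np.linalg.svd(B, compute_uv=False)
lhs_general = np.linalg.norm(A - B, "fro") ** 2
rhs_general = np.sum((sa - sb) ** 2)

# Two matrices sharing both singular bases (hypothetical stand-ins for the
# matrices after the override): the bound becomes an equality.
U, _ = np.linalg.qr(rng.standard_normal((m, q)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((n, q)))
s1 = np.sort(rng.uniform(1.0, 2.0, q))[::-1]
s2 = np.sort(rng.uniform(0.1, 0.9, q))[::-1]
M1 = (U * s1) @ V.T                                # U diag(s1) V^T
M2 = (U * s2) @ V.T                                # U diag(s2) V^T
lhs_shared = np.linalg.norm(M1 - M2, "fro") ** 2
rhs_shared = np.sum((s1 - s2) ** 2)

# Frobenius inner product <M1, M2> equals the sum of products of singular values.
inner = np.sum(M1 * M2)

print(lhs_general >= rhs_general, np.isclose(lhs_shared, rhs_shared))
```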
Appendix B Mathematical Background
We introduce the basics of the Nyström kernel approximation in Sec. B.1, which is the foundation of the Nyström-based Singular Value Decomposition in Sec. B.2. The Nyström SVD is used to construct the approximated subspace transformation of the (Nyström) Subspace Override in Sec. 3.1.
B.1 Nyström Approximation
The computational complexity of calculating kernels or eigensystems scales with where is the sample size [36]. Therefore, low-rank approximations and dimensionality reduction of data matrices are popular methods to improve computational performance. In this scope, but not limited to it, the Nyström approximation [36] is a reliable technique to accelerate the eigendecomposition or the approximation of general symmetric matrices [8].
It computes an approximated set of eigenvectors and eigenvalues based on a usually much smaller sample matrix. The landmarks are typically picked at random, but advanced sampling concepts can be used as well [30]. The approximation is exact if the sample size is equal to the rank of the original matrix and the rows of the sample matrix are linearly independent [8]. In general, the Nyström approximation technique assumes a symmetric matrix with a decomposition of the form
[TABLE]
with , , and . The matrix is called the landmark matrix containing randomly chosen rows and columns from and has the Eigenvalue Decomposition (EVD) . The eigenvectors are and the eigenvalues are on the diagonal of . The remaining approximated eigenvectors of as part or , are obtained by the Nyström method with . Combining and the full approximated eigenvectors of are
[TABLE]
The right part of the EVD () of can be obtained via Nyström similar to (22) by
[TABLE]
Combining (22), (23) and , the matrix is approximated by
[TABLE]
The Frobenius norm gives the Nyström approximation error between the ground truth and the reconstructed matrix, i.e., , with bounds proven by [9].
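The exactness statement above can be illustrated with a small numerical sketch. It uses the common compact form of the Nyström reconstruction (landmark columns times the pseudo-inverse of the landmark matrix times the landmark rows) rather than the eigenvector notation of this section. The function name `nystroem_approx` and the variable names `C` and `A` are assumptions made for illustration. The test matrix has rank 3 and three random landmarks are drawn, so the reconstruction should be exact up to floating-point error.

```python
import numpy as np

def nystroem_approx(K, idx):
    """Reconstruct a symmetric matrix K from its landmark rows/columns idx."""
    C = K[:, idx]                  # n x s: all rows, landmark columns
    A = K[np.ix_(idx, idx)]        # s x s landmark matrix
    # The pseudo-inverse keeps the sketch usable if A is rank deficient.
    return C @ np.linalg.pinv(A, rcond=1e-10) @ C.T

rng = np.random.default_rng(0)
n, rank = 60, 3
X = rng.standard_normal((n, rank))
K = X @ X.T                        # symmetric matrix of rank 3

idx = rng.choice(n, size=rank, replace=False)   # randomly picked landmarks
K_hat = nystroem_approx(K, idx)

# Frobenius norm of the difference, relative to the norm of K.
rel_err = np.linalg.norm(K - K_hat, "fro") / np.linalg.norm(K, "fro")
print(f"relative Frobenius error: {rel_err:.2e}")
```

For matrices whose rank exceeds the number of landmarks, the same function returns an approximation only, and the error then depends on which landmarks are chosen.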
B.2 General Matrix Approximation
Another application of the Nyström method is the approximation of the Singular Value Decomposition (SVD), which generalizes the concept of matrix decomposition so that the respective matrices need not be square [20].
Let be a rectangular data matrix. Following [20], a decomposition as in (21) can be obtained. The SVD of the landmark matrix is given by where are left, and are right singular vectors. are non-negative singular values. The left and right singular vectors for the non-symmetric part and are obtained via Nyström techniques and are defined as and respectively [20]. Applying the same principle as for Nyström-EVD, is approximated by
[TABLE]
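A minimal sketch of this construction for a rectangular matrix follows. It is a standard Nyström-style extension written under stated assumptions: the landmark block is simply the leading `s x s` block, and the names `A`, `B`, `F`, `U_ext`, and `V_ext` are chosen for illustration and need not match the notation of [20]. The singular vectors of the landmark block are extended to all rows and columns by multiplying the off-diagonal blocks with the landmark singular vectors and the inverse singular values.

```python
import numpy as np

def nystroem_svd(G, s):
    """Approximate SVD factors of G from its leading s x s landmark block."""
    A, B, F = G[:s, :s], G[:s, s:], G[s:, :s]
    U, sig, Vt = np.linalg.svd(A)       # A = U diag(sig) Vt
    V = Vt.T
    U_ext = (F @ V) / sig               # F V Sigma^{-1}, remaining rows
    V_ext = (B.T @ U) / sig             # B^T U Sigma^{-1}, remaining columns
    return np.vstack([U, U_ext]), sig, np.vstack([V, V_ext])

rng = np.random.default_rng(1)
m, n, s = 40, 30, 4
G = rng.standard_normal((m, s)) @ rng.standard_normal((s, n))  # rank s

U_full, sig, V_full = nystroem_svd(G, s)
G_hat = (U_full * sig) @ V_full.T       # U diag(sig) V^T
rel_err = np.linalg.norm(G - G_hat, "fro") / np.linalg.norm(G, "fro")
print(f"relative Frobenius error: {rel_err:.2e}")
```

Note that the extended vectors are in general not orthonormal. The factors reproduce the matrix when its rank equals the number of landmarks and the landmark block is invertible, and give an approximation otherwise.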
B.3 Gerschgorin Theorem
The Gerschgorin theorem [32] provides a geometric structure that confines the eigenvalues of complex square matrices to so-called discs, and it also applies to real square matrices. The work of [23] extends the Gerschgorin circles to so-called Gerschgorin-type circles for singular values:
Theorem B.1 (Gerschgorin Type Bound for Singular Values [23]).
Given the matrix with , the singular values of are in the range of
[TABLE]
Where , and the range is defined as
[TABLE]
By using Theorem B.1, we can bound the norm of the singular values of by the square root of the squared sum of the numerical range given by
[TABLE]
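Both kinds of bounds can be checked numerically on a small square matrix. The sketch below is written under assumptions: it uses the classic row-wise Gerschgorin discs for eigenvalues, and for singular values the commonly stated square-matrix interval form, with the absolute diagonal entry as center and the larger of the off-diagonal absolute row and column sums as radius, which may differ in notation from Theorem B.1. The final comparison is one possible reading of the norm bound in the last sentence above. All function and variable names are hypothetical.

```python
import numpy as np

def gerschgorin_discs(M):
    """Centers and radii of the row-wise Gerschgorin discs of a square M."""
    centers = np.diag(M)
    radii = np.sum(np.abs(M), axis=1) - np.abs(centers)
    return centers, radii

def singular_value_intervals(M):
    """Gerschgorin-type intervals [max(a_i - s_i, 0), a_i + s_i] for square M."""
    a = np.abs(np.diag(M))
    r = np.sum(np.abs(M), axis=1) - a   # off-diagonal absolute row sums
    c = np.sum(np.abs(M), axis=0) - a   # off-diagonal absolute column sums
    s = np.maximum(r, c)
    return np.maximum(a - s, 0.0), a + s

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5)) + 4.0 * np.eye(5)

# Every eigenvalue lies in at least one disc.
centers, radii = gerschgorin_discs(M)
eigvals = np.linalg.eigvals(M)
in_disc = [bool(np.any(np.abs(lam - centers) <= radii + 1e-12)) for lam in eigvals]

# Every singular value lies in at least one interval.
lo, hi = singular_value_intervals(M)
svals = np.linalg.svd(M, compute_uv=False)
in_interval = [bool(np.any((lo - 1e-12 <= sv) & (sv <= hi + 1e-12))) for sv in svals]

# Norm of the singular values against the norm of the upper interval ends.
norm_bound_holds = np.linalg.norm(svals) <= np.linalg.norm(hi) + 1e-12
print(all(in_disc), all(in_interval), norm_bound_holds)
```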
Appendix C Component Analysis
We inspect the performance contribution of the different parts of the NSO approach. First, the exact solution to the optimization problem is called Subspace Override (SO). The approximation with uniform sampling is evaluated to study the impact of class-wise sampling on the performance. To show the efficiency of the subspace projection in the original space, we include a kernelized version in which we approximate the RBF kernels of and , respectively. The results are given in Tab. 7 and show that the Nyström approximation yields the best performance, independent of the sampling strategy. This comes from the approximation of the subspace projection, where small values are likely to be zero, which reduces noise further. The kernelized version is not recommended due to its poor performance. Overall, as proposed, the class-wise NSO is recommended because it performs slightly better.
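The difference between the two sampling strategies can be sketched as follows. This is an illustration under assumptions and not the sampling procedure of the paper: the helper names `uniform_landmarks` and `classwise_landmarks` are hypothetical, class labels are assumed to be available for the domain being sampled, and the landmark budget is split equally across classes. On imbalanced data, uniform sampling may miss small classes, whereas class-wise sampling covers every class.

```python
import numpy as np

def uniform_landmarks(n_samples, s, rng):
    """Draw s landmark indices uniformly at random without replacement."""
    return rng.choice(n_samples, size=s, replace=False)

def classwise_landmarks(labels, s, rng):
    """Draw about s landmark indices with an equal share per class."""
    classes = np.unique(labels)
    per_class = max(1, s // len(classes))
    picked = []
    for c in classes:
        members = np.flatnonzero(labels == c)   # indices of class c
        k = min(per_class, members.size)
        picked.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(picked)

rng = np.random.default_rng(3)
# Imbalanced toy labels: nine classes with 5 samples, one class with 155.
labels = np.repeat(np.arange(10), [5, 5, 5, 5, 5, 5, 5, 5, 5, 155])

idx_uniform = uniform_landmarks(labels.size, 20, rng)
idx_classwise = classwise_landmarks(labels, 20, rng)

print("classes covered (uniform):   ", np.unique(labels[idx_uniform]).size)
print("classes covered (class-wise):", np.unique(labels[idx_classwise]).size)
```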
References
- [1] Aljundi, R., Emonet, R., Muselet, D., Sebban, M.: Landmarks-based kernelized subspace alignment for unsupervised domain adaptation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). vol. 07-12-June, pp. 56–63. IEEE (Jun 2015)
- [2] Blitzer, J., Foster, D., Kakade, S.: Domain adaptation with coupled subspaces. Journal of Machine Learning Research 15, 173–181 (2011)
- [3] Dai, W., Yang, Q., Xue, G.R., Yu, Y.: Boosting for transfer learning. In: Proceedings of the 24th international conference on Machine learning - ICML ’07. pp. 193–200. ACM Press, New York, New York, USA (2007)
- [4] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. 31st International Conference on Machine Learning, ICML 2014 2, 988–996 (2014)
- [5] Elhadji-Ille-Gado, N., Grall-Maes, E., Kharouf, M.: Transfer Learning for Large Scale Data Using Subspace Alignment. In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). vol. 2018-Janua, pp. 1006–1010. IEEE (Dec 2017)
- [6] Fernando, B., Habrard, A., Sebban, M., Tuytelaars, T.: Unsupervised visual domain adaptation using subspace alignment. Proceedings of the IEEE International Conference on Computer Vision pp. 2960–2967 (2013)
- [7] Ghifary, M., Balduzzi, D., Kleijn, W.B., Zhang, M.: Scatter Component Analysis: A Unified Framework for Domain Adaptation and Domain Generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(7), 1414–1430 (Jul 2017)
- [8] Gisbrecht, A., Schleif, F.M.: Metric and non-metric proximity transformations at linear costs. Neurocomputing 167, 643–657 (2015)
