TL;DR
This paper introduces a novel local manifold approximation classifier called LOMA, which improves classification accuracy for complex, overlapping, and intersecting data supports, especially with limited training data.
Contribution
It proposes a new local approximation-based classification method, with a specific sphere-based implementation called SPA, demonstrating superior performance over existing methods.
Findings
SPA outperforms competitors on simulated data
SPA achieves substantial accuracy gains on real datasets
The method effectively handles complex, nonlinear class supports
Classification via local manifold approximation
Didong Li¹ and David B Dunson¹,²
Department of Mathematics¹ and Statistical Science², Duke University
Classifiers label data as belonging to one of a set of groups based on input features. It is challenging to obtain accurate classification performance when the feature distributions in the different classes are complex, with nonlinear, overlapping and intersecting supports. This is particularly true when training data are limited. To address this problem, this article proposes a new type of classifier based on obtaining a local approximation to the support of the data within each class in a neighborhood of the feature to be classified, and assigning the feature to the class having the closest support. This general algorithm is referred to as LOcal Manifold Approximation (LOMA) classification. As a simple and theoretically supported special case having excellent performance in a broad variety of examples, we use spheres for local approximation, obtaining a SPherical Approximation (SPA) classifier. We illustrate substantial gains for SPA over competitors on a variety of challenging simulated and real data examples.
1 Introduction
Classification is one of the canonical problems in statistics and machine learning. In the typical setting, the focus is on learning a classifier that maps features to class labels. Classifier learning relies on training data in which both the features and labels are known. This article focuses on developing methods for difficult classification problems in which the dimension of the features is not small, the amount of training data may be limited, and the distributions of the features within the different classes are non-linear, intersecting and ‘entangled’ with each other.
To clarify the conceptual challenges, we introduce some notation and two motivating toy problems; refer to Figure 1. Let denote the conditional density of the features given class label , and let denote a region of feature space having high density for data in class . Classification is a relatively easy problem when the regions can be separated with hyperplanes, as in the example in Figure 1a. In this case, standard classifiers such as logistic regression (Cox, 1958) and support vector machines (Cortes and Vapnik, 1995) tend to have excellent performance. However, it is increasingly common in modern applications to be faced with cases more similar to Figure 1b. In this example, the supports and have minimal overlap, so that accurate classification performance is conceptually possible. However, most existing algorithms are incapable of learning the classification boundaries accurately, particularly when training data are limited.
In recent years, there has been abundant focus on classifiers based on deep learning and multilayer neural networks (see, for example, Schmidhuber, 2015 and Ciresan et al., 2011). Such approaches have particularly outstanding performance in imaging and in other structured data settings. Key advantages of neural networks include their flexibility, very high learning capacity, and classification accuracy under careful tuning and architecture design with large training data sets. However, this flexibility leads to challenges in cases with limited training data. Limited data make estimation of the very many neural network parameters problematic, even with the rich variety of dimensionality reduction tools that have been developed in recent years. The architecture can potentially be simplified, but then classification accuracy can suffer, either due to a decrease in representation capacity or to over-fitting when the choice of architecture is driven by training data performance.
We propose a simple class of algorithms to overcome these problems in limited data settings. In particular, suppose we wish to estimate for some arbitrary feature vector . We first locally approximate the denoised support of the feature density within class , for each , and then set equal to the class having the minimal Euclidean distance between and the denoised support estimate. This approach is illustrated in Figure 2 for the example in Figure 1b. We refer to this general approach as LOcal Manifold Approximation (LOMA) classification.
We use the term ‘manifold’, since we mathematically assume that the denoised support within a small neighborhood of is a compact Riemannian manifold. This provides a convenient and flexible mathematical representation, which is useful for formalizing the performance of our proposed method. There is a very rich literature on manifold learning algorithms ranging from Local Linear Embeddings (LLE) (Roweis and Saul, 2000) to Diffusion Maps (Coifman and Lafon, 2006). The focus of this literature is on non-linear dimensionality reduction, replacing a relatively high-dimensional feature vector with a lower-dimensional set of coordinates while maintaining distances between pairs of points. Manifold learning can be viewed as a non-linear alternative to PCA, and can potentially be used as a first stage dimensionality reduction before applying a classification algorithm. However, our LOMA approach is fundamentally different from such two-stage approaches, and has much better performance in cases we have considered. One reason could be that we avoid any global manifold assumption, which is likely much too restrictive in many applications.
In practice, LOMA requires a specific choice of local manifold approximation. Motivated by settings with limited training data and by a desire for computational efficiency and transparency, we propose to use spheres for this purpose, obtaining a SPherical Approximation (SPA) classifier. SPA relies on the Spherical Principal Components Analysis (SPCA) algorithm developed recently by Li et al. (2018) as a key component of their Spherelets manifold learning algorithm. Other than using SPCA, the current paper has no overlap with Li et al. (2018), and the LOMA framework is completely novel to our knowledge.
In Section 2 we provide precise details of the LOMA framework, including the SPA special case, and give theoretical support in terms of asymptotic optimality theory. In Section 3 we apply SPA to the motivating application and two real datasets, and compare SPA to a rich variety of alternative classifiers, showing competitive performance. Proofs are included in the Appendix. Code for implementation of SPA is available at https://github.com/david-dunson/SPAclassifier.
2 Local manifold approximation classifier
Without loss of generality, we focus on the binary classification case for ease in exposition. Assume there are two groups of data, labeled by and . The features for example tend to be close to one manifold , while features for group tend to be close to a different manifold and both manifolds are embedded in with intrinsic dimension . For theoretical purpose in evaluating our proposed method, we assume , where is Gaussian noise and is a denoised location exactly on a manifold.
Both manifolds may be highly nonlinear and complex, having varying curvature and even gaps. In addition, the manifolds may be entangled with each other, having multiple intersections and close overlaps; such complexity naturally arises in many real world settings and presents fundamental challenges to current classifiers.
For a given test sample to be classified, we calculate the distance between and the two manifolds, denoted by and . Then, we simply assign to the group with the shorter distance. The key computational step in the LOMA algorithm is the calculation of and . In practice, the two manifolds and are unknown, but we have training data containing both and the class label , for different samples. The label is equal to one if example is in group and is equal to two if the example is in group . We can use this information to obtain accurate local approximations to the manifolds and within a neighborhood of the feature to be classified. One can potentially consider a broad variety of local approximations, obtaining different versions of LOMA classification. However, from a practical perspective, it is important to use a local approximation that (a) is parsimonious to make efficient use of limited training data and (b) leads to an analytic form for calculation of and . Local linear approximations satisfy (a)-(b) but fail to capture curvature.
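The local linear variant mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: within each class it takes the nearest neighbours of the test point, fits an affine subspace to them by PCA, and assigns the class whose subspace is closest. The function names, the arguments `k` (neighbourhood size) and `d` (subspace dimension), and the toy data are assumptions introduced for this sketch.

```python
import numpy as np


def local_affine_distance(x, points, d):
    """Distance from x to the d-dimensional affine subspace fitted to `points` by PCA."""
    mean = points.mean(axis=0)
    # Rows of vt are the principal directions of the centred neighbourhood.
    _, _, vt = np.linalg.svd(points - mean, full_matrices=False)
    basis = vt[:d]                                   # shape (d, D)
    offset = x - mean
    residual = offset - basis.T @ (basis @ offset)   # part of x outside the subspace
    return float(np.linalg.norm(residual))


def loma_linear_predict(x, X_train, y_train, k, d):
    """Assign x to the class whose local affine approximation is closest."""
    distances = {}
    for label in np.unique(y_train):
        X_class = X_train[y_train == label]
        # k nearest neighbours of x among the samples of this class only.
        order = np.argsort(np.linalg.norm(X_class - x, axis=1))[:k]
        distances[label] = local_affine_distance(x, X_class[order], d)
    return min(distances, key=distances.get), distances


# Toy data (hypothetical): class 1 lies on the line y = 0, class 2 on y = 1.
t = np.linspace(-1.0, 1.0, 21)
X_train = np.vstack([np.column_stack([t, np.zeros_like(t)]),
                     np.column_stack([t, np.ones_like(t)])])
y_train = np.array([1] * 21 + [2] * 21)

label, dist = loma_linear_predict(np.array([0.3, 0.2]), X_train, y_train, k=5, d=1)
print(label, dist)  # class 1 is closer: distance 0.2 versus 0.8
```

Because a flat subspace has no curvature, this version misjudges distances wherever the class support bends within the neighbourhood, which is the limitation noted in the text.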
With this motivation, we instead use local spherical approximations. Spheres are simple geometric objects that are easy to fit and work with, providing a generalization of hyperplanes that can dramatically improve accuracy through approximating the local curvature. The center, radius and dimension of each sphere are optimized to provide the best approximation. The distance and can then be easily calculated relying on the spherical approximation. Let be the nearest neighbors of among samples with label . We fit a sphere to these points using SPCA, obtaining as the local manifold approximation around in class . We then approximate by and choose the label as the value of having the smallest . When applied to data , SPCA produces an estimated -dimensional sphere having center and radius lying in subspace . , and are obtained by Algorithm 1. Algorithm 2 provides pseudo code to implement the SPA algorithm.
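As a rough illustration of the spherical step, the sketch below fits a sphere to a neighbourhood and measures the distance from a test point to it. Algorithm 1 is not reproduced in this text, so the fit used here is an assumption: PCA onto a (d+1)-dimensional affine subspace followed by a standard algebraic least-squares sphere fit, which may differ in detail from SPCA. All function names, the arguments `k` and `d`, and the toy data are hypothetical.

```python
import numpy as np


def fit_sphere(points, d):
    """Fit a d-dimensional sphere to the rows of `points` (assumed stand-in for SPCA)."""
    mean = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - mean, full_matrices=False)
    basis = vt[:d + 1]                        # spans a (d+1)-dimensional affine subspace
    coords = (points - mean) @ basis.T        # coordinates of the points in that subspace
    # ||u - c||^2 = r^2  is linear in (c, t) with t = r^2 - ||c||^2:
    #   2 u.c + t = ||u||^2
    A = np.hstack([2.0 * coords, np.ones((len(coords), 1))])
    b = (coords ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    centre, t = sol[:-1], sol[-1]
    radius = float(np.sqrt(max(t + centre @ centre, 0.0)))
    return mean, basis, centre, radius


def sphere_distance(x, sphere):
    """Euclidean distance from x to the fitted sphere."""
    mean, basis, centre, radius = sphere
    offset = x - mean
    u = basis @ offset
    off_subspace = np.linalg.norm(offset - basis.T @ u)    # distance to the subspace
    in_subspace = abs(np.linalg.norm(u - centre) - radius)  # distance to the sphere inside it
    return float(np.hypot(off_subspace, in_subspace))


def spa_predict(x, X_train, y_train, k, d):
    """Assign x to the class whose local spherical approximation is closest."""
    distances = {}
    for label in np.unique(y_train):
        X_class = X_train[y_train == label]
        order = np.argsort(np.linalg.norm(X_class - x, axis=1))[:k]
        distances[label] = sphere_distance(x, fit_sphere(X_class[order], d))
    return min(distances, key=distances.get), distances


# Toy data (hypothetical): two concentric circles in the plane z = 0 of R^3.
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)


def circle(r):
    return np.column_stack([r * np.cos(angles), r * np.sin(angles), np.zeros_like(angles)])


X_train = np.vstack([circle(1.0), circle(3.0)])
y_train = np.array([1] * 60 + [2] * 60)

label, dist = spa_predict(np.array([1.2, 0.0, 0.0]), X_train, y_train, k=8, d=1)
print(label, dist)
```

The sketch needs at least `d + 2` neighbours per class for the least-squares system to be determined, which is one practical reason the neighbourhood size cannot be arbitrarily small.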
SPA is designed to be simple and directly targeted to detect differences in non-linear support across groups, leading to substantial gains when training data are limited in size. SPA can learn quickly with fewer training data and requires no manual tuning. The only tuning parameters are , the size of the local neighborhoods, and , the dimension of the manifold approximating the denoised support of the data. can be set to be a default value dependent on to avoid tuning, while is easy to tune automatically, being a small integer. This tuning can be done on a subset of the training data, adding a negligible computational cost. Also, SPA has theoretical guarantees on classification performance as training sample size grows, given by the following two theorems, which correspond to clean data and noisy data respectively.
Theorem 1.
Let and correspond to two compact Riemannian manifolds. Assume and , . Given a test sample with true label , let be the predicted label obtained by the SPA classifier, then
[TABLE]
In Theorem 1, the data in class are assumed to take values in , which is a compact Riemannian manifold without noise. The overall density of the data across the classes is . The theorem shows that as the training sample size increases, the probability the algorithm produces exactly the correct class label is eventually greater than , where is the probability assigns to the intersection region between and .
As a corollary, when , the limit is one. This means that SPA will have perfect classification performance for sufficiently large training sample sizes as long as the classes are geometrically separable or the intersection region has measure zero. Theorem 2 considers the noisy case, which is more realistic in most applications.
Theorem 2.
Let and correspond to two compact Riemannian manifolds. Assume and , , , where . Given a test sample with true label , let be the predicted label obtained by the SPA classifier. Let and , then
[TABLE]
In Theorem 2, the data in class are distributed around with Gaussian noise. The theorem shows that as the training sample size increases, the probability the algorithm produces the wrong class label is asymptotically bounded by
[TABLE]
As the noise level decays to zero, , let so the first term converges to . Since , the second term converges to [math], so the bound in Theorem 2 coincides with the bound in Theorem 1. This is not surprising since Theorem 1 is a special case of Theorem 2, that is, the case when the noise is zero. The quantity can be viewed as the “signal-to-noise” ratio in this setting. The larger , the better the performance of the SPA classifier.
3 Numerical Examples
Funky Curves
We first consider the example from Figure 1b. For , we let , where is a randomly generated point on a highly non-linear curve and is a zero mean Gaussian noise. Half of the data are reserved as a test set and a proportion of the other half is used in training. The curves are entangled and overlapping, making the classification problem highly challenging. Most classification algorithms are completely unable to deal with entangled, overlapping and intersecting non-linear supports. SPA is specifically designed to easily accommodate this problem, and hence should outperform other algorithms in this and related settings. Complex black box algorithms, such as deep neural networks, can eventually do well, but require much larger training sample sizes. Figure 3 plots accuracy against training sample size for several competing algorithms, including Complex Trees (Quinlan, 1986), Fine KNN (Cover and Hart, 1967), Fine Gaussian SVM (Crammer and Singer, 2001) and Deep Neural Networks (Schmidhuber, 2015). These were chosen as the best from among dozens of competitors. The plots show that our SPA algorithm has the highest accuracy, which is over when the sample size is only . In addition, SPA can produce excellent performance with very limited training data. This is important because in many applied domains, training data are very expensive and one must make do with small samples. SPA thrives in such settings, beating competing classifiers when training data are limited. As sample size increases, carefully tuned Deep Neural Networks will slowly close the gap. Unfortunately, outside of certain specialized settings, labeled training data are a very limited and valuable resource.
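The data-generating scheme and the train/test protocol described above can be sketched as follows, assuming additive noise on points drawn from each curve. The two curves below (a circle and a figure-eight that crosses it) are hypothetical stand-ins, not the curves of Figure 1b, and the function names, `noise_sd`, `train_fraction` and `seed` are introduced only for this illustration.

```python
import numpy as np


def simulate_noisy_curves(n_per_class, noise_sd, seed=0):
    """Draw points on two intersecting curves and add zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 2.0 * np.pi, size=(2, n_per_class))
    # Hypothetical curves, chosen only because they intersect.
    z1 = np.column_stack([np.cos(t[0]), np.sin(t[0])])              # class 1: unit circle
    z2 = np.column_stack([1.5 * np.sin(t[1]), np.sin(2.0 * t[1])])  # class 2: figure-eight
    Z = np.vstack([z1, z2])
    X = Z + rng.normal(scale=noise_sd, size=Z.shape)  # noisy feature = curve point + noise
    y = np.repeat([1, 2], n_per_class)
    return X, y


def split_half_then_subsample(X, y, train_fraction, seed=0):
    """Reserve half the data for testing; train on a fraction of the other half."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    half = len(X) // 2
    test_idx, pool_idx = perm[:half], perm[half:]
    n_train = max(1, int(round(train_fraction * len(pool_idx))))
    train_idx = pool_idx[:n_train]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


X, y = simulate_noisy_curves(n_per_class=100, noise_sd=0.05)
X_tr, y_tr, X_te, y_te = split_half_then_subsample(X, y, train_fraction=0.2)
print(X_tr.shape, X_te.shape)
```

Varying `train_fraction` while keeping the test half fixed gives an accuracy-versus-training-size curve of the kind the text describes for Figure 3.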
USPS Digits
We also test our SPA classifier on the USPS digits dataset, one of the standard datasets used for evaluating image classification algorithms. Each sample is an image with pixels in gray scale . There are samples and classes representing digits . When the training sample size is only of the entire set, which is , the average sample size within each class is only while the dimension of each sample is 256. Clearly this small sample size is far from enough for complicated algorithms, for example Convolutional Neural Networks (CNN; Ciresan et al., 2011), the most popular algorithm in image classification. Figure 4 shows the accuracy plot, which matches our expectation. When training sample size is only , SPA has accuracy about , higher than competitors.
SPA works well for this example because although the images have large dimension (256), they actually lie close to a lower-dimensional subspace. SPA discovers this lower-dimensional geometric structure and simplifies the problem. This ability to learn simple structure in outwardly complex data is a major advantage over black box classifiers, such as DNNs and random forests.
Libras Movement
The last dataset we consider is the Libras Movement dataset that is available in the UCI Machine Learning Repository. The entire dataset has classes with instances for each class, so the total sample size is . Each class represents a type of hand movement in LIBRAS, the official Brazilian sign language. At each time the -d coordinates of the centroid of the hand are recorded and there are such records so each movement is represented by a dimensional vector. As previously, we reserve half the data for testing and use the other half as training samples. We do not vary the training sample size since there are too few samples in this example. For instance, if we use samples to train our model, there will be only or samples within each class while the dimension of each sample is . Figure 5 shows that SPA is the only algorithm with accuracy greater than among many other popular algorithms including KNN, Kernel SVM, Subspace KNN and Deep Neural Networks. In this setting, the training sample size is only , that is, there are only samples within each class while the dimension of the ambient space is , making this problem even more challenging than the USPS dataset, where the corresponding dimensions are . The intrinsic dimension in the SPA classifier is chosen to be in this example, substantially decreasing the complexity and difficulty of this problem. The intuition for is that since each class represents a certain type of hand movement, which is a -dimensional curve, it is reasonable to assume that .
4 Discussion
The LOMA framework is quite broad and there are multiple promising directions for future research. The first is to allow the neighborhood sizes and/or the manifold dimension to vary. This variation can be either local, according to the point that you want to classify, or global across classes. Allowing to be varying significantly adds to the flexibility of the approach. In practice, it is unlikely that most real world datasets have an exact manifold structure even up to iid Gaussian measurement error. Even with a fixed , LOMA is only using the manifold assumption as a local approximation, but still the support may be more complex in certain regions of the feature space than others and for certain classes. In such cases, there are potential gains to incorporating a varying . However, hurdles include the increased computation and potential need for larger training data for use in tuning . As cross validation is trivially parallelized, the computational barrier is not a major obstacle and one may fit local approximations quickly for different choices of at each and combine these approximations in an ensemble learning algorithm, avoiding selection of a single manifold dimension.
An additional direction is to explore alternatives to spheres to develop other special cases of LOMA that may have better performance in certain contexts. Spheres are remarkably successful at providing a simple modification to hyperplanes that can allow both positive and negative curvature. However, the performance of SPA will decrease when the curvature varies significantly in different directions; for example, imagine that the point x is on the unit cylinder, where the normal curvature ranges from 0 to 1. To accommodate varying curvature, one may use quadratic surfaces as local manifold approximations. This leads to an increase in the number of parameters needed within each local neighborhood, so that may need to increase, and it remains to be seen whether efficient computational algorithms can be developed.
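The cylinder example can be made concrete with Euler's formula for normal curvature. The notation below (the angle \theta, the principal curvatures \kappa_1 and \kappa_2, the normal curvature \kappa_n, and the radius r) is introduced here for illustration and is not taken from the paper.

```latex
% Unit cylinder: principal curvatures \kappa_1 = 1 (around the axis)
% and \kappa_2 = 0 (along the axis).
% Normal curvature in the tangent direction at angle \theta from the
% circular direction, by Euler's formula:
\kappa_n(\theta) = \kappa_1 \cos^2\theta + \kappa_2 \sin^2\theta
                 = \cos^2\theta \in [0, 1].
% A sphere of radius r has the same normal curvature in every direction:
\kappa_n^{\mathrm{sphere}}(\theta) = \frac{1}{r} \quad \text{for all } \theta .
```

No single radius can therefore match the cylinder in both the circular direction (curvature 1) and the axial direction (curvature 0), which is why a local sphere fits such a surface poorly.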
Finally, it is important to generalize LOMA to more complex settings. Several examples include high-dimensional features in which is very large, settings in which features are collected dynamically over time, and cases in which features are not real-valued. To accommodate high-dimensional cases, one straightforward extension is to modify SPCA to use sparse PCA (Zou et al., 2006) within Step 1 of Algorithm 1. To allow features that are not real numbers, one can instead rely on a distance metric between pairs of features, chosen to be appropriate to the scale of the data, modifying Algorithms 1 and 2 appropriately.
Acknowledgments
The authors acknowledge support for this research from the Office of Naval Research grant N000141712844.
Appendix
Proof of Theorem 1
Proof.
Assume , so but by the compactness of . Let be the estimate of obtained by the SPA classifier. From Corollary 3 in Li et al., (2018), we know that in probability as for so Recalling the definition of , we know that . Hence we conclude that A similar equation holds for all . Combining the above two situations, we conclude that as long as , the prediction is correct asymptotically, so we have the desired result
[TABLE]
∎
Proof of Theorem 2
Proof.
Without loss of generality, assume so . By similar argument as in the proof of Theorem 1, we have
[TABLE]
Then observe that for any ,
[TABLE]
Then we consider the second term . Recall that , assume and , then
[TABLE]
while
[TABLE]
As a result, . Since , . For convenience, define , then the tail probability can be controlled by Chernoff’s inequality:
[TABLE]
The right hand side is maximized by , so we have the final upper bound for the desired tail probability:
[TABLE]
∎
References
- Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2011). Flexible, high performance convolutional neural networks for image classification. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1237. Barcelona, Spain.
- Coifman, R. R. and Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30.
- Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
- Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27.
- Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological), pages 215–242.
- Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292.
- Li, D., Mukhopadhyay, M., and Dunson, D. B. (2018). Efficient manifold and subspace approximations with spherelets. arXiv preprint arXiv:1706.08263.
- Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.
