Rare geometries: revealing rare categories via dimension-driven statistics
Henry Kvinge, Elin Farnell, Jingya Li, Yujia Chen

TL;DR
This paper introduces a new supervised learning algorithm that uses a dimension-driven statistic called the kappa-profile to detect rare classes in data, requiring few labeled examples and handling both separable and non-separable cases.
Contribution
The paper proposes a novel dimension-based statistic and a classification algorithm specifically designed for rare-category detection with minimal labeled data.
Findings
Effective detection of rare classes with few labels
Invariant to translation, working on separable and non-separable classes
Demonstrates improved performance over existing methods
Henry Kvinge (corresponding author), Jingya Li, Yujia Chen
Department of Mathematics, Colorado State University, Fort Collins, CO 80523-1874, USA
Elin Farnell
Amazon, Seattle, WA, USA
Elin Farnell’s contributions to this work were completed while she was a research scientist in the Department of Mathematics at Colorado State University.
Abstract
In many situations, classes of data points of primary interest also happen to be those that are least numerous. A well-known example is detection of fraudulent transactions among the collection of all financial transactions, the vast majority of which are legitimate. These types of problems fall under the label of ‘rare-category detection.’ There are two challenging aspects of these problems. The first is a general lack of labeled examples of the rare class and the second is the potential non-separability of the rare class from the majority (in terms of available features). Statistics related to the geometry of the rare class (such as its intrinsic dimension) can be significantly different from those for the majority class, reflecting the different dynamics driving variation in the different classes. In this paper we present a new supervised learning algorithm that uses a dimension-driven statistic, called the kappa-profile, to classify whether unlabeled points belong to a rare class. Our algorithm requires very few labeled examples and is invariant with respect to translation so that it performs equivalently on both separable and non-separable classes.
Index Terms:
Machine learning, rare-category detection, geometric data analysis, secant-based dimensionality reduction.
I Introduction
Rare-category detection is a common problem in real-world settings where it is often the case that classes that are most important to identify are least well-represented in available datasets. In such cases, we may have only a handful of labeled examples of the rare class even though we have many labeled examples of the majority class. Typical examples include identification of financial fraud, malicious insiders in an organization, fraudulent transactions, rare diseases, and unusual objects that are not explained by current models in astronomy; see, e.g. [1, 2, 3, 4, 5].
Past approaches have included use of a mixture model [2]. Other works use the density of unlabeled points around a point known to belong to the rare class in order to detect a cluster from the rare class. For example, in [4] a local-density-differential-sampling strategy was used. In [6], on the other hand, a graph-based approach using similarity matrices was used to capture changes in density.
One key distinction between the rare-category detection problem and the problem of, for example, finding outliers of a dataset is that a rare class is generally not assumed to be separable from the majority class (at least in the given features). This makes classification challenging, and we are forced to rely heavily on the small number of labeled points which we know belong to the rare class. When points from such rare classes are not separable from majority classes, they often have other characteristics that can help us to identify whether an unlabeled point is likely to belong to the rare class. In this paper we are interested in cases where rare and majority classes have different geometry or “shape.” Such a situation is plausible when the distribution and variance of the majority and rare classes are driven by different processes. In the most extreme (but not unusual) case, two classes will have different geometries if they have different intrinsic dimensions.
We use a statistic called the $\kappa$-profile [7], which can be calculated for a set of points $D \subset \mathbb{R}^n$. Very roughly, the $\kappa$-profile measures how well $D$ can be projected into a range of subspaces of varying dimension in such a way as to best satisfy an optimization problem (1). Among other things, it is an effective tool for estimating the intrinsic dimension of a dataset. In our algorithm, which we call the $\kappa$-detection algorithm, we compare the $\kappa$-profile of a cluster of rare class points against the $\kappa$-profile of the same cluster with the inclusion of an unlabeled point, and use this comparison as a metric by which to determine whether that unlabeled point belongs to the rare class. The underlying assumption is that even if an unlabeled point is very close to a cluster of rare class points, if it does not agree with the geometry of this cluster then it is probably not a point from the rare class. Because the $\kappa$-detection algorithm draws heavily from characteristics related to the intrinsic dimension of a dataset, we expect it to actually perform better on data sampled from a higher ambient dimension, because in higher dimensions there is likely more dimension-related information to extract.
This paper is structured as follows. In Section II we summarize background information on secant-based dimensionality-reduction algorithms, the concept of a $\kappa$-profile, and the context and set of assumptions we make in the rare-category detection problem. In Section III we present the $\kappa$-detection algorithm and a method for determining a key threshold parameter in this algorithm. In Section IV we apply the $\kappa$-detection algorithm to both real and synthetic examples. Finally, in Section V we describe some future directions.
II Background
II-A Secant-based dimensionality reduction
In most applications, a reasonable dimensionality-reduction algorithm should preserve the distance between two data points in their ambient space. Such a goal can be equivalently stated as the requirement that the secant set of a dataset $D \subset \mathbb{R}^n$ be preserved during dimensionality reduction. In this paper we will choose to work with the normalized secant set $S$,
$$S := \left\{ \frac{x - y}{\|x - y\|_{\ell_2}} \;\middle|\; x, y \in D,\ x \neq y \right\}.$$
The purpose of normalization is to give equal footing to both large- and small-scale structure. When working with real data, it is often useful to discard the very smallest secants as these are most affected by noise.
The dimensionality-reduction algorithms which underlie this paper all attempt to solve the secant-based optimization problem:
$$P^* = \underset{P \in \mathrm{Proj}(n,k)}{\arg\max}\ \min_{s \in S} \|P^T s\|_{\ell_2}. \tag{1}$$
Here $\mathrm{Proj}(n,k)$ is the collection of all $n \times k$ matrices (with $k \leq n$) whose columns are orthonormal. Note that this set is equivalent to the set of orthogonal $k$-projections from $\mathbb{R}^n$ to $\mathbb{R}^k$. Roughly, (1) attempts to find the projection onto a $k$-dimensional subspace such that the length of the secant which is least well-preserved is maximized. This is in contrast to principal component analysis (PCA), for example, which solves a different optimization problem. As a result, a solution to (1) frequently differs from the corresponding PCA solution. Problem (1) is closely tied to the intrinsic dimension of $D$ via the constructive proof of the Whitney Embedding Theorem from differential topology [8, Theorem 6.15]. This fact will be a key aspect of the algorithm proposed in this paper.
There are fast, lightweight, iterative algorithms that converge to local optima for (1) [9], [10]. For the experiments in this paper we used the SAP algorithm from [9]. The SAP algorithm is well-adapted to working with rare categories because it scales well to high dimensional data though less well to large numbers of points. In the case of rare categories the latter is not an issue by assumption.
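To make the objective in (1) concrete, the following NumPy sketch builds a normalized secant set and evaluates the worst-preserved secant length for a fixed projection. It is only an illustration under stated assumptions: the function names `normalized_secants` and `worst_secant_norm`, the `min_length` cutoff, and the toy circle data are hypothetical and not taken from the paper, and the code does not implement the SAP algorithm of [9].

```python
import numpy as np

def normalized_secants(points, min_length=0.0):
    """Unit secants (x - y)/||x - y|| over unordered pairs x != y.

    Using unordered pairs is enough here because s and -s have the
    same projected length. `min_length` is a hypothetical knob for
    discarding the very shortest secants.
    """
    points = np.asarray(points, dtype=float)
    i, j = np.triu_indices(len(points), k=1)
    diffs = points[i] - points[j]
    lengths = np.linalg.norm(diffs, axis=1)
    keep = lengths > min_length
    return diffs[keep] / lengths[keep, None]

def worst_secant_norm(secants, P):
    """Inner objective of (1) for a fixed P with orthonormal columns:
    the smallest value of ||P^T s|| over all secants s."""
    return np.linalg.norm(secants @ P, axis=1).min()

# Toy data (an assumption for illustration): a circle lying in the
# (x, y)-plane of R^3.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=40)
circle = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
S = normalized_secants(circle)

P_plane = np.eye(3)[:, :2]  # projects onto the plane containing the circle
P_bad = np.eye(3)[:, 1:]    # projects onto the (y, z)-plane instead
print(worst_secant_norm(S, P_plane))  # close to 1: no secant is shortened
print(worst_secant_norm(S, P_bad))    # smaller: some secants nearly collapse
```

A solver for (1) would search over `P` to maximize `worst_secant_norm`; here two hand-picked projections are compared only to show what the objective measures.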
II-B The $\kappa$-profile
In this section we review the concept of $\kappa$-values and $\kappa$-profiles. The $\kappa$-profile statistic will be the primary tool that we will use to determine whether an unlabeled point conforms to the known geometry or “shape” of a rare class.
The notion of a $\kappa$-value was first defined in [11]. Such values arise from solutions to (1). Specifically, let $P^*$ be the projection that satisfies (1) for some projection dimension $k$; then the $\kappa$-value $\kappa_k$ is defined as
$$\kappa_k := \min_{s \in S} \|(P^*)^T s\|_{\ell_2}.$$
Note that because we assume that the elements of our secant set have been normalized, it is always the case that $0 \leq \kappa_k \leq 1$. Suppose that $k_1 < k_2 < \dots < k_\ell$ is an increasing sequence of integers such that $1 \leq k_i \leq n$ for each $i$ (we will generally assume that the $k_i$’s are consecutive but they need not be). Then the $\kappa$-profile associated with $D$ is the tuple of values
$$\kappa(D) := (\kappa_{k_1}, \kappa_{k_2}, \dots, \kappa_{k_\ell}).$$
In analogy to the singular values produced when applying PCA, the $\kappa$-profile tells us something about how well our dataset can be projected into lower-dimensional spaces. The information provided by the $\kappa$-profile, however, is more sensitive to the intrinsic dimension of the dataset (see Section II.C in [7]).
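As a rough illustration of what a $\kappa$-profile looks like in code, the sketch below estimates one with a deliberately crude stand-in for a real solver: it scores a PCA-derived orthonormal basis and a number of random orthonormal bases, and keeps the best value found for each projection dimension. This is an assumption made for the sake of a short, runnable example. It is not the SAP algorithm of [9], it only yields lower bounds on the true $\kappa$-values, and the names `kappa_profile_estimate` and `n_random` are hypothetical.

```python
import numpy as np

def normalized_secants(points):
    """Unit secants over all unordered pairs of distinct points."""
    points = np.asarray(points, dtype=float)
    i, j = np.triu_indices(len(points), k=1)
    diffs = points[i] - points[j]
    lengths = np.linalg.norm(diffs, axis=1)
    keep = lengths > 0
    return diffs[keep] / lengths[keep, None]

def kappa_profile_estimate(points, n_random=200, seed=0):
    """Crude lower-bound estimate of (kappa_1, ..., kappa_n).

    Each candidate is a full orthonormal basis Q of R^n; its first k
    columns serve as the candidate projection for dimension k.
    """
    S = normalized_secants(points)
    n = S.shape[1]
    rng = np.random.default_rng(seed)
    # PCA-style basis of the secants, most significant direction first.
    _, V = np.linalg.eigh(S.T @ S)
    bases = [V[:, ::-1]]
    # Random orthonormal bases from QR factorizations of Gaussian matrices.
    bases += [np.linalg.qr(rng.standard_normal((n, n)))[0]
              for _ in range(n_random)]
    profile = np.zeros(n)
    for Q in bases:
        # Column k-1 holds ||Q[:, :k]^T s|| for every secant s.
        norms = np.sqrt(np.cumsum((S @ Q) ** 2, axis=1))
        profile = np.maximum(profile, norms.min(axis=0))
    return profile

# Toy data (an assumption): a circle embedded in a plane of R^3.
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, size=30)
circle = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
profile = kappa_profile_estimate(circle)
print(profile)  # second entry is near 1: the data fit in a 2-dimensional subspace
```

Because every candidate uses nested column sets, the estimated profile is non-decreasing in the projection dimension and ends at a value of approximately 1, matching the bounds on $\kappa$-values noted above.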
In Figure 1 we plot the $\kappa$-profiles for points drawn from several different manifolds (shown as solid curves in the figure), where all manifolds are smoothly mapped into $\mathbb{R}^n$. Specifically, in Figure 1, we show
- the $\kappa$-profile for a set of points drawn randomly from a $2$-dimensional torus mapped smoothly into $\mathbb{R}^n$,
- the $\kappa$-profile for a set of points drawn randomly from the real projective plane $\mathbb{RP}^2$ and mapped smoothly into $\mathbb{R}^n$,
- the $\kappa$-profile for a set of points drawn randomly from the 3-sphere and mapped smoothly into $\mathbb{R}^n$,
- the $\kappa$-profile for a set of random Gaussian noise in $\mathbb{R}^n$.
As can be seen, the relationship between the $\kappa$-profiles in this figure reflects the intrinsic dimension of the manifolds from which each set of points was drawn. The torus and $\mathbb{RP}^2$ are both 2-dimensional manifolds and this is reflected by the fact that the $\kappa$-values for the associated sets of points grow the fastest. On the other hand, the $\kappa$-values for the 3-sphere (which is a 3-dimensional manifold) grow more slowly. Finally, the $\kappa$-values for the set of points drawn from the multivariate Gaussian distribution in $\mathbb{R}^n$ grow the slowest, reflecting the fact that the intrinsic dimension of this dataset really is $n$. In general, the $\kappa$-profile for a set of points with lower intrinsic dimension should sit above the $\kappa$-profile for a set of points with higher intrinsic dimension.
Because the $\kappa$-profile is sensitive to changes in dimension, if we sample a point randomly from the ambient space and add it to any of the manifold point sets, we should expect the $\kappa$-profile of the enlarged set to be noticeably different from the original $\kappa$-profile. To see this, compare the solid and dashed lines of each color in Figure 1. The dashed lines are exactly the $\kappa$-profiles of the enlarged sets. As can be seen, just adding a single point which lies off the original manifold gives a $\kappa$-profile that better matches that of Gaussian noise rather than the original $\kappa$-profile for the manifold.
II-C The rare-category detection problem
In our version of the rare-category detection problem, we assume that we are given a dataset
$$D = M \cup R \cup U,$$
where $M$ consists of labeled points known to belong to a majority class, $R$ consists of labeled points known to belong to a rare class, and $U$ consists of unlabeled points. The goal of our algorithm will be to classify whether each point in $U$ belongs to the rare class or the majority class.
The algorithm that we describe in Section III is designed to cope with two of the major challenges of this problem.
1. The number of training points belonging to the rare class $R$, which we train on, is potentially very small.
2. The classes may not be separable in the given features.
III The $\kappa$-detection algorithm
In this section we describe the $\kappa$-detection algorithm, which utilizes the $\kappa$-profile described in Section II-B in order to classify unlabeled points as either belonging to the majority class or the rare class.
The basic idea behind this algorithm is that one way to gauge whether an unlabeled point $u$ belongs to a rare class is whether its inclusion in the rare class substantially changes the class’s geometry. Our proxy statistic to detect whether $u$ “changes the geometry” of the rare class is the $\kappa$-profile. Specifically, we assume we have been handed a set $R$ of labeled rare class points and a set $U$ of points that we want to classify as either belonging to the rare class or not. We start by calculating the $\kappa$-profile $\kappa(R)$ of $R$ without including any unlabeled points. We next iterate through each point $u \in U$ and calculate the $\kappa$-profile $\kappa(R \cup \{u\})$ of $R \cup \{u\}$. We calculate the change in the $\kappa$-profile as the norm of the difference between the two profiles,
$$\delta_u := \|\kappa(R) - \kappa(R \cup \{u\})\|. \tag{2}$$
If $\delta_u$ is below a user-specified threshold $\tau$ we label $u$ as a point in the rare class; otherwise we label it as a majority point. The algorithm as a whole is outlined in Algorithm 1.
There are several parameters which need to be tuned for Algorithm 1 to perform well. The first is the threshold, $\tau$. In Section III-A we propose a data-driven algorithm for determining $\tau$. Of course, the appropriate choice of threshold may differ depending on the application. In some applications it is more important to avoid false positives while in other situations false negatives are worse. In the former case we should pick a smaller threshold and in the latter we should pick a larger threshold.
The optimization problem (1) is non-convex and therefore one is unlikely to actually find the global solution $P^*$. Which local maximum is found depends on the initial projection that is used as well as the step-size in the SAP algorithm (for a discussion of these parameters, see [9]). Because the SAP algorithm is relatively fast, for more accurate approximations of the $\kappa$-profiles of $R$ and of $R \cup \{u\}$, one can choose to compute each of these $m$ times, where $m$ is a parameter chosen by the user; then we take the pointwise average of the $\kappa$-profiles and call these $\bar{\kappa}(R)$ or $\bar{\kappa}(R \cup \{u\})$ respectively. In the experiments in this paper we generally took .
Finally, this paper rests on the basic assumption that a class of points in a dataset approximately sits on a $d$-dimensional manifold embedded in $\mathbb{R}^n$. This identification is never exact because of noise in the dataset. In terms of secants, this noise will have a much more significant effect on short secants. For this reason, when applying Algorithm 1 to real data we advocate discarding the shortest secants when calculating $\kappa$-profiles.
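The detection loop described in this section can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: `profile_stub` is a hypothetical PCA-based stand-in for a real $\kappa$-profile solver such as SAP [9], the Euclidean norm is assumed for measuring the change in profile, and the names `kappa_detect` and `threshold` as well as the toy data are invented for the example.

```python
import numpy as np

def profile_stub(points):
    """Hypothetical stand-in for a kappa-profile solver.

    Scores only the PCA basis of the normalized secants, so it gives
    lower bounds on the kappa-values rather than solving (1).
    """
    X = np.asarray(points, dtype=float)
    i, j = np.triu_indices(len(X), k=1)
    diffs = X[i] - X[j]
    S = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    _, V = np.linalg.eigh(S.T @ S)
    norms = np.sqrt(np.cumsum((S @ V[:, ::-1]) ** 2, axis=1))
    return norms.min(axis=0)

def kappa_detect(rare, unlabeled, threshold, profile_fn=profile_stub):
    """Flag each unlabeled point whose inclusion barely changes the profile."""
    rare = np.asarray(rare, dtype=float)
    base = profile_fn(rare)
    deltas = np.array([
        # Euclidean norm is an assumption made for this sketch.
        np.linalg.norm(base - profile_fn(np.vstack([rare, u])))
        for u in np.asarray(unlabeled, dtype=float)
    ])
    return deltas, deltas < threshold

# Toy data (an assumption): the rare class lies on a line in R^3.
t = np.linspace(0.0, 1.0, 8)
rare = np.column_stack([t, np.zeros_like(t), np.zeros_like(t)])
unlabeled = np.array([
    [0.37, 0.0, 0.0],  # on the line: agrees with the rare geometry
    [0.37, 0.5, 0.0],  # spatially close, but off the line
])
deltas, is_rare = kappa_detect(rare, unlabeled, threshold=0.1)
print(deltas)
print(is_rare)
```

In this toy case the on-line point leaves the profile unchanged, while the off-line point lowers the first entry of the profile substantially, so only the first point is flagged as rare.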
III-A Algorithmic determination of thresholds
The $\kappa$-detection algorithm requires a threshold value $\tau$ which determines the extent to which a point is allowed to alter the $\kappa$-profile of the rare class before we say that this point is not an element of the rare class. We have left this as a parameter to be tuned by the user because the choice between a higher or lower threshold should be based on the application.
We do, however, present an algorithm, Algorithm 2, that generates a rough threshold (which can be further refined to the specific dataset) based on what is already known about the rare class. The idea of the algorithm is that one should try to understand how much variation in $\kappa(R)$ is introduced by each point $r$ which we already know belongs to $R$. If we find that on average, for each $r \in R$, the $\kappa$-profile for $R \setminus \{r\}$ is significantly different from the $\kappa$-profile for $R$, then we should not be surprised that for $u$ belonging to the rare class, the $\kappa$-profile of $R \cup \{u\}$ might differ significantly from the $\kappa$-profile for $R$.
Algorithm 2 begins by calculating the $\kappa$-profile for the labeled points from the rare class, $\kappa(R)$. Next we iterate through all $r \in R$ and for each $r$ we calculate the $\kappa$-profile of $R \setminus \{r\}$. We set
$$d_r := \kappa(R) - \kappa(R \setminus \{r\}).$$
Finally we take the pointwise average of $d_r$ over all $r \in R$ and call this vector $\bar{d}$. We have found that in practice, a good starting threshold is where . By assumption $R$ contains few points, so Algorithm 2 is generally fast.
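The leave-one-out procedure of Algorithm 2 can be sketched in the same style. Several pieces here are assumptions rather than the paper's specification: the per-point vector is taken to be the plain difference of profiles, the final threshold is taken to be a multiple `scale` of the Euclidean norm of the averaged vector (the source's exact formula is not reproduced here), and `profile_stub` is again a hypothetical PCA-based stand-in for a real $\kappa$-profile solver.

```python
import numpy as np

def profile_stub(points):
    """Hypothetical PCA-based stand-in for a kappa-profile solver."""
    X = np.asarray(points, dtype=float)
    i, j = np.triu_indices(len(X), k=1)
    diffs = X[i] - X[j]
    S = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    _, V = np.linalg.eigh(S.T @ S)
    norms = np.sqrt(np.cumsum((S @ V[:, ::-1]) ** 2, axis=1))
    return norms.min(axis=0)

def leave_one_out_vector(rare, profile_fn=profile_stub):
    """Pointwise average of profile(R) - profile(R without r) over r in R."""
    rare = np.asarray(rare, dtype=float)
    base = profile_fn(rare)
    diffs = [base - profile_fn(np.delete(rare, idx, axis=0))
             for idx in range(len(rare))]
    return np.mean(diffs, axis=0)

def starting_threshold(rare, scale=1.0, profile_fn=profile_stub):
    """Assumed form: scale times the Euclidean norm of the averaged vector."""
    return scale * np.linalg.norm(leave_one_out_vector(rare, profile_fn))

# Toy rare class (an assumption): a noisy circle in R^3.
rng = np.random.default_rng(2)
t = rng.uniform(0, 2 * np.pi, size=12)
rare = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
rare += 0.01 * rng.standard_normal(rare.shape)

v = leave_one_out_vector(rare)
tau = starting_threshold(rare, scale=2.0)
print(v, tau)
```

Since the rare class is small by assumption, the loop recomputes only a handful of profiles, which is why a procedure of this kind stays cheap.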
III-B Limitations
The $\kappa$-detection algorithm is not without limitations. The most obvious of these is that our algorithm demands at least enough labeled rare class points to estimate the geometry of the rare class. In particular, the number of rare class points must be at least equal to (and preferably more than) , where $d$ is the dimension of the manifold on which $R$ approximately sits. The value of $d$ is generally not known; however, in practice we have found that is a safe assumption for all but the largest and most varied classes. In general one can obtain a reasonably good estimate of the $\kappa$-profile of a dataset even from small subsamples. We conjecture that this is related to the fact that the number of secants grows quadratically as a function of the number of points in our sample.
The second potential limitation of Algorithm 1 is that it will not perform well when the underlying geometry is the same for different classes. Such a phenomenon can arise in cases where class distinctions are artificial. Imagine, for example, that we are trying to label the integer age of adults in a dataset via their physical measurements. The age of 34 might be a rare class, but the underlying dynamics that relate the physical characteristics of a person to their age are probably not very different between individuals who are 33 and those who are 34. On the other hand, the distinction between protein-binding locations on E. coli cells is not artificial and therefore it would not be surprising if the features for different binding locations encode different geometries. See Section IV-B for performance of the $\kappa$-detection algorithm on a dataset fitting this description.
Because the $\kappa$-detection algorithm utilizes features of data (its geometry, for example) that to our knowledge are not used in other rare-category detection algorithms, we believe that it will function particularly well as part of an ensemble of methods. We further expect that by including the strengths of other approaches, the limitations described above will be minimized.
IV Real and synthetic examples
In this section we describe the performance of the $\kappa$-detection algorithm on both synthetic and real data.
IV-A A synthetic example
We begin by applying the $\kappa$-detection algorithm to a simple synthetic dataset. This dataset is the union of two sets of points: a majority class $M$ and a rare class $R$. The set $M$ is itself the union of 6 sets $M_1, \dots, M_6$, where $M_i$ is a set of points drawn from a multivariate normal distribution centered at the origin whose covariance matrix is diagonal, with one constant value everywhere on the diagonal except for a different value in the entry at $(i,i)$.
The rare class $R$ is formed from random points drawn from the trigonometric moment curve:
$$\phi(t) = (\cos t, \sin t, \cos 2t, \sin 2t, \dots, \cos mt, \sin mt).$$
A projection of points sampled from $\phi$ is shown in Figure 2. As can be seen, the image of $\phi$ is indeed intrinsically $1$-dimensional. A projection of the whole dataset into 3 dimensions is shown in Figure 3. The blue points are from the majority class and the red points are from the rare class. In this case, for ease of viewing, we have not differentiated between points which are labeled and those which are not.
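A dataset of this shape can be generated along the following lines. All numeric choices in this sketch (ambient dimension 6, three harmonics, the variances, and the sample sizes) are assumptions chosen for illustration and are not the values used in the experiments; the helper names `trig_moment_curve` and `anisotropic_gaussians` are likewise hypothetical.

```python
import numpy as np

def trig_moment_curve(t, harmonics):
    """Points (cos t, sin t, cos 2t, sin 2t, ..., cos mt, sin mt) in R^(2m)."""
    t = np.asarray(t, dtype=float)
    angles = np.outer(t, np.arange(1, harmonics + 1))
    out = np.empty((len(t), 2 * harmonics))
    out[:, 0::2] = np.cos(angles)
    out[:, 1::2] = np.sin(angles)
    return out

def anisotropic_gaussians(n_per_set, dim, base_var, spike_var, rng):
    """Union of `dim` zero-mean Gaussian samples with diagonal covariance.

    In the i-th set every diagonal entry is `base_var` except the
    (i, i) entry, which is `spike_var`.
    """
    sets = []
    for i in range(dim):
        var = np.full(dim, base_var)
        var[i] = spike_var
        sets.append(rng.standard_normal((n_per_set, dim)) * np.sqrt(var))
    return np.vstack(sets)

rng = np.random.default_rng(3)
# Rare class: random parameters on the curve (intrinsically 1-dimensional).
rare = trig_moment_curve(rng.uniform(0, 2 * np.pi, size=50), harmonics=3)
# Majority class: six anisotropic Gaussian clouds filling the ambient space.
majority = anisotropic_gaussians(100, dim=6, base_var=0.1, spike_var=1.0, rng=rng)
print(rare.shape, majority.shape)
```

Every point of the curve has squared norm equal to the number of harmonics, since each cosine and sine pair contributes exactly 1, which is a quick sanity check on the sampler.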
In this synthetic example, the intrinsic dimension of the rare class is $1$ (it is a curve), while the majority class, being Gaussian, has the intrinsic dimension of the full ambient space. Given this significant difference in dimension, we would expect that a random point from the majority class would (even if it is very close to points from the rare class spatially) with high probability be off of the curve and therefore on average result in a large change in the $\kappa$-profile. This is what we see in Figure 4. Here we have plotted a histogram for a range of $\delta_u$ values (see (2)) when $u$ is actually a point from the rare class (orange) or when $u$ is a point from the majority class (blue). Clearly, if one were to set a suitable threshold $\tau$, one would be able to separate points that come from the rare class and points that come from the majority class reasonably well.
This synthetic example is also useful for illustrating why solving the optimization problem (1) is essential to the performance of the algorithm. Superficially, it might seem that the $\kappa$-profile could be replaced by singular values. After all, both of these statistics measure how well data can be projected into different dimensions. We plot the singular values for the two classes in Figure 5. Observe that these curves look very similar despite the fact that the rare class is drawn from a $1$-dimensional manifold and the majority class is drawn from a manifold of higher dimension. On the other hand, the $\kappa$-profiles in Figure 6 look quite distinct and reflect the differences in dimension between the two classes. It is easy to see why replacing the $\kappa$-profile in Algorithm 1 with singular values would severely limit the algorithm’s ability to detect geometry. Only the $\kappa$-profile can detect the dimension despite the non-linearities of the datasets.
IV-B Real datasets
In this section we apply the $\kappa$-detection algorithm to four real-world datasets. The results are summarized in Table 7. Each dataset consists of a number of different imbalanced classes. In each setting, we did the following.
- We chose a class with relatively few points and called this the ‘rare class.’ We designated the union of all of the rest of the classes as the ‘majority class.’
- For all datasets other than the shuttle dataset, where a fixed threshold was used, we used Algorithm 2 to determine a threshold. When applying Algorithm 2, we set where .
- We ran the $\kappa$-detection algorithm repeatedly for each dataset, each time with a new random partition of the rare and majority classes into labeled and unlabeled points. In Table 7 we record the average percentage of unlabeled rare class points that the algorithm correctly identified, as well as the average percentage of unlabeled majority class points that the algorithm misidentified as rare.
As can be seen, very few labeled rare class points (the number depending on the dataset) were required to achieve reasonable classification results. Note that in each case, one could improve the values in the second column of Table 7 by increasing the threshold parameter, at the expense of increasing the number of majority points misclassified as rare in the third column. Our implementation was intended to be balanced with respect to this trade-off. Finally, the reader should keep in mind that because the classes are imbalanced in all of these datasets, the percentages in the second and third columns of Table 7 can correspond to very different absolute numbers of points.
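The two percentages reported per dataset, and the caveat about absolute counts, can be illustrated with a short sketch. The function name `detection_rates` and the toy labels are hypothetical and do not correspond to any of the datasets in the paper.

```python
import numpy as np

def detection_rates(flagged_rare, is_rare):
    """Percent of rare points found and percent of majority points misflagged."""
    flagged = np.asarray(flagged_rare, dtype=bool)
    truth = np.asarray(is_rare, dtype=bool)
    rare_found = 100.0 * flagged[truth].mean()
    majority_misflagged = 100.0 * flagged[~truth].mean()
    return rare_found, majority_misflagged

# Toy imbalanced example (an assumption): 4 rare and 100 majority points.
is_rare = np.array([True] * 4 + [False] * 100)
flagged = np.zeros(104, dtype=bool)
flagged[:3] = True     # 3 of the 4 rare points are flagged
flagged[4:14] = True   # 10 of the 100 majority points are flagged

rare_found, majority_misflagged = detection_rates(flagged, is_rare)
print(rare_found, majority_misflagged)
print(int(flagged[is_rare].sum()), int(flagged[~is_rare].sum()))
```

In this toy case a high percentage of rare points found corresponds to only 3 points, while a much lower percentage of misflagged majority points corresponds to 10 points, which is the imbalance effect the paragraph above warns about.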
Below we provide additional analysis and discussion of algorithm performance for three of the four datasets.
IV-B1 The E. coli dataset
The E. coli dataset (https://archive.ics.uci.edu/ml/datasets/ecoli) [12], [13] consists of different classes which are related to protein localization sites on E. coli cells. The classes in this dataset vary in size, with the largest including 143 points. We choose to study the smaller class with label ‘om’, which only contains 20 points. Figure 8 shows a projection of the data set with the majority class (all points other than those in class ‘om’) labeled with blue circles and all those in the rare class ‘om’ labeled with red triangles. From this projection of the data it appears that the rare class may very approximately sit on a 2-dimensional surface while the majority class does not. This makes this rare class a good candidate for $\kappa$-detection. The observation that the class ‘om’ is intrinsically close to 2-dimensional is reinforced by the $\kappa$-plots for the different classes in this dataset shown in Figure 9. This plot gives further evidence that ‘om’ has lower intrinsic dimension than other classes. Notice, for example, that the first values in the $\kappa$-profile of ‘om’ are larger than those of all the classes other than those with very few points (the number of points in the class is given in the legend).
A histogram of $\delta_u$ values for one random choice of the labeled rare class points $R$ is shown in Figure 10. In contrast to the synthetic example, here we see that the two classes of points are not separable in the histogram. However, we do see a concentration of points from the rare class with low $\delta_u$ value.
IV-B2 The page block dataset
The page block dataset (https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification) [14, 13] consists of data points corresponding to blocks in page layouts from distinct documents. The coordinates of the points are features related to each particular block: height, length, percentage of black pixels in the block, etc. Each point is labeled by one of five different types of content contained in the block: ‘text’, ‘horizontal line’, ‘graphic’, ‘vertical line’, or ‘picture’. We chose the ‘horizontal line’ class as our rare class and took all other classes together to be the majority class. As can be seen from the $\kappa$-profiles (calculated from a subset of this dataset) in Figure 11, the small classes ‘graphic’, ‘vertical line’, and ‘picture’ are all likely -dimensional (whether this is because these classes are intrinsically -dimensional or we simply don’t have enough data points to estimate the dimension is unknown). The rare class ‘horizontal line’ appears to be to -dimensional while the largest class ‘text’ is at least -dimensional.
The PCA projection of the rare class, Figure 13, suggests that there are two data points that are outliers which disturb the approximate -dimensionality of this dataset. After excluding these points from the set of labeled rare class points, the percentage of rare class points identified went from 70% to 77%, while the percentage of majority class points misidentified as rare went from 40% to 27%. This is an example of the effect that labeled outlier rare class points can have on the performance of the algorithm.
We show a representative histogram of $\delta_u$ values for a single run of this dataset trained on rare-class points. We note that unlike some of the other datasets, the rare and majority classes of the page block dataset were not separable in terms of the range of values of $\delta_u$. Nevertheless, by discarding all unlabeled points with corresponding $\delta_u$ value above a well-chosen threshold, one can significantly reduce the pool of potential rare class points that must be further evaluated even when only a very small number of labeled rare class points are known.
IV-B3 The glass dataset
The glass dataset (https://archive.ics.uci.edu/ml/datasets/glass+identification) [13] consists of data points that each represent a sample of glass. The classes are given by the sample’s use (window or non-window glass, for example) and whether it was float processed or not. The features are chemical and physical properties of the sample. We chose the class ‘float processed vehicle window glass’ for our rare category. This class contains 17 points.
Of all the datasets that we tested our algorithm on, this dataset had the highest rate of misclassification of majority class points as rare class points. A representative histogram of $\delta_u$ values for one run of $\kappa$-detection on this dataset is shown in Figure 14. Here we see that a significant number of the unlabeled majority points are situated such that their inclusion into the rare class minimally disturbs its $\kappa$-profile. This is supported by inspection of this dataset projected using PCA (Figure 15). At least in this projection, it appears that a cluster of points from the majority class sits along the same approximate 2-dimensional surface on which the rare class sits. This is another reminder that the $\kappa$-detection algorithm is limited by the geometry of the points it is given. If the data manifolds for two classes coincide in a significant way, we should expect limited classification accuracy for points in those regions.
V Conclusion
In this paper we propose a new approach to the rare-category detection problem, which, given a small set of labeled points from a rare category, finds others based on geometric/dimensionality considerations.
There are a number of directions which would be interesting to explore in the future.
- Visual inspection indicates that many of the errors made by the $\kappa$-detection algorithm are due to noise. At the moment the only tool we have applied to address this is to discard small secants. It would be useful to develop more sophisticated methods.
- Are there more appropriate norms for measuring change in the $\kappa$-profile? Are there certain coordinates in the $\kappa$-profile that we should pay particular attention to?
- How well does the $\kappa$-detection algorithm contribute to ensemble techniques in the case when majority and rare classes have the same underlying dimension?
- While in this paper we focus on using $\kappa$-detection to make a binary classification of an unlabeled point as belonging to a rare class or not, one could also use the $\delta_u$ value directly. In future work, we plan to investigate how the $\delta_u$ values can be used to calculate probabilities that the point belongs to a given data manifold.
References
[1] P. Dokas, L. Ertoz, V. Kumar, A. Lazarevic, J. Srivastava, and P.-N. Tan, “Data mining for network intrusion detection,” in Proc. NSF Workshop on Next Generation Data Mining, 2002, pp. 21–30.
[2] D. Pelleg and A. W. Moore, “Active learning for anomaly and rare-category detection,” in Advances in Neural Information Processing Systems, 2005, pp. 1073–1080.
[3] S. Bay, K. Kumaraswamy, M. G. Anderle, R. Kumar, and D. M. Steier, “Large scale detection of irregularities in accounting data,” in Data Mining, 2006. ICDM’06. Sixth International Conference on. IEEE, 2006, pp. 75–86.
[4] J. He and J. G. Carbonell, “Nearest-neighbor-based active learning for rare category detection,” in Advances in Neural Information Processing Systems, 2008, pp. 633–640.
[5] J. He, H. Tong, and J. Carbonell, “Rare category characterization,” in Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010, pp. 226–235.
[6] J. He, Y. Liu, and R. Lawrence, “Graph-based rare category detection,” in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 833–838.
[7] H. Kvinge, E. Farnell, M. Kirby, and C. Peterson, “Monitoring the shape of weather, soundscapes, and dynamical systems: a new statistic for dimension-driven data analysis on large datasets,” in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 1045–1051.
[8] J. M. Lee, “Smooth manifolds,” in Introduction to Smooth Manifolds. Springer, 2013, pp. 1–31.
