Efficient estimation of AUC in a sliding window
Nikolaj Tatti

TL;DR
This paper introduces an efficient algorithm for approximating the AUC in a sliding window over data streams, significantly reducing computation time while maintaining high accuracy, which is crucial for real-time system monitoring.
Contribution
The paper presents a novel grouping-based algorithm for approximate AUC estimation in sliding windows, achieving faster updates with controlled error margins.
Findings
The proposed method reduces update time from O(k) to O((log k)/ε).
Experimental results show the approximation error is often much smaller than the theoretical bound.
Significant speed-ups are achieved with only a modest decrease in accuracy.
| Dataset | Size of training dataset | Size of test dataset |
|---|---|---|
| Hepmass | | |
| Miniboone | | |
| Tvads | | |
F-Secure, Helsinki, Finland
Abstract
In many applications, monitoring area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant.
In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of length $k$. More specifically, we propose an algorithm that, given $\epsilon$, estimates AUC within $\epsilon / 2$, and can maintain this estimate in $\mathcal{O}((\log k) / \epsilon)$ time, per update, as the window slides. This provides a speed-up over the exact computation of AUC, which requires $\mathcal{O}(k)$ time, per update. The speed-up becomes more significant as the size of the window increases. Our estimate is based on grouping the data points together, and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough, so that the error stays small, (ii) the number of groups is small, so that enumerating them is not expensive, and (iii) the definition is flexible enough so that we can maintain the groups efficiently.
Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee $\epsilon / 2$, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.
Keywords: AUC, approximation guarantee, sliding window
1 Introduction
Consider monitoring prediction performance in a stream of data points. That is, we first receive a data point without the label, and we predict the missing label with a score of $s$; after the prediction we receive the true label $\ell$. We are interested in monitoring how well $s$ predicts $\ell$ as the stream evolves over time.
A good example of such a task is a monitoring system for corporate computers that detects abnormal behavior based on event logs. Here the positive label represents an abnormal event that requires a closer inspection, and such a label can be given, for example, by an expert or triggered automatically. The produced score can be used for decision making, and can be a specific feature or a simple statistic, or the result of some classifier, such as logistic regression. It is vital to monitor such a system continuously to notice breakdowns early. Possible causes may be changes in the underlying distribution or a system failure due to a software update.
A natural choice to monitor the predictive power of a real-valued score is the area under the ROC curve (AUC) in a sliding window over the stream of events as proposed by Brzezinski and Stefanowski [5]. Unfortunately, maintaining the exact AUC requires $\mathcal{O}(k)$ time, per new event, where $k$ is the size of the window. This may be too expensive if $k$ is large and the rate of the events is significant.
In this paper we propose a technique for estimating AUC efficiently in a sliding window. Namely, we propose an approximation scheme that has an $\epsilon / 2$ approximation error guarantee while having $\mathcal{O}((\log k) / \epsilon)$ update time. That is, the scheme provides a trade-off between the accuracy and computational complexity.
Our approach is straightforward. Computing AUC exactly requires sorting data points and summing over all data points (see Eq. 1 for the exact formula). Maintaining points sorted can be done using binary search trees. However, estimating the sum requires additional tricks. We approach the problem by grouping neighboring data points together, that is, treating them as if the classifier had given them the same score.
The key step is to design a grouping such that 3 properties hold at the same time: (i) the groups are small enough so that the relative error is small, more specifically, at most $\epsilon / 2$, (ii) the number of groups is small enough, more specifically, it should be in $\mathcal{O}((\log k) / \epsilon)$, and (iii) the definition should be flexible enough so that we can do quick updates whenever points arrive or leave the sliding window.
Roughly speaking, in order to accommodate all 3 demands, we will maintain the groups with the two following properties: (i) the number of positive labels in a group is at most $\epsilon$ times the total number of positive labels in all the previous groups, (ii) the number of positive labels in a group and the next group combined is larger than $\epsilon$ times the total number of positive labels in all the previous groups. The first property will yield the approximation guarantee, while the second property guarantees that the number of groups remains small. Moreover, these properties are flexible enough so we can perform update procedures quickly.
The rest of the paper is organized as follows. We begin by recalling the definition of AUC in Section 2. Updating the groups of data points quickly requires several auxiliary structures, which we introduce in Section 3. We then proceed to describe AUC estimation in Section 4. The related work is given in Section 5. In Section 6, we demonstrate that the relative error in practice is much smaller than the guaranteed bound, and study the trade-off between the error and the computational cost. Finally, we conclude the paper with a discussion in Section 7.
2 Preliminaries
We start with the definition of AUC, and provide a formula for computing it.
Assume that we are given a set of $k$ pairs $W = \{(s_i, \ell_i)\}_{i=1}^{k}$, where $\ell_i$ is the true label of the $i$th instance, either positive or negative, and $s_i$ is the score produced by the classification algorithm. The larger $s_i$, the more we believe that $\ell_i$ should be negative. (Footnote 1: We chose this direction due to the notational convenience.)
In order to predict a label, we need a threshold $\sigma$, and predict that $\ell_i$ is negative if $s_i$ is above $\sigma$, and positive otherwise. The ROC curve is obtained by varying $\sigma$ and plotting true positive rate as a function of false positive rate. AUC is the area under the ROC curve. To compute AUC, we can use the following formula. Let
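$$p(\sigma) = |\{i : s_i = \sigma,\ \ell_i \text{ is positive}\}| \quad\text{and}\quad n(\sigma) = |\{i : s_i = \sigma,\ \ell_i \text{ is negative}\}|$$
be the counts of labels with a score of $\sigma$. Define also $hp(\sigma) = \sum_{\tau < \sigma} p(\tau)$. Then,
$$\mathit{auc} = \frac{1}{A} \sum_{\sigma} \Big(hp(\sigma) + \frac{p(\sigma)}{2}\Big)\, n(\sigma), \tag{1}$$
where $A = \big(\sum_{\sigma} p(\sigma)\big)\big(\sum_{\sigma} n(\sigma)\big)$ is the normalization factor. Eq. 1 can be computed in $\mathcal{O}(k \log k)$ time by first sorting $W$, computing $hp(\sigma)$, and enumerating over the sum of Eq. 1.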
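As a minimal sketch of the exact computation described above, the following Python function sorts the distinct scores, keeps a running count of the positive labels seen so far, and accumulates the sum of Eq. 1. It assumes the convention that positive labels are expected to receive the lower scores, so each negative entry is credited with the positives scored strictly below it plus half of the positives tied with it. The name `exact_auc` and the boolean label encoding are illustrative choices, not taken from the paper.

```python
from collections import Counter


def exact_auc(pairs):
    """Exact AUC of (score, is_positive) pairs, recomputed from scratch.

    Assumed convention: a positive entry should score lower than a
    negative one; ties between a positive and a negative count one half.
    """
    p = Counter()  # p[s]: number of positive labels with score s
    n = Counter()  # n[s]: number of negative labels with score s
    for score, is_positive in pairs:
        if is_positive:
            p[score] += 1
        else:
            n[score] += 1

    total_p = sum(p.values())
    total_n = sum(n.values())
    if total_p == 0 or total_n == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    hp = 0      # positives with a strictly smaller score than the current one
    acc = 0.0
    for score in sorted(set(p) | set(n)):  # sorting dominates the running time
        acc += (hp + p[score] / 2) * n[score]
        hp += p[score]
    return acc / (total_p * total_n)
```

Calling this function after every slide of the window is the baseline that the rest of the paper tries to avoid.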
In a streaming setting, $W$ is a sliding window, and our goal is to compute AUC as $W$ slides over a stream of predictions and labels.
3 Supporting data structures for estimating AUC
In this section we introduce supporting data structures that are needed to compute AUC in a streaming setting. Additional structures and the actual logic for computing AUC are given in the next section. We begin by describing the data structures, then follow with introducing the needed query operations, and finally finish with explaining the update procedures.
3.1 Data structures
Assume that we have a sequence of pairs $W = ((s_i, \ell_i))$, where $s_i$ is the score produced by the classifier, and $\ell_i$ is the true label.
We store $W$ in a red-black tree $T$ sorted by the scores. Let $v$ be a node in $T$. We will denote the corresponding score of $v$ by $s(v)$. We store and maintain the following information:
- Counter $p(v)$, number of pairs in $W$ with a score $s(v)$ and a positive label.
- Counter $n(v)$, number of pairs in $W$ with a score $s(v)$ and a negative label.
- Counter $\mathit{accp}(v)$, the total sum of $p(u)$, where $u$ ranges over all descendant nodes of $v$ in $T$, including $v$ itself.
- Counter $\mathit{accn}(v)$, the total sum of $n(u)$, where $u$ ranges over all descendant nodes of $v$ in $T$, including $v$ itself.
For simplicity, we will add two sentinel nodes to $T$. The first node will have a score of $-\infty$ and the second node has a score of $\infty$. We will assume that the actual entries will never achieve these values. Both sentinel nodes have 0 positive labels and 0 negative labels.
Note that if the scores are unique, then we have either $p(v) = 1$, $n(v) = 0$, or $p(v) = 0$, $n(v) = 1$. However, if there are duplicate scores, then we may have any integer combinations.
In addition to red-black trees, we need to maintain several linked lists, for which we will now introduce the notation. Assume that we are given a subset $U$ of nodes in $T$. We would like to maintain $U$ in a linked list $L$, sorted by the score. For that we will need two pointers for each node $v \in U$, namely, $\mathit{next}(v; L)$ indicating the next node in $L$, and $\mathit{prev}(v; L)$ indicating the previous node in $L$. Let $v \in U$ and assume that $w = \mathit{next}(v; L)$ exists. Let
$$B = \{u \in T : s(v) \le s(u) < s(w)\}$$
be the set of nodes in between $v$ and $w$. We define
$$gp(v; L) = \sum_{u \in B} p(u) \quad\text{and}\quad gn(v; L) = \sum_{u \in B} n(u)$$
to be the total sums of the labels in the gap $B$. We will refer to $L$ as a weighted linked list. Note that deleting an element from $L$ and maintaining the gap counters can be done in constant time. Moreover, adding a new element, say $v$, to $L$ after $u$ can also be done in constant time, if we already know the total sums of labels, say $p$ and $n$, between $u$ and $v$.
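The constant-time maintenance of the gap counters can be sketched as follows. This is an illustration under the assumption that the gap of a node covers the node itself and every tree node up to, but excluding, its successor in the list; the class `ListNode` and the functions `insert_after` and `delete` are hypothetical names, not the paper's routines.

```python
class ListNode:
    """One element of a weighted linked list, with its two gap counters."""

    def __init__(self, score):
        self.score = score
        self.prev = None
        self.next = None
        self.gp = 0  # positive labels in the gap that starts at this node
        self.gn = 0  # negative labels in the gap that starts at this node


def insert_after(u, v, p, n):
    """Insert v right after u in O(1).

    p and n are the label totals of the entries that stay in u's gap,
    that is, the entries scored below v.
    """
    v.gp, v.gn = u.gp - p, u.gn - n  # v takes over the rest of the old gap
    u.gp, u.gn = p, n
    v.prev, v.next = u, u.next
    if u.next is not None:
        u.next.prev = v
    u.next = v


def delete(v):
    """Unlink v (which must have a predecessor) in O(1)."""
    u = v.prev
    u.gp += v.gp  # the predecessor's gap absorbs v's gap
    u.gn += v.gn
    u.next = v.next
    if v.next is not None:
        v.next.prev = u
    v.prev = v.next = None
```

Both operations touch only the node and its neighbours, which is why the sums between $u$ and $v$ must be known before the insertion.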
We say that the node $v$ is positive, if $p(v) > 0$. Similarly, we say that the node $v$ is negative, if $n(v) > 0$. Note that $v$ can be both negative and positive.
We maintain all positive nodes in a weighted linked list, which we will refer to as $P$. Finally, we also store all positive nodes in their own dedicated red-black tree, denoted by $TP$. For simplicity, we also store the sentinel nodes of $T$ in $P$ and $TP$ as the first and the last nodes.
3.2 Query procedures
The first query that we need returns, given a score $s$, the positive node $v$ with the largest score such that $s(v) \le s$. This can be done in $\mathcal{O}(\log k)$ time using $TP$, where $k$ is the number of elements in the window.
Maintaining $\mathit{accp}$ and $\mathit{accn}$ allows us to query cumulative sums of counts. Specifically, given a score $s$, we are interested in
$$hp(s) = \sum_{v :\, s(v) < s} p(v) \quad\text{and}\quad hn(s) = \sum_{v :\, s(v) < s} n(v). \tag{2}$$
We can compute both of these sums with HeadStats, given in Algorithm 1.
The algorithm assumes that there is a node in $T$ containing $s$, and proceeds to find it; during the search, whenever we take the right branch we add the accumulated sums from the left branch. We omit the trivial proof of correctness. Since the tree is balanced, the running time of HeadStats is $\mathcal{O}(\log k)$, where $k$ is the number of entries in the window.
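The head-sum query can be sketched on a search tree that stores per-subtree totals. The sketch below is not the paper's Algorithm 1: it uses a plain, unbalanced binary search tree in place of a red-black tree (so the logarithmic bound holds only if the tree happens to be balanced), and the names `Node`, `insert` and `head_stats` are assumptions made for illustration.

```python
class Node:
    def __init__(self, score):
        self.score = score
        self.p = 0      # positive labels with exactly this score
        self.n = 0      # negative labels with exactly this score
        self.accp = 0   # positive labels in the whole subtree
        self.accn = 0   # negative labels in the whole subtree
        self.left = None
        self.right = None


def insert(root, score, positive):
    """Add one entry and return the subtree root (no rebalancing)."""
    if root is None:
        root = Node(score)
    # Every node on the search path gains one entry in its subtree.
    if positive:
        root.accp += 1
    else:
        root.accn += 1
    if score == root.score:
        if positive:
            root.p += 1
        else:
            root.n += 1
    elif score < root.score:
        root.left = insert(root.left, score, positive)
    else:
        root.right = insert(root.right, score, positive)
    return root


def head_stats(root, score):
    """Return (hp, hn): label counts over entries scored strictly below score."""
    hp = hn = 0
    v = root
    while v is not None:
        left_p = v.left.accp if v.left else 0
        left_n = v.left.accn if v.left else 0
        if score == v.score:
            return hp + left_p, hn + left_n
        if score < v.score:
            v = v.left
        else:
            # Going right: the left subtree and v itself are all below score.
            hp += left_p + v.p
            hn += left_n + v.n
            v = v.right
    return hp, hn  # score not stored in the tree; the sums are still valid
```

Only the nodes on one root-to-leaf path are visited, so the cost is proportional to the height of the tree.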
3.3 Update procedures
We now continue to the maintenance procedures as we slide the window. This comes down to two procedures: (i) removing an entry from the window and (ii) adding an entry to the window.
We will first describe removing an entry with a positive label and a score $s$. First we will find the node, say $v$, with the score $s$, and reduce the counter $p(v)$ by 1. We will need to update the $\mathit{accp}$ counters. However, we only need to do it for the ancestors of $v$, and there are only $\mathcal{O}(\log k)$ of them, where $k$ is the number of entries in the window, since $T$ is balanced. We also reduce $gp(v; P)$ by 1. In the process, $v$ may become non-positive, and we need to delete it from $P$ as well as from $TP$.
Finally, if $p(v) = n(v) = 0$, we need to delete the node from $T$. This may result in rebalancing of the tree, and during the balancing we need to make sure that the counters $\mathit{accp}$ and $\mathit{accn}$ are properly updated. Luckily, the red-black tree balancing is based on left and right rotations. During these rotations it is easy to maintain the counters without additional costs.
We will refer to this procedure as RemoveTreePos, and the pseudo-code is given in Algorithm 2. RemoveTreePos runs in $\mathcal{O}(\log k)$ time.
Deleting an entry with a negative label and a score $s$ is simpler. First, we find the node, say $v$, with the score $s$, and reduce the counter $n(v)$ by 1. If needed, we delete $v$ from $T$. Finally we use the first query of Section 3.2 to find $u$, the largest positive node with $s(u) \le s$, and reduce $gn(u; P)$ by 1. The procedure, referred to as RemoveTreeNeg, runs in $\mathcal{O}(\log k)$ time.
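Why rotations keep the subtree counters cheap to maintain can be seen from a small sketch. It shows a left rotation only (the right rotation is symmetric), leaves out the colouring logic of a red-black tree, and uses hypothetical names (`Node`, `refresh`, `rotate_left`).

```python
class Node:
    def __init__(self, score, p=0, n=0, left=None, right=None):
        self.score, self.p, self.n = score, p, n
        self.left, self.right = left, right
        self.refresh()

    def refresh(self):
        """Recompute the subtree totals from own counts and the children."""
        children = [c for c in (self.left, self.right) if c is not None]
        self.accp = self.p + sum(c.accp for c in children)
        self.accn = self.n + sum(c.accn for c in children)


def rotate_left(v):
    """Lift v's right child w above v; return w, the new subtree root."""
    w = v.right
    v.right, w.left = w.left, v
    v.refresh()  # v is now the lower node, so its totals are fixed first
    w.refresh()  # w's totals then include the refreshed v
    return w
```

Only the two rotated nodes change their set of descendants, so two constant-time refreshes suffice and the totals of all other nodes stay valid.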
Next, we will describe the addition of a positive entry with a score $s$. First, we will add the entry to $T$, possibly creating a new node in the process. Let $v$ be the node in $T$ with the score $s$.
If $v$ is a new node, then we need to add it to the weighted linked list $P$. First, we find the node, say $u$, after which $v$ is supposed to be added. We need to compute the new gap counter $gn(u; P)$. By definition, this value is equal to the total count of negative labels of nodes between $u$ and $v$, including $u$. Thus, this new gap counter is equal to $hn(s(v)) - hn(s(u))$. Both counters can be obtained using HeadStats in $\mathcal{O}(\log k)$ time.
We will refer to this procedure as AddTreePos, and the pseudo-code is given in Algorithm 3. AddTreePos runs in $\mathcal{O}(\log k)$ time.
Adding an entry with a negative label and a score $s$ is simpler. First, we will add the entry to $T$, possibly creating a new node in the process. Let $v$ be the node in $T$ with a score $s$. Then, we use the first query of Section 3.2 to find $u$, the largest positive node with $s(u) \le s$, and increase $gn(u; P)$ by 1. The procedure, referred to as AddTreeNeg, runs in $\mathcal{O}(\log k)$ time.
4 Estimating AUC efficiently
In order to approximate AUC, we will use Eq. 1 as a basis. However, instead of enumerating over every node we will enumerate only over some selected nodes. The key is how to select the nodes such that we will obtain the approximation guarantee while keeping the number of nodes small.
We will maintain a weighted linked list $C$. Given $\epsilon > 0$, we say that $C$ is $\epsilon$-compressed, if for every two consecutive nodes in $C$, say $v$ and $w$, it holds that
$$hp(w) \le (1 + \epsilon)\,\big(hp(v) + p(v)\big), \tag{3}$$
and if $u = \mathit{next}(w; C)$ exists, then
$$hp(u) > (1 + \epsilon)\,\big(hp(v) + p(v)\big), \tag{4}$$
where we write $hp(v)$ as a shorthand for $hp(s(v))$.
Eq. 3 will yield the approximation guarantee, while Eq. 4 will guarantee the running time.
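The two conditions can be checked mechanically. The sketch below assumes that Eq. 3 reads $hp(w) \le (1+\epsilon)(hp(v) + p(v))$ for consecutive nodes $v, w$ and that Eq. 4 reads $hp(u) > (1+\epsilon)(hp(v) + p(v))$ for the node $u$ following $w$; these forms are inferred from the surrounding proofs. Each node is represented only by a hypothetical `(hp, p)` tuple, and `is_compressed` is an illustrative name.

```python
def is_compressed(nodes, eps):
    """Check both compression conditions for a list in score order.

    nodes[i] = (hp, p): positives scored strictly below the i-th node,
    and the positives stored in the node itself.
    """
    for i in range(len(nodes) - 1):
        hp_v, p_v = nodes[i]
        bound = (1 + eps) * (hp_v + p_v)
        # Assumed Eq. 3: the next node must not be too far ahead.
        if nodes[i + 1][0] > bound:
            return False
        # Assumed Eq. 4: the node after that must be far enough ahead,
        # otherwise the middle node is redundant.
        if i + 2 < len(nodes) and nodes[i + 2][0] <= bound:
            return False
    return True
```

For example, with five positive entries at distinct scores, one starting sentinel and $\epsilon = 0.5$, keeping the first three positive nodes satisfies both conditions, while also keeping the fourth violates the second one.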
4.1 Computing approximate AUC
Our next step is to show how we can approximate AUC using a compressed list $C$ in $\mathcal{O}((\log k) / \epsilon)$ time. The idea is as follows. Let $B$ be the set of nodes between two consecutive nodes $v$ and $w$ in $C$. Normally, we would have to go over each individual node in $B$ when computing AUC. Instead, we will group $B$ to a single node. We will use the total number of positive labels in $B$, obtained from the gap counters of $C$, for the number of positive labels for this node. Similarly, we will use the total number of negative labels in $B$ for the negative labels. The pseudo-code for the algorithm is given in Algorithm 4.
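One possible reading of the grouping idea is sketched below. It is not the paper's Algorithm 4: the group totals are obtained here by scanning all nodes, whereas the actual data structure would read them from the gap counters in constant time per list element. The sketch assumes that nodes kept in the list contribute their exact term and that the nodes strictly between two kept nodes are merged into one pseudo-node; `approx_auc` and the `(score, p, n)` tuples are illustrative.

```python
def approx_auc(nodes, selected):
    """Grouped AUC estimate.

    nodes: list of (score, p, n) sorted by score.
    selected: set of scores kept in the compressed list.
    """
    total_p = sum(p for _, p, _ in nodes)
    total_n = sum(n for _, _, n in nodes)
    acc = 0.0
    hp = 0        # positives strictly below the current node
    grp_hp = 0    # positives below the current in-between group
    grp_p = 0     # positives inside the current group
    grp_n = 0     # negatives inside the current group
    for score, p, n in nodes:
        if score in selected:
            acc += (grp_hp + grp_p / 2) * grp_n  # close the pending group
            acc += (hp + p / 2) * n              # kept node: exact term
            hp += p
            grp_hp, grp_p, grp_n = hp, 0, 0
        else:
            grp_p += p
            grp_n += n
            hp += p
    acc += (grp_hp + grp_p / 2) * grp_n          # trailing group, if any
    return acc / (total_p * total_n)
```

With every score selected the function returns the exact value; with fewer kept nodes, every negative entry inside a group is credited with the midpoint of the possible positive counts.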
Let us first establish that ApproxAUC produces an accurate estimate.
Proposition 1
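Let $C$ be an $\epsilon$-compressed list constructed from the search tree $T$. Let $\widetilde{\mathit{auc}}$ be the approximate AUC returned by ApproxAUC, and let $\mathit{auc}$ be the correct AUC. Then $|\widetilde{\mathit{auc}} - \mathit{auc}| \le \frac{\epsilon}{2}\, \mathit{auc}$.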
Proof
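Let $A$ be the normalization factor as defined in ApproxAUC. Let $v$ be a node in $T$, and let $u$ be the node in $C$ with the largest score such that $s(u) \le s(v)$. Let $w = \mathit{next}(u; C)$ be the next node. Define
$$c_v = \begin{cases} hp(v) + p(v)/2, & \text{if } v = u, \\ \big(hp(u) + p(u) + hp(w)\big)/2, & \text{otherwise}. \end{cases}$$
Then, ApproxAUC returns
$$\widetilde{\mathit{auc}} = \frac{1}{A} \sum_{v \in T} c_v\, n(v). \tag{5}$$
We will argue the approximation guarantee by comparing the terms in Eq. 1 and Eq. 5. Let $v$ be a node in $C$. Then the corresponding term $(hp(v) + p(v)/2)\, n(v)$ can be found in sums of both equations.
Let $v \notin C$, and write $x = hp(v) + p(v)/2$. Let $u$ be the node in $C$ with the largest score such that $s(u) < s(v)$. Let $w = \mathit{next}(u; C)$ be the next node. By definition, we have $hp(u) + p(u) \le x \le hp(w)$. Since $c_v$ is the average of the lower bound and the upper bound, we have
$$|c_v - x| \le \frac{hp(w) - hp(u) - p(u)}{2} \le \frac{\epsilon}{2}\,\big(hp(u) + p(u)\big) \le \frac{\epsilon}{2}\, x,$$
where the second inequality follows since $C$ is $\epsilon$-compressed.
We have shown that the approximation holds for individual terms. Consequently, it holds for the sums $\widetilde{\mathit{auc}}$ and $\mathit{auc}$, completing the proof. □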
Two remarks are in order. First, since AUC is never larger than 1, Proposition 1 implies that the approximation is also absolute, $|\widetilde{\mathit{auc}} - \mathit{auc}| \le \epsilon / 2$. The relative approximation is more accurate if AUC is small. However, if AUC is close to 1, it may make sense to reverse the approximation guarantee, that is, modify the algorithm such that we have a guarantee of $|\widetilde{\mathit{auc}} - \mathit{auc}| \le \frac{\epsilon}{2}\,(1 - \mathit{auc})$. This can be done by flipping the labels, and using $1 - \widetilde{\mathit{auc}}$ as the estimate.
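The label-flipping remark can be made explicit with a short derivation. This is a sketch under the assumption that flipping every label turns the true AUC $\mathit{auc}$ into $\mathit{auc}' = 1 - \mathit{auc}$ (each positive-negative pair is then counted in the opposite order, and ties still count one half); the primed symbols are introduced here for illustration only.

```latex
\begin{align*}
\mathit{auc}' &= 1 - \mathit{auc}
  && \text{(true AUC after flipping the labels)} \\
|\widetilde{\mathit{auc}}' - \mathit{auc}'| &\le \tfrac{\epsilon}{2}\, \mathit{auc}'
  && \text{(Proposition 1 applied to the flipped window)} \\
|(1 - \widetilde{\mathit{auc}}') - \mathit{auc}|
  &= |\mathit{auc}' - \widetilde{\mathit{auc}}'|
  \le \tfrac{\epsilon}{2}\,(1 - \mathit{auc})
  && \text{(guarantee for the estimate } 1 - \widetilde{\mathit{auc}}'\text{)}
\end{align*}
```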
ApproxAUC runs in $\mathcal{O}(|C|)$ time. Next we establish that $|C|$ is small.
Proposition 2
Let $C$ be an $\epsilon$-compressed list. Then $|C| \in \mathcal{O}((\log k) / \epsilon)$, where $k$ is the number of entries in the sliding window.
Proof
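Write $C = v_0, \ldots, v_m$. Since $C$ is $\epsilon$-compressed, $hp(v_{i+2}) > (1 + \epsilon)\, hp(v_i)$ and $hp(v_2) \ge 1$. Since $hp(v_m) \le k$, we have $(1 + \epsilon)^{\lfloor (m - 2)/2 \rfloor} \le k$. Solving for $m$ leads to $m \in \mathcal{O}((\log k) / \epsilon)$. □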
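The step from the compression condition to the logarithmic bound can be spelled out as follows. This is a sketch, assuming that Eq. 4 has the form $hp(u) > (1+\epsilon)(hp(v) + p(v))$ for three consecutive nodes $v, w, u$ of $C$, that the list is indexed as $v_0, \ldots, v_m$ with $v_0$ the starting sentinel and $v_1$ the first positive node (so that $hp(v_2) \ge 1$), and that $0 < \epsilon \le 1$. The constants are not tuned.

```latex
\begin{align*}
hp(v_{i+2}) &> (1+\epsilon)\,\big(hp(v_i) + p(v_i)\big) \ge (1+\epsilon)\, hp(v_i)
  && \text{(Eq.~4)} \\
k \ge hp(v_m) &\ge (1+\epsilon)^{\lfloor (m-2)/2 \rfloor}\, hp(v_2)
  \ge (1+\epsilon)^{\lfloor (m-2)/2 \rfloor}
  && \text{(iterating from } v_2\text{)} \\
m &< \frac{2 \ln k}{\ln(1+\epsilon)} + 4 \le \frac{4 \ln k}{\epsilon} + 4
  && \text{(since } \ln(1+\epsilon) \ge \epsilon/2 \text{ for } 0 < \epsilon \le 1\text{)}
\end{align*}
```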
4.2 Updating the data structures
Our final step is to describe procedures for maintaining $C$ as the data window slides. In the previous section, we already described how to update the search trees $T$ and $TP$ as well as the weighted linked list $P$. Our next step is to make sure that the weighted linked list $C$ stays $\epsilon$-compressed.
We will need two utility routines. The first routine, AddNext, given in Algorithm 5, takes as input a node $v$ included in both $C$ and $P$, and adds to $C$ the next node in $P$. This procedure will be used extensively to add extra nodes to $C$ so that Eq. 3 is satisfied.
Next, we demonstrate how AddNext enforces Eq. 3.
Lemma 1
Assume that a linked list $C$ satisfies Eq. 3 for consecutive positive nodes $v$ and $u$. Add or remove a single positive entry with a score $s$, and assume that $v$ and $u$ are still positive. Let $w$ be the next positive node from $v$ in $P$, and let $C'$ be the list obtained from $C$ by adding the positive node $w$. Then Eq. 3 holds for $C'$ for the nodes $v$ and $w$ as well as for the nodes $w$ and $u$.
Proof
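Let us write $a = hp(v) + p(v)$ before modifying $T$, and $a'$ for the same quantity after the modification. Similarly, write $b = hp(u)$ before the modification, and $b'$ after the modification.
Since $w$ is the next positive node of $v$, we have $hp(w) = a'$ after the modification, proving the case of $v$ and $w$.
If $b' \le (1 + \epsilon)\, a'$, then Eq. 3 holds for $w$ and $u$, since $hp(w) + p(w) \ge a'$.
If we are adding and $b' > (1 + \epsilon)\, a'$, then $a' = a$ and $b' = b + 1$, and so
$$b' = b + 1 \le (1 + \epsilon)\, a' + 1 \le (1 + \epsilon)\,\big(a' + p(w)\big) = (1 + \epsilon)\,\big(hp(w) + p(w)\big),$$
where the last inequality holds since $w$ is a positive node.
If we are removing and $b' > (1 + \epsilon)\, a'$, then $a' = a - 1$ and $b' = b - 1$, and so
$$b' = b - 1 \le (1 + \epsilon)\, a - 1 = (1 + \epsilon)\, a' + \epsilon \le (1 + \epsilon)\,\big(a' + p(w)\big) = (1 + \epsilon)\,\big(hp(w) + p(w)\big).$$
This proves the case for $w$ and $u$, and completes the proof. □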
Note that the execution of AddNext is done in constant time, the key step for this being able to obtain the next positive node and the label counts between the two nodes in constant time. This is the main reason why we maintain $P$.
While the first utility algorithm adds new entries to $C$, our second utility algorithm, Compress, given in Algorithm 6, tries to delete as many entries as possible. It assumes that the input list already satisfies Eq. 3, and searches for violations of Eq. 4. Whenever such a violation is found, the algorithm proceeds by deleting the middle node. Note that deleting this node will not violate Eq. 3. Consequently, upon termination, the resulting linked list will be $\epsilon$-compressed. The computational complexity of Compress is $\mathcal{O}(|C|)$.
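The deletion pass can be sketched on a plain Python list. This is an illustration rather than the paper's Algorithm 6: it assumes that a violation of Eq. 4 for three consecutive nodes $v, w, u$ means $hp(u) \le (1+\epsilon)(hp(v) + p(v))$, represents each node by a hypothetical `(hp, p)` tuple, and uses list deletion where a linked list would unlink the node in constant time.

```python
def compress(nodes, eps):
    """Drop redundant middle nodes from a list that already satisfies Eq. 3.

    nodes[i] = (hp, p) in score order; a new list is returned.
    """
    out = list(nodes)
    i = 0
    while i + 2 < len(out):
        hp_v, p_v = out[i]
        if out[i + 2][0] <= (1 + eps) * (hp_v + p_v):
            # out[i] and out[i + 2] satisfy Eq. 3 on their own,
            # so the node between them can go; retry from the same i.
            del out[i + 1]
        else:
            i += 1
    return out
```

A single left-to-right pass is enough in this sketch: deleting a middle node only increases the head count of the successor seen from earlier nodes, so the condition already established for earlier triples keeps holding.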
Next, we describe the update steps. We will start with the easier ones:
Adding negative entry:
Given a negative entry with a score $s$, we first invoke AddTreeNeg. Then we search for $v \in C$ with the largest score such that $s(v) \le s$. Once this entry is found, we increase $gn(v; C)$ by 1.
Removing negative entry:
Given a negative entry with a score $s$, we first invoke RemoveTreeNeg. Then we search for $v \in C$ with the largest score such that $s(v) \le s$. Once this entry is found, we decrease $gn(v; C)$ by 1.
Since the positive labels are not modified, $C$ remains $\epsilon$-compressed, so there is no need for modifying $C$ further. The running time for both routines is in $\mathcal{O}((\log k) / \epsilon)$.
Let us now consider more complex cases:
Adding positive entry:
Given a positive entry with a score $s$, we first invoke AddTreePos. Then we search for $v \in C$ with the largest score such that $s(v) \le s$. Once this entry is found, we increase $gp(v; C)$ by 1. By doing so, we may have violated Eq. 3 for consecutive nodes in $C$. Lemma 1 states that we can correct the problem by adding the next positive node for each violation. However, a closer inspection of the proof shows that there can be only one violation, namely at $v$ and its successor in $C$. Consequently, we check if Eq. 3 holds for $v$, and if it fails, we add the next positive node by invoking AddNext. Finally, we call Compress to force Eq. 4, ensuring that $C$ is $\epsilon$-compressed. The pseudo-code for AddPos is given in Algorithm 7.
Removing positive entry:
Assume that we are given a positive entry with a score $s$. First we search for $v \in C$ with the largest score such that $s(v) \le s$. Once this entry is found, we decrease $gp(v; C)$ by 1. If $v$ is no longer positive, we add the next positive entry to $C$ and delete $v$ from $C$. The reason for this is explained later. We proceed by deleting the entry from the search trees with RemoveTreePos.
Next we make sure that Eq. 3 holds for every pair of consecutive nodes $v$ and $w$. There are two possible cases: (i) $v$ and $w$ were consecutive nodes in $C$ before the deletion, or (ii) a node $u$ was deleted from $C$, and $w$ was the next positive node before the deletion. In the first case, Lemma 1 guarantees that using AddNext forces Eq. 3. In the second case, note that $hp(w)$ after the deletion is equal to $hp(u)$ before the deletion of $u$. This implies that since Eq. 3 held for $v$ and $u$ before the deletion, Eq. 3 holds for $v$ and $w$ after the deletion. Finally, we enforce Eq. 4 with Compress. The pseudo-code for RemovePos is given in Algorithm 8.
In both routines, modifying the search trees is done in $\mathcal{O}(\log k)$ time, while modifying $C$ is done in $\mathcal{O}(|C|)$ time.
5 Related work
The closest related work is a study by Bouckaert [3], where the author divided the ROC curve area into bins, so that only the counters for individual bins need to be maintained. However, the number of the bins as well as the bins themselves were static, and no direct approximation guarantees were provided.
Using AUC in a streaming setting was proposed in a paper by Brzezinski and Stefanowski [5]. Here the authors use a red-black tree, similar to $T$, to maintain the order of the data points in a sliding window, but they recompute the AUC from scratch every time, leading to an update time of $\mathcal{O}(k)$. In fact, our approach is essentially equivalent to their approach if we set $\epsilon = 0$.
Note that using AUC is useful if we do not have a threshold to binarize the score. If we do have such a threshold, then we can easily maintain a confusion matrix, which allows us to compute many metrics, such as accuracy, recall, $F$-measure [9, 8], and Kappa-statistic [2, 13]. However, determining such a threshold may be extremely difficult since it depends on the misclassification costs. Selecting such costs may come down to a(n educated) guess.
We based our AUC calculation on a sliding window, that is, we abruptly forget the data points after a certain period of time. The other option is to gradually forget the data points, for example using an exponential decay (see a survey by Gama et al. [10] for such examples). There is currently no methodology for efficiently estimating AUC under exponential decay, and this is a promising future line of work.
In a related line of work, training a classifier by optimizing AUC in a static setting has been proposed by Ataman et al. [1], Ferri et al. [7], Brefeld and Scheffer [4], Herschtal and Raskutti [12]. Here, AUC is used as an optimization criterion, and needs to be recomputed from scratch each time. Naturally, this may be too expensive for large databases. Calders and Jaroszewicz [6] estimated AUC as a continuous function. This allowed them to view AUC as a smooth function, and optimize the parameters of the underlying classifier efficiently using gradient descent techniques. While the underlying problem is the same as ours, that is, computing AUC from scratch is expensive, the maintenance procedures make the problems orthogonal: in our setting we are required to do updates when a single data point leaves or enters our window, whereas here AUC needs to be recomputed since the scores (and the order) for all existing data points have changed. However, it may be possible and fruitful to use similar tricks in order to speed up the AUC calculation when optimizing classifiers. We leave this as a future line of work.
Hand [11] proposed a fascinating alternative to AUC. Namely, the author views AUC as the optimal classification loss averaged (with weights) over the misclassification cost ratio. He then argues that AUC evaluates incoherently, namely the cost ratio weights depend on the ROC curve, and then he proposes a different, coherent alternative. The computation of the proposed metric, though more complex, shares some similarity with AUC, and it may be possible to use similar techniques as in this paper to approximate this measure efficiently in a stream.
6 Experimental evaluation
In this section we present our experimental evaluation. We have two goals: to demonstrate the relative error in practice as a function of the guaranteed error, and to demonstrate the trade-off between the computational cost and the error.
We implemented the calculation of AUC using C++, and conducted the experiments using a Macbook Air (1.6 GHz Intel Core i5 / 8 GB Memory). (Footnote 2: See https://bitbucket.org/orlyanalytics/streamauc for the implementation.) As a classifier we used Python's scikit-learn implementation of logistic regression. Computing AUC was done in a separate job from training the classifier as well as scoring new data points; the reported running times measure only the computation of AUC over the whole test data.
We used 3 UCI datasets (Footnote 3: https://archive.ics.uci.edu/) for our experiments, see Table 1: (i) Hepmass, a dataset containing features from simulated particle collisions, split in training and test datasets. We used the Hepmass-1000 variant. Due to the memory restrictions of Python, we only used a sample of data points from the training data. We used the whole test dataset. (ii) Miniboone, a dataset used to distinguish electron neutrinos from muon neutrinos. Since the original data has data points ordered by label, we permuted the dataset and split it into training and test data. (iii) Tvads, a dataset containing features for identifying commercials from TV news channels. We used BBC and CNN channels as training data, and the remaining channels as test data.
Actual error vs. guarantee: Proposition 1 states that the error cannot be more than $\epsilon / 2$. First, we test the actual relative error, that is, $|\widetilde{\mathit{auc}} - \mathit{auc}| / \mathit{auc}$ as a function of $\epsilon$. The sliding window size is fixed in this experiment.
The top row of Figure 1 shows the relative error, averaged over all sliding windows, and the bottom row of Figure 1 shows the relative error, maximized over all sliding windows. From the results we see that both the maximum and the average error are smaller than the guarantee. In particular, the average error is typically several orders of magnitude smaller than the theoretical guarantee. As expected, both errors tend to increase as $\epsilon$ increases.
Computational cost vs. error: Next, we test the trade-off between the computational cost and the relative error. The top row of Figure 2 shows the running time as a function of the average error, while the bottom row of Figure 2 shows the size of the $\epsilon$-compressed list as a function of the average error. The window size is fixed in this experiment as well.
From the results, we see the trade-off between the error and the running time: as the error increases, the running time drops. This is mainly due to the fewer elements in the compressed list as demonstrated in the bottom row. The running time stabilizes for larger errors; this is due to the operations that do not depend on $\epsilon$, such as maintaining the binary tree $T$.
Computational cost vs. window size: Computing exact AUC requires $\mathcal{O}(k)$ time while estimating AUC is $\mathcal{O}((\log k) / \epsilon)$. Consequently, the speed-up should increase as the size of the sliding window increases. We demonstrate this effect in Figure 3 using the Miniboone dataset. We see that the speed-up increases as a function of window size: in the reported configuration, computing the estimate is 17 times faster than the exact computation.
7 Concluding remarks
In this paper we introduced an approximation scheme that allows us to maintain an estimate of AUC in a sliding window within the guaranteed relative error of $\epsilon / 2$ in $\mathcal{O}((\log k) / \epsilon)$ time. The key idea behind the estimator is to group the data points. The grouping has to be done cleverly so that the error stays small, the number of groups stays small, and the list can be updated quickly. We achieve this by maintaining groups, where the number of positive labels can only increase relatively by $\epsilon$ within one group, and must increase by at least $\epsilon$ within two groups. Our experimental evaluation suggests that the average error in practice is much smaller than the guaranteed approximation, and that we can achieve significant speed-up, especially as the window size grows.
Our algorithm relies on the fact that the data points have no weights; specifically, Lemma 1 relies on the fact that the update may change the counters only by 1. If the data points are weighted, a different approach is required: it is possible to construct the $\epsilon$-compressed list from scratch. The key idea here is a new query, where, given a threshold $\sigma$, we look for a node $v$ that has the largest score such that $hp(v) \le \sigma$. This query can be done using the same trick as in HeadStats, and it requires $\mathcal{O}(\log k)$ time. The list can then be constructed by calling this query with exponentially increasing thresholds $\mathcal{O}((\log k) / \epsilon)$ times. This leads to a running time of $\mathcal{O}((\log^2 k) / \epsilon)$. An interesting direction for future work is to improve this complexity.
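The from-scratch construction can be sketched as a greedy scan. The sketch is an assumption-laden illustration: it works on arrays of head counts for the positive nodes (sentinels included) and answers the threshold query with a binary search from Python's `bisect` module, standing in for the tree-based query that the text describes. It also assumes the compression bound $(1+\epsilon)(hp(v) + p(v))$ used elsewhere in these sketches; `build_compressed` is a hypothetical name.

```python
import bisect


def build_compressed(hps, ps, eps):
    """Greedily pick the nodes of a compressed list.

    hps[i]: positive weight scored strictly below the i-th positive node.
    ps[i]:  positive weight stored in the i-th node (0 for sentinels).
    Returns the indices of the chosen nodes, in score order.
    """
    chosen = [0]
    i = 0
    while i < len(hps) - 1:
        bound = (1 + eps) * (hps[i] + ps[i])
        # Threshold query: the last node whose head count is within the bound.
        j = bisect.bisect_right(hps, bound) - 1
        i = max(j, i + 1)  # the direct successor always qualifies
        chosen.append(i)
    return chosen
```

Jumping to the farthest admissible node at every step satisfies the first compression condition by construction, and the node picked after it lies beyond the bound, which gives the second condition. Because the bound grows geometrically, the number of queries stays logarithmic in the total positive weight.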
References
- [1] Ataman, K., Street, W., Zhang, Y.: Learning to rank by maximizing auc with linear programming. In: Neural Networks, 2006. IJCNN'06. International Joint Conference on. pp. 123–129. IEEE (2006)
- [2] Bifet, A., Frank, E.: Sentiment knowledge discovery in twitter streaming data. In: Discovery Science. pp. 1–15. Springer (2010)
- [3] Bouckaert, R.R.: Efficient AUC learning curve calculation. In: Australasian Joint Conference on Artificial Intelligence. pp. 181–191 (2006)
- [4] Brefeld, U., Scheffer, T.: Auc maximizing support vector learning. In: Proceedings of the ICML 2005 workshop on ROC Analysis in Machine Learning (2005)
- [5] Brzezinski, D., Stefanowski, J.: Prequential AUC: properties of the area under the ROC curve for data streams with concept drift. KAIS 52(2), 531–562 (2017)
- [6] Calders, T., Jaroszewicz, S.: Efficient AUC optimization for classification. In: PKDD. pp. 42–53 (2007)
- [7] Ferri, C., Flach, P., Hernández-Orallo, J.: Learning decision trees using the area under the roc curve. In: ICML. vol. 2, pp. 139–146 (2002)
- [8] Gama, J.: Knowledge discovery from data streams. CRC Press (2010)
