Efficient estimation of AUC in a sliding window

Nikolaj Tatti

arXiv:1902.00632·cs.LG·February 5, 2019

TL;DR

This paper introduces an efficient algorithm for approximating the AUC in a sliding window over data streams, significantly reducing computation time while maintaining high accuracy, which is crucial for real-time system monitoring.

Contribution

The paper presents a novel grouping-based algorithm for approximate AUC estimation in sliding windows, achieving faster updates with controlled error margins.

Findings

01 The proposed method reduces update time from O(k) to O((log k)/ε).

02 Experimental results show the approximation error is often much smaller than the theoretical bound.

03 Significant speed-ups are achieved with only a modest decrease in accuracy.

Abstract

In many applications, monitoring area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant. In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of length $k$. More specifically, we propose an algorithm that, given $\epsilon$, estimates AUC within $\epsilon/2$, and can maintain this estimate in $O((\log k)/\epsilon)$ time, per update, as the window slides. This provides a speed-up over the exact computation of AUC, which requires $O(k)$ time, per update. The speed-up becomes more significant as the size of the window increases. Our estimate is based on grouping the data points together, and using these groups to calculate AUC. The grouping is…

Tables

Table 1: Basic characteristics of the benchmark datasets.

Dataset	Training set size	Test set size
Hepmass	$500\,000$	$3\,500\,000$
Miniboone	$30\,064$	$100\,000$
Tvads	$40\,265$	$89\,420$

Equations

$$\mathit{n}(s)=\left|\{i\mid s_{i}=s,\ \ell_{i}=0\}\right|\quad\text{and}\quad\mathit{p}(s)=\left|\{i\mid s_{i}=s,\ \ell_{i}=1\}\right|$$

$$\mathit{auc}=\frac{1}{A}\sum_{s}\left(\mathit{hp}(s)+\frac{1}{2}\mathit{p}(s)\right)\mathit{n}(s) \tag{1}$$

$$B=\{w\in T\mid \mathit{s}(u)\leq\mathit{s}(w)<\mathit{s}(v)\}$$

$$\mathit{gp}(u;L)=\sum_{w\in B}\mathit{p}(w)\quad\text{and}\quad\mathit{gn}(u;L)=\sum_{w\in B}\mathit{n}(w)$$

$$\mathit{hp}(s)=\sum_{v\in T\mid\mathit{s}(v)<s}\mathit{p}(v)\quad\text{and}\quad\mathit{hn}(s)=\sum_{v\in T\mid\mathit{s}(v)<s}\mathit{n}(v)$$

$$\mathit{hp}(w)\leq\alpha\left(\mathit{hp}(v)+\mathit{p}(v)\right) \tag{3}$$

$$\mathit{hp}(u)>\alpha\left(\mathit{hp}(v)+\mathit{p}(v)\right) \tag{4}$$

$$c_{v}=\frac{1}{2}\left(\mathit{hp}(u)+\mathit{p}(u)+\mathit{hp}(w)\right)$$

$$\widetilde{\mathit{auc}}=\frac{1}{A}\left(\sum_{v\in L}\left(\mathit{hp}(v)+\frac{1}{2}\mathit{p}(v)\right)\mathit{n}(v)+\sum_{v\in T\setminus L}c_{v}\,\mathit{n}(v)\right) \tag{5}$$

$$\left|b-c_{v}\right|\leq\frac{1}{2}\left(\mathit{hp}(w)-\mathit{hp}(u)-\mathit{p}(u)\right)\leq\frac{\epsilon}{2}\left(\mathit{hp}(u)+\mathit{p}(u)\right)\leq\frac{\epsilon b}{2}$$

$$c_{w}^{\prime}=c_{w}+1\leq\alpha(c_{v}+b_{v}+1)\leq\alpha(c_{v}^{\prime}+b_{v}^{\prime}+1)=\alpha(c_{u}^{\prime}+1)\leq\alpha(c_{u}^{\prime}+b_{u}^{\prime})$$

$$c_{w}^{\prime}\leq c_{w}\leq\alpha(c_{v}+b_{v})\leq\alpha(c_{v}^{\prime}+b_{v}^{\prime}+1)=\alpha(c_{u}^{\prime}+1)\leq\alpha(c_{u}^{\prime}+b_{u}^{\prime})$$

Full text

F-Secure, Helsinki, Finland, email: [email protected]

Abstract

In many applications, monitoring area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant.

In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of length $k$ . More specifically, we propose an algorithm that, given $\epsilon$ , estimates AUC within $\epsilon/2$ , and can maintain this estimate in $\mathit{\mathcal{O}}\mathopen{}\left((\log k)/\epsilon\right)$ time, per update, as the window slides. This provides a speed-up over the exact computation of AUC, which requires $\mathit{\mathcal{O}}\mathopen{}\left(k\right)$ time, per update. The speed-up becomes more significant as the size of the window increases. Our estimate is based on grouping the data points together, and using these groups to calculate AUC. The grouping is designed carefully such that ( $i$ ) the groups are small enough, so that the error stays small, ( $ii$ ) the number of groups is small, so that enumerating them is not expensive, and ( $iii$ ) the definition is flexible enough so that we can maintain the groups efficiently.

Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee $\epsilon/2$ , and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.

Keywords: AUC, approximation guarantee, sliding window

1 Introduction

Consider monitoring prediction performance in a stream of data points. That is, we first receive a data point $d$ without the label and predict the missing label with a score $s$; after the prediction, we receive the true label $\ell$. We are interested in monitoring how well $s$ predicts $\ell$ as the stream evolves over time.

A good example of such a task is a monitoring system for corporate computers that detects abnormal behavior based on event logs. Here the positive label represents an abnormal event that requires a closer inspection, and such a label can be given, for example, by an expert or triggered automatically. The produced score can be used for decision making, and can be a specific feature or a simple statistic, or the result of some classifier, such as logistic regression. It is vital to monitor such a system continuously to notice breakdowns early. Possible causes include changes in the underlying distribution or a system failure caused by a software update.

A natural choice to monitor the predictive power of a real-valued score is the area under the ROC curve (AUC) in a sliding window over the stream of events as proposed by Brzezinski and Stefanowski [5]. Unfortunately, maintaining the exact AUC requires $\mathit{\mathcal{O}}\mathopen{}\left(k\right)$ time, per new event, where $k$ is the size of the window. This may be too expensive if $k$ is large and the rate of the events is significant.

In this paper we propose a technique for estimating AUC efficiently in a sliding window. Namely, we propose an approximation scheme that has $\epsilon/2$ approximation error guarantee while having $\mathit{\mathcal{O}}\mathopen{}\left((\log k)/\epsilon\right)$ update time. That is, the scheme provides a trade-off between the accuracy and computational complexity.

Our approach is straightforward. Computing AUC exactly requires sorting data points and summing over all data points (see Eq. 1 for the exact formula). Keeping the points sorted can be done using binary search trees. However, estimating the sum requires additional tricks. We approach the problem by grouping neighboring data points together, that is, treating them as if the classifier had given them the same score.

The key step is to design a grouping such that 3 properties hold at the same time: ($i$) the groups are small enough so that the relative error is small, more specifically, $\left|\widetilde{\mathit{auc}}-\mathit{auc}\right|/\mathit{auc}\leq\epsilon/2$, ($ii$) the number of groups is small enough, more specifically, it should be in $\mathcal{O}\left((\log k)/\epsilon\right)$, and ($iii$) the definition should be flexible enough so that we can do quick updates whenever points arrive or leave the sliding window.

Roughly speaking, in order to accommodate all 3 demands, we will maintain the groups with the following two properties: ($i$) the number of positive labels in a group, counted together with all the previous groups, is at most $(1+\epsilon)$ times the total number of positive labels in all the previous groups, and ($ii$) the number of positive labels in a group and the next group, again counted together with all the previous groups, is larger than $(1+\epsilon)$ times the total number of positive labels in all the previous groups. The first property will yield the approximation guarantee, while the second property guarantees that the number of groups remains small. Moreover, these properties are flexible enough so that we can perform update procedures quickly.

The rest of the paper is organized as follows. We begin by recalling the definition of AUC in Section 2. Updating the groups of data points quickly requires several auxiliary structures, which we introduce in Section 3. We then proceed to describe AUC estimation in Section 4. The related work is given in Section 5. In Section 6, we demonstrate that the relative error in practice is much smaller than the guaranteed bound, and study the trade-off between the error and the computational cost. Finally, we conclude the paper with discussion in Section 7.

2 Preliminaries

We start with the definition of AUC, and provide a formula for computing it.

Assume that we are given a set of $k$ pairs $W=(s_{i},\ell_{i})_{i=1}^{k}$, where $\ell_{i}\in\{0,1\}$ is the true label of the $i$th instance, and $s_{i}$ is the score produced by the classification algorithm. The larger $s_{i}$, the more we believe that $\ell_{i}$ should be $0$. (We chose this direction for notational convenience.)

In order to predict a label, we need a threshold $\sigma$ , and predict that $\ell_{i}=0$ if $s_{i}\geq\sigma$ , and $\ell_{i}=1$ otherwise. The ROC curve is obtained by varying $\sigma$ and plotting true positive rate as a function of false positive rate. AUC is the area under the ROC curve. To compute AUC, we can use the following formula. Let

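$$\mathit{n}(s)=\left|\{i\mid s_{i}=s,\ \ell_{i}=0\}\right|\quad\text{and}\quad\mathit{p}(s)=\left|\{i\mid s_{i}=s,\ \ell_{i}=1\}\right|$$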

be the counts of labels with a score of $s$ . Define also $\mathit{hp}\mathopen{}\left(s\right)=\sum_{t<s}\mathit{p}\mathopen{}\left(t\right)$ . Then,

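$$\mathit{auc}=\frac{1}{A}\sum_{s}\left(\mathit{hp}(s)+\frac{1}{2}\mathit{p}(s)\right)\mathit{n}(s), \tag{1}$$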

where $A={\left|\left\{i\mid\ell_{i}=0\right\}\right|}{\left|\left\{i\mid\ell_{i}=1\right\}\right|}$ is the normalization factor. Eq. 1 can be computed in $\mathit{\mathcal{O}}\mathopen{}\left(k\log k+k\right)$ time by first sorting $W$ , computing $\mathit{hp}$ , and enumerating over the sum of Eq. 1.
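The computation described above can be sketched in a few lines. This is a minimal illustration of Eq. 1, not code from the paper: the function name `exact_auc` and the representation of the window as a list of `(score, label)` tuples are assumptions made for the example.

```python
from collections import Counter


def exact_auc(window):
    """Compute AUC from (score, label) pairs using Eq. 1.

    Labels are 0 or 1. Following the convention of this section, a larger
    score suggests label 0, so a (positive, negative) pair counts as correct
    when the positive entry has the smaller score; ties count one half.
    """
    p = Counter()  # p(s): number of positive entries with score s
    n = Counter()  # n(s): number of negative entries with score s
    for score, label in window:
        if label == 1:
            p[score] += 1
        else:
            n[score] += 1

    a = sum(p.values()) * sum(n.values())  # normalization factor A
    if a == 0:
        raise ValueError("AUC is undefined unless both labels are present")

    hp = 0  # hp(s): positives with a score strictly below s
    total = 0.0
    for s in sorted(set(p) | set(n)):  # sorting dominates: O(k log k)
        total += (hp + 0.5 * p[s]) * n[s]
        hp += p[s]
    return total / a
```

Recomputing this for every window position is exactly the cost that the rest of the paper tries to avoid.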

In a streaming setting, $W$ is a sliding window, and our goal is to compute AUC as $W$ slides over a stream of predictions and labels.

3 Supporting data structures for estimating AUC

In this section we introduce supporting data structures that are needed to compute AUC in a streaming setting. Additional structures and the actual logic for computing AUC are given in the next section. We begin by describing the data structures, then follow with introducing the needed query operations, and finally finish with explaining the update procedures.

3.1 Data structures

Assume that we have a sequence of pairs $W=(s_{i},\ell_{i})_{i=1}^{k}$ , where $s_{i}$ is the score produced by the classifier, and $\ell_{i}\in\left\{0,1\right\}$ is the true label.

We store $W$ in a red-black tree $T$ sorted by the scores $s_{i}$ . Let $v\in T$ be a node in $T$ . We will denote the corresponding score of $v$ by $\mathit{s}\mathopen{}\left(v\right)$ . We store and maintain the following information:

• Counter $\mathit{p}(v)=\left|\{i\mid s_{i}=\mathit{s}(v),\ \ell_{i}=1\}\right|$, the number of pairs in $W$ with a score $\mathit{s}(v)$ and a positive label.

• Counter $\mathit{n}(v)=\left|\{i\mid s_{i}=\mathit{s}(v),\ \ell_{i}=0\}\right|$, the number of pairs in $W$ with a score $\mathit{s}(v)$ and a negative label.

• Counter $\mathit{accpos}(v)$, the total sum of $\mathit{p}(w)$, where $w$ ranges over all descendant nodes of $v$ in $T$, including $v$ itself.

• Counter $\mathit{accneg}(v)$, the total sum of $\mathit{n}(w)$, where $w$ ranges over all descendant nodes of $v$ in $T$, including $v$ itself.

For simplicity, we will add two sentinel nodes to $T$ . The first node will have a score of $-\infty$ and the second node has a score $\infty$ . We will assume that the actual entries will never achieve these values. Both sentinel nodes have 0 positive labels and 0 negative labels.

Note that if the scores $s_{i}$ are unique, then we have either $\mathit{p}\mathopen{}\left(v\right)=1$ , $\mathit{n}\mathopen{}\left(v\right)=0$ , or $\mathit{p}\mathopen{}\left(v\right)=0$ , $\mathit{n}\mathopen{}\left(v\right)=1$ . However, if there are duplicate scores, then we may have any integer combinations.

In addition to red-black trees, we need to maintain several linked lists, for which we will now introduce the notation. Assume that we are given a subset $U$ of nodes in $T$ . We would like to maintain $U$ in a linked list $L$ , sorted by the score. For that we will need two pointers for each node $u\in U$ , namely, $\mathit{next}\mathopen{}\left(u;L\right)$ indicating the next node in $L$ , and $\mathit{prev}\mathopen{}\left(u;L\right)$ indicating the previous node in $L$ . Let $u\in U$ and assume that $v=\mathit{next}\mathopen{}\left(u;L\right)$ exists. Let

$$B=\{w\in T\mid\mathit{s}(u)\leq\mathit{s}(w)<\mathit{s}(v)\}$$

be the set of nodes in $T$ between $u$ and $v$ . We define

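$$\mathit{gp}(u;L)=\sum_{w\in B}\mathit{p}(w)\quad\text{and}\quad\mathit{gn}(u;L)=\sum_{w\in B}\mathit{n}(w)$$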

to be the total sums of the labels in the gap $B$. We will refer to $L$ as a weighted linked list. Note that deleting an element from $L$ and maintaining the gap counters can be done in constant time. We will refer to the deletion algorithm as $\textsc{Remove}(L,v)$. Moreover, adding a new element, say $v$, to $L$ after $u$ can also be done in constant time, if we already know the total sums of labels, say $p$ and $n$, between $u$ and $v$. We will refer to the insertion algorithm as $\textsc{Add}(L,u,v,p,n)$.
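A possible shape for the two constant-time operations is sketched below. The class `Node` and the functions `add` and `remove` are hypothetical names for this illustration; the list argument $L$ is dropped because the sketch stores the pointers and gap counters of a single list directly on the nodes, and it assumes sentinel nodes so that a removed node always has a predecessor.

```python
class Node:
    """A list element; gp and gn are the gap counters towards the next element."""

    def __init__(self, score, p=0, n=0):
        self.score = score
        self.p = p        # positive entries with exactly this score
        self.n = n        # negative entries with exactly this score
        self.prev = None
        self.next = None
        self.gp = 0       # positives with a score in [score, next.score)
        self.gn = 0       # negatives with a score in [score, next.score)


def add(u, v, p, n):
    """Insert v right after u; p and n are the label sums between u and v."""
    w = u.next
    v.prev, v.next = u, w
    u.next = v
    if w is not None:
        w.prev = v
    # The old gap [u, w) splits into [u, v) and [v, w).
    v.gp, v.gn = u.gp - p, u.gn - n
    u.gp, u.gn = p, n


def remove(v):
    """Unlink v and merge its gap into the gap of its predecessor."""
    u, w = v.prev, v.next
    u.gp += v.gp
    u.gn += v.gn
    u.next = w
    if w is not None:
        w.prev = u
    v.prev = v.next = None
```

Both operations touch only the node and its two neighbors, which is why the gap counters can be maintained without walking the tree.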

We say that the node $v\in T$ is positive, if $\mathit{p}\mathopen{}\left(v\right)>0$ . Similarly, we say that the node $v$ is negative, if $\mathit{n}\mathopen{}\left(v\right)>0$ . Note that $v$ can be both negative and positive.

We maintain all positive nodes in a weighted linked list, which we will refer to as $P$. Finally, we also store all positive nodes in their own dedicated red-black tree, denoted by $\mathit{TP}$. For simplicity, we also store the sentinel nodes of $T$ in $P$ and $\mathit{TP}$ as the first and the last nodes.

3.2 Query procedures

The first query that we need is $\textsc{MaxPos}(s)$ , returning the positive node $v$ with the largest score such that $\mathit{s}\mathopen{}\left(v\right)\leq s$ . This can be done in $\mathit{\mathcal{O}}\mathopen{}\left(\log k\right)$ time using $\mathit{TP}$ , where $k$ is the number of elements in the window.

Maintaining $\mathit{accpos}(v)$ and $\mathit{accneg}(v)$ allows us to query cumulative sums of counts. Specifically, given a score $s$, we are interested in

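$$\mathit{hp}(s)=\sum_{v\in T\mid\mathit{s}(v)<s}\mathit{p}(v)\quad\text{and}\quad\mathit{hn}(s)=\sum_{v\in T\mid\mathit{s}(v)<s}\mathit{n}(v).$$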

We can compute both of these sums with $\textsc{HeadStats}(s)$ , given in Algorithm 1.

The algorithm assumes that there is a node in $T$ containing $s$, and proceeds to find it; during the search, whenever we descend into the right branch, we add the accumulated sums of the left branch together with the counts of the current node. We omit the trivial proof of correctness. Since the tree is balanced, the running time of $\textsc{HeadStats}(s)$ is $\mathcal{O}\left(\log k\right)$, where $k$ is the number of entries in the window.
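The search can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's Algorithm 1: it uses a plain unbalanced binary search tree instead of a red-black tree, and `TreeNode`, `insert`, and `head_stats` are hypothetical names.

```python
class TreeNode:
    def __init__(self, score, p=0, n=0):
        self.score, self.p, self.n = score, p, n
        self.left = self.right = None
        self.accpos, self.accneg = p, n  # label sums over the whole subtree


def insert(root, score, p, n):
    """Plain (unbalanced) BST insert that keeps accpos/accneg up to date."""
    if root is None:
        return TreeNode(score, p, n)
    root.accpos += p
    root.accneg += n
    if score == root.score:
        root.p += p
        root.n += n
    elif score < root.score:
        root.left = insert(root.left, score, p, n)
    else:
        root.right = insert(root.right, score, p, n)
    return root


def head_stats(root, s):
    """Return (hp(s), hn(s)): label counts over nodes with a score below s."""
    hp = hn = 0
    v = root
    while v is not None:
        left_pos = v.left.accpos if v.left else 0
        left_neg = v.left.accneg if v.left else 0
        if s < v.score:
            v = v.left
        elif s > v.score:
            # Everything in the left subtree, and v itself, lies below s.
            hp += left_pos + v.p
            hn += left_neg + v.n
            v = v.right
        else:
            return hp + left_pos, hn + left_neg
    raise KeyError(s)  # the sketch assumes a node with score s exists
```

With a balanced tree the loop visits $\mathcal{O}(\log k)$ nodes; the unbalanced tree used here only keeps the example short.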

3.3 Update procedures

We now continue to the maintenance procedures as we slide the window. This comes down to two procedures: ( $i$ ) removing an entry from the window and ( $ii$ ) adding an entry to the window.

We will first describe removing an entry with a positive label and a score $s$ . First we will find the node, say $v$ , with the score $s$ , and reduce the counter $\mathit{p}\mathopen{}\left(v\right)$ by 1. We will need to update the $\mathit{accpos}$ counters. However, we only need to do it for the ancestors of $v$ , and there are only $\mathit{\mathcal{O}}\mathopen{}\left(\log k\right)$ of them, where $k$ is the number of entries in the window, since $T$ is balanced. We also reduce $\mathit{gp}\mathopen{}\left(v;P\right)$ by 1. In the process, $v$ may become non-positive, and we need to delete it from $\mathit{TP}$ as well as from $P$ .

Finally, if $\mathit{p}\mathopen{}\left(v\right)=\mathit{n}\mathopen{}\left(v\right)=0$ , we need to delete the node from $T$ . This may result in rebalancing of the tree, and during the balancing we need to make sure that the counters $\mathit{accpos}$ and $\mathit{accneg}$ are properly updated. Luckily, the red-black tree balancing is based on left and right rotations. During these rotations it is easy to maintain the counters without additional costs.

We will refer to this procedure as $\textsc{RemoveTreePos}(s)$ and the pseudo-code is given in Algorithm 2. $\textsc{RemoveTreePos}(s)$ runs in $\mathit{\mathcal{O}}\mathopen{}\left(\log k\right)$ time.

Deleting an entry with a negative label and a score $s$ is simpler. First, we find the node, say $v$, with the score $s$, and reduce the $\mathit{n}(v)$ counter by 1. If needed, we delete $v$ from $T$. Finally, we use $\textsc{MaxPos}(s)$ to find $u$, the largest positive node with $\mathit{s}(u)\leq s$, and reduce $\mathit{gn}(u;P)$ by 1. The procedure, referred to as RemoveTreeNeg, runs in $\mathcal{O}\left(\log k\right)$ time.

Next, we will describe the addition of a positive entry with a score $s$ . First, we will add the entry $s$ to $T$ , possibly creating a new node in the process. Let $v$ be the node in $T$ with the score $s$ .

If $v$ is a new node, then we need to add it to the weighted linked list $P$. First, we find the node, say $w=\textsc{MaxPos}(s)$, after which $v$ is supposed to be added. We need to compute the counter for the gap between $w$ and $v$, which becomes the new value of $\mathit{gn}(w;P)$. By definition, this value is equal to the total count of negative labels of nodes between $w$ and $v$, including $w$. Thus, this gap counter is equal to $\mathit{hn}(v)-\mathit{hn}(w)$. Both counters can be obtained using HeadStats in $\mathcal{O}\left(\log k\right)$ time.

We will refer to this procedure as $\textsc{AddTreePos}(s)$ , and the pseudo-code is given in Algorithm 3. $\textsc{AddTreePos}(s)$ runs in $\mathit{\mathcal{O}}\mathopen{}\left(\log k\right)$ time.

Adding an entry with a negative label and a score $s$ is simpler. First, we add the entry $s$ to $T$, possibly creating a new node in the process. Let $v$ be the node in $T$ with the score $s$. Then, we use $\textsc{MaxPos}(s)$ to find $u$, the largest positive node with $\mathit{s}(u)\leq s$, and increase $\mathit{gn}(u;P)$ by 1. The procedure, referred to as AddTreeNeg, runs in $\mathcal{O}\left(\log k\right)$ time.

4 Estimating AUC efficiently

In order to approximate AUC, we will use Eq. 1 as a basis. However, instead of enumerating over every node we will enumerate only over some selected nodes. The key is how to select the nodes such that we will obtain the approximation guarantee while keeping the number of nodes small.

We will maintain a weighted linked list $\mathit{C}$ . Given $\alpha>1$ , we say that $\mathit{C}$ is $\alpha$ -compressed, if for every two consecutive nodes in $\mathit{C}$ , say $v$ and $w$ , it holds that

$$\mathit{hp}(w)\leq\alpha\left(\mathit{hp}(v)+\mathit{p}(v)\right), \tag{3}$$

and if $u=\mathit{next}\mathopen{}\left(w;\mathit{C}\right)$ exists, then

$$\mathit{hp}(u)>\alpha\left(\mathit{hp}(v)+\mathit{p}(v)\right). \tag{4}$$

Eq. 3 will yield the approximation guarantee, while Eq. 4 will guarantee the running time.
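To make the definition concrete, the following sketch checks both conditions for a list given as two parallel arrays. The function name `is_compressed` and the array representation (one `hp` and one `p` value per list node, in score order, sentinels included) are assumptions made for this example.

```python
def is_compressed(hp, p, alpha):
    """Check Eq. 3 and Eq. 4 for a list given per-node hp and p values.

    hp[i] and p[i] belong to the i-th node of the list, in score order.
    """
    for i in range(len(hp) - 1):
        bound = alpha * (hp[i] + p[i])
        if hp[i + 1] > bound:                       # Eq. 3 is violated
            return False
        if i + 2 < len(hp) and hp[i + 2] <= bound:  # Eq. 4 is violated
            return False
    return True
```

For instance, with $\alpha=1.5$, a list holding eight single-entry positive nodes between the two sentinels satisfies Eq. 3 but not Eq. 4, because some middle nodes could be dropped without exceeding the bound.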

4.1 Computing approximate AUC

Our next step is to show how we can approximate AUC using a compressed list $L$ in $\mathcal{O}\left(\left|L\right|\right)$ time. The idea is as follows. Let $B$ be the set of nodes between two consecutive nodes $v$ and $w$ in $L$. Normally, we would have to go over each individual node in $B$ when computing AUC. Instead, we will group $B$ into a single node. We will use the total number of positive labels in $B$, that is, $\mathit{gp}(v;L)-\mathit{p}(v)$, as the number of positive labels for this node. Similarly, we will use $\mathit{gn}(v;L)-\mathit{n}(v)$ for the negative labels. The pseudo-code for the algorithm is given in Algorithm 4.

Let us first establish that ApproxAUC produces an accurate estimate.

Proposition 1

Let $L$ be a $(1+\epsilon)$-compressed list constructed from the search tree $T$. Let $\widetilde{\mathit{auc}}=\textsc{ApproxAUC}(L)$ be an approximate AUC, and let $\mathit{auc}$ be the correct AUC. Then $\left|\widetilde{\mathit{auc}}-\mathit{auc}\right|\leq\epsilon\,\mathit{auc}/2$.

Proof

Let $A$ be as defined in ApproxAUC. Let $v\in T$ be a node, and let $u$ be the node in $L$ with the largest score such that $\mathit{s}\mathopen{}\left(u\right)<\mathit{s}\mathopen{}\left(v\right)$ . Let $w=\mathit{next}\mathopen{}\left(u;L\right)$ be the next node. Define

$$c_{v}=\frac{1}{2}\left(\mathit{hp}(u)+\mathit{p}(u)+\mathit{hp}(w)\right).$$

Then, ApproxAUC returns

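$$\widetilde{\mathit{auc}}=\frac{1}{A}\left(\sum_{v\in L}\left(\mathit{hp}(v)+\frac{1}{2}\mathit{p}(v)\right)\mathit{n}(v)+\sum_{v\in T\setminus L}c_{v}\,\mathit{n}(v)\right). \tag{5}$$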

We will argue the approximation guarantee by comparing the terms in Eq. 1 and Eq. 5. Let $v$ be a node in $L$ . Then the corresponding term can be found in sums of both equations.

Let $v\in T\setminus L$ , and write $b=\mathit{hp}\mathopen{}\left(v\right)+\frac{1}{2}\mathit{p}\mathopen{}\left(v\right)$ . Let $u$ be the node in $L$ with the largest score such that $\mathit{s}\mathopen{}\left(u\right)\leq\mathit{s}\mathopen{}\left(v\right)$ . Let $w=\mathit{next}\mathopen{}\left(u;L\right)$ be the next node. By definition, we have $\mathit{hp}\mathopen{}\left(u\right)+\mathit{p}\mathopen{}\left(u\right)\leq b\leq\mathit{hp}\mathopen{}\left(w\right)$ . Since $c_{v}$ is the average of the lower bound and the upper bound, we have

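$$\left|b-c_{v}\right|\leq\frac{1}{2}\left(\mathit{hp}(w)-\mathit{hp}(u)-\mathit{p}(u)\right)\leq\frac{\epsilon}{2}\left(\mathit{hp}(u)+\mathit{p}(u)\right)\leq\frac{\epsilon b}{2},$$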

where the second inequality follows since $L$ is $(1+\epsilon)$ -compressed.

We have shown that the approximation holds for individual terms. Consequently, it holds for the sums $\widetilde{\mathit{auc}}$ and $\mathit{auc}$, completing the proof. ∎
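The term-by-term argument can be checked numerically with a direct evaluation of Eq. 5. The sketch below is not Algorithm 4: it walks over every node of $T$, so it runs in time linear in the number of nodes rather than in $\left|L\right|$, and it exists only to compare the estimate against Eq. 1. The name `approx_auc`, the `(p, n)` tuple representation, and the index list `selected` are assumptions for this example.

```python
from bisect import bisect_right
from itertools import accumulate


def approx_auc(nodes, selected):
    """Evaluate Eq. 5 for nodes [(p, n), ...] given in score order.

    `selected` holds the sorted indices of the nodes kept in the list L;
    it must start at 0 and end at len(nodes) - 1 (the two sentinels).
    """
    ps = [p for p, _ in nodes]
    hp = [0] + list(accumulate(ps))[:-1]    # hp[i]: positives strictly before i
    a = sum(ps) * sum(n for _, n in nodes)  # normalization factor A
    in_list = set(selected)
    total = 0.0
    for i, (p, n) in enumerate(nodes):
        if i in in_list:
            total += (hp[i] + 0.5 * p) * n     # exact term, as in Eq. 1
        else:
            k = bisect_right(selected, i)
            u, w = selected[k - 1], selected[k]
            c = 0.5 * (hp[u] + ps[u] + hp[w])  # midpoint c_v of the two bounds
            total += c * n
    return total / a
```

When `selected` contains every index, the function reduces to Eq. 1. In a small hand-made example with seven nodes and the 2-compressed list `[0, 1, 4, 6]`, the exact value is $8.5/12\approx 0.708$ and the estimate is $0.75$, well within the relative bound $\epsilon/2=0.5$ that Proposition 1 gives for $\epsilon=1$.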

Two remarks are in order. First, since AUC is always at most 1, Proposition 1 implies that the approximation is also absolute, $\left|\widetilde{\mathit{auc}}-\mathit{auc}\right|\leq\epsilon/2$. The relative approximation is more accurate if AUC is small. However, if AUC is close to 1, it may make sense to reverse the approximation guarantee, that is, modify the algorithm such that we have a guarantee of $\left|\widetilde{\mathit{auc}}-\mathit{auc}\right|\leq(1-\mathit{auc})\epsilon/2$. This can be done by flipping the labels, and using $1-\textsc{ApproxAUC}(\mathit{C})$ as the estimate.

ApproxAUC runs in $\mathit{\mathcal{O}}\mathopen{}\left({\left|L\right|}\right)$ time. Next we establish that ${\left|L\right|}$ is small.

Proposition 2

Let $L$ be a $(1+\epsilon)$-compressed list. Then $\left|L\right|\in\mathcal{O}\left(\frac{\log k}{\epsilon}\right)$, where $k$ is the number of entries in the sliding window.

Proof

Write $L=u_{0},\ldots,u_{m}$. Since $L$ is $(1+\epsilon)$-compressed, $\mathit{hp}(u_{2})\geq 1$ and $\mathit{hp}(u_{i+2})>(1+\epsilon)\mathit{hp}(u_{i})$. Since $\mathit{hp}(u_{m})\leq k$, we have $(1+\epsilon)^{\lfloor m/2\rfloor-1}\leq k$. Solving for $m$ leads to $m\in\mathcal{O}\left(\frac{\log k}{\log(1+\epsilon)}\right)\subseteq\mathcal{O}\left(\frac{\log k}{\epsilon}\right)$. ∎
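The step "solving for $m$" can be spelled out as follows. The explicit constants are a worked illustration rather than part of the original proof, and the last line additionally assumes $0<\epsilon\leq 1$ and natural logarithms.

```latex
% Assumption for the last step: 0 < \epsilon \le 1 and natural logarithms.
\begin{align*}
(1+\epsilon)^{\lfloor m/2\rfloor-1}\le k
  &\;\Longrightarrow\; \lfloor m/2\rfloor \le \frac{\log k}{\log(1+\epsilon)}+1 \\
  &\;\Longrightarrow\; m \le \frac{2\log k}{\log(1+\epsilon)}+3
     && \text{since } \lfloor m/2\rfloor \ge (m-1)/2 \\
  &\;\Longrightarrow\; m \le \frac{4\log k}{\epsilon}+3
     && \text{since } \log(1+\epsilon)\ge \epsilon/2 \text{ for } 0<\epsilon\le 1 .
\end{align*}
```

As an example, for $k=1000$ and $\epsilon=0.1$ the second line gives $m\leq 2\cdot 6.908/0.0953+3\approx 148$, compared with up to $1000$ nodes in the full tree.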

4.2 Updating the data structures

Our final step is to describe procedures for maintaining $\mathit{C}$ as the data window slides. In the previous section, we already described how to update the search trees $T$ and $\mathit{TP}$ as well as the weighted linked list $P$. Our next step is to make sure that the weighted linked list $\mathit{C}$ stays $\alpha$-compressed.

We will need two utility routines. The first routine, AddNext, given in Algorithm 5, takes as input a node included in both $P$ and $\mathit{C}$ , and adds to $\mathit{C}$ the next node in $P$ . This procedure will be used extensively to add extra nodes to $\mathit{C}$ so that Eq. 3 is satisfied.

Next, we demonstrate how AddNext enforces Eq. 3.

Lemma 1

Assume that a linked list $L$ satisfies Eq. 3 for consecutive positive nodes $v$ and $w$ . Add or remove a single positive entry with a score $s$ , and assume that $v$ and $w$ are still positive. Let $u$ be the next positive node from $v$ in $P$ , and let $L^{\prime}$ be the list obtained from $L$ by adding a positive node $u$ . Then Eq. 3 holds for $L^{\prime}$ for the nodes $v$ and $u$ as well as for the nodes $u$ and $w$ .

Proof

Let us write $c_{x}=\mathit{hp}(x)$ before modifying $T$, and $c_{x}^{\prime}=\mathit{hp}(x)$ after the modification. Similarly, write $b_{x}=\mathit{p}(x)$ before the modification, and $b_{x}^{\prime}=\mathit{p}(x)$ after the modification.

Since $u$ is the next positive node of $v$ , we have $c_{u}^{\prime}=c_{v}^{\prime}+b_{v}^{\prime}\leq\alpha(c_{v}^{\prime}+b_{v}^{\prime})$ , proving the case of $v$ and $u$ .

If $s\geq\mathit{s}\mathopen{}\left(w\right)$ , then $c_{w}^{\prime}=c_{w}\leq\alpha(c_{v}+b_{v})=\alpha c_{u}=\alpha c_{u}^{\prime}\leq\alpha(c_{u}^{\prime}+b_{u}^{\prime})$ .

If we are adding $s$ and $s<\mathit{s}\mathopen{}\left(w\right)$ , then

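$$c_{w}^{\prime}=c_{w}+1\leq\alpha(c_{v}+b_{v}+1)\leq\alpha(c_{v}^{\prime}+b_{v}^{\prime}+1)=\alpha(c_{u}^{\prime}+1)\leq\alpha(c_{u}^{\prime}+b_{u}^{\prime}),$$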

where the last inequality holds since $u$ is a positive node.

If we are removing $s$ and $s<\mathit{s}\mathopen{}\left(w\right)$ , then $c_{v}+b_{v}-1\leq c_{v}^{\prime}+b_{v}^{\prime}$ , and so

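$$c_{w}^{\prime}\leq c_{w}\leq\alpha(c_{v}+b_{v})\leq\alpha(c_{v}^{\prime}+b_{v}^{\prime}+1)=\alpha(c_{u}^{\prime}+1)\leq\alpha(c_{u}^{\prime}+b_{u}^{\prime}).$$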

This proves the case for $u$ and $w$ , and completes the proof.∎

Note that AddNext runs in constant time. The key step is that we can obtain $\mathit{gp}(v;P)=\mathit{p}(v)$ and $\mathit{gn}(v;P)$ in constant time. This is the main reason why we maintain $P$.

While the first utility algorithm adds new entries to $\mathit{C}$, our second utility algorithm, Compress, given in Algorithm 6, tries to delete as many entries as possible. It assumes that the input list $\mathit{C}$ already satisfies Eq. 3, and searches for violations of Eq. 4. Whenever such a violation is found, the algorithm proceeds to delete the middle node. Note that deleting this node will not violate Eq. 3. Consequently, upon termination, the resulting linked list will be $\alpha$-compressed. The computational complexity of $\textsc{Compress}(\mathit{C},\alpha)$ is $\mathcal{O}\left(\left|\mathit{C}\right|\right)$.
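A single linear pass of this kind can be sketched as follows. This is an illustration, not the paper's Algorithm 6: it works on parallel arrays instead of a linked list, returns the indices of the surviving nodes rather than deleting in place, and the name `compress` and its signature are assumptions.

```python
def compress(hp, p, alpha):
    """Return indices of the nodes that survive one compression pass.

    hp[i] and p[i] describe the i-th node of a list that already satisfies
    Eq. 3. For consecutive nodes v, w, u the middle node w is dropped
    whenever Eq. 4 fails, that is, when hp(u) <= alpha * (hp(v) + p(v)).
    """
    kept = [0]
    i = 1
    while i < len(hp):
        v = kept[-1]
        bound = alpha * (hp[v] + p[v])
        # Drop middle nodes while the node after them is still within the bound.
        while i + 1 < len(hp) and hp[i + 1] <= bound:
            i += 1
        kept.append(i)
        i += 1
    return kept
```

Each index is visited once, so the pass is linear in the length of the list. Dropping a middle node keeps Eq. 3 intact for the new neighbors because the drop only happens when the later node is within the bound of the earlier one.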

Next, we describe the update steps. We will start with the easier ones:

Adding negative entry:

Given a negative entry with a score $s$, we first invoke AddTreeNeg. Then we search for $u\in\mathit{C}$ with the largest score such that $\mathit{s}(u)\leq s$. Once this entry is found, we increase $\mathit{gn}(u;\mathit{C})$ by 1.

Removing negative entry:

Given a negative entry with a score $s$, we first invoke RemoveTreeNeg. Then we search for $u\in\mathit{C}$ with the largest score such that $\mathit{s}(u)\leq s$. Once this entry is found, we decrease $\mathit{gn}(u;\mathit{C})$ by 1.

Since the positive labels are not modified, $\mathit{C}$ remains $\alpha$ -compressed, so there is no need for modifying $\mathit{C}$ . The running time for both routines is $\mathit{\mathcal{O}}\mathopen{}\left(\log k+\frac{\log k}{\epsilon}\right)$ .

Let us now consider more complex cases:

Adding positive entry:

Given a positive entry with a score $s$, we first invoke AddTreePos. Then we search for $u\in\mathit{C}$ with the largest score such that $\mathit{s}(u)\leq s$. Once this entry is found, we increase $\mathit{gp}(u;\mathit{C})$ by 1. By doing so, we may have violated Eq. 3 for $u$. Lemma 1 states that we can correct the problem by adding the next positive node for each violation. However, a closer inspection of the proof shows that there can be only one violation, namely at $u$. Consequently, we check whether Eq. 3 holds for $u$, and if it fails, we add the next positive node by invoking $\textsc{AddNext}(u,\mathit{C},P)$. Finally, we call $\textsc{Compress}(\mathit{C},\alpha)$ to enforce Eq. 4, ensuring that $\mathit{C}$ is $\alpha$-compressed. The pseudo-code for AddPos is given in Algorithm 7.

Removing positive entry:

Assume that we are given a positive entry with a score $s$. First we search for $u\in\mathit{C}$ with the largest score such that $\mathit{s}(u)\leq s$. Once this entry is found, we decrease $\mathit{gp}(u;\mathit{C})$ by 1. If $u$ is no longer positive, we add the next positive entry to $\mathit{C}$ and delete $u$ from $\mathit{C}$. The reason for this is explained below. We proceed by deleting the entry from the search trees with RemoveTreePos.

Next we make sure that Eq. 3 holds for every pair of consecutive nodes $v$ and $w$. There are two possible cases: ($i$) $v$ and $w$ were consecutive nodes in $\mathit{C}$ before the deletion, or ($ii$) $u$ was deleted from $\mathit{C}$, and $w$ was the next positive node before the deletion. In the first case, Lemma 1 guarantees that using AddNext enforces Eq. 3. In the second case, note that $\mathit{hp}(w)$ after the deletion is equal to $\mathit{hp}(u)$ before the deletion of $u$. This implies that since Eq. 3 held for $v$ and $u$ before the deletion, Eq. 3 holds for $v$ and $w$ after the deletion. Finally, we enforce Eq. 4 with Compress. The pseudo-code for RemovePos is given in Algorithm 8.

In both routines, modifying the search trees is done in $\mathit{\mathcal{O}}\mathopen{}\left(\log k\right)$ time, while modifying $\mathit{C}$ is done in $\mathit{\mathcal{O}}\mathopen{}\left({\left|\mathit{C}\right|}\right)\subseteq\mathit{\mathcal{O}}\mathopen{}\left(\frac{\log k}{\epsilon}\right)$ time.

5 Related work

The closest related work is a study by Bouckaert [3], where the author divided the ROC curve area into bins, so that only the counters for individual bins need to be maintained. However, the number of bins as well as the bins themselves were static, and no direct approximation guarantees were provided.

Using AUC in a streaming setting was proposed in a paper by Brzezinski and Stefanowski [5]. Here the authors use a red-black tree, similar to $T$, to maintain the order of the data points in a sliding window, but they recompute the AUC from scratch every time, leading to an update time of $\mathcal{O}\left(k+\log k\right)$. In fact, our approach is essentially equivalent to their approach if we set $\epsilon=0$.

Note that using AUC is useful if we do not have a threshold to binarize the score. If we do have such a threshold, then we can easily maintain a confusion matrix, which allows us to compute many metrics, such as accuracy, recall, $F1$-measure [9, 8], and Kappa-statistic [2, 13]. However, determining such a threshold may be extremely difficult since it depends on the misclassification costs. Selecting such costs may come down to a(n educated) guess.
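To illustrate why the thresholded case is easy, the following sketch maintains a confusion matrix over a sliding window in constant time per update. The class name `WindowConfusion` and the rule "predict positive if the score is at least the threshold" are assumptions made for this example only.

```python
from collections import deque


class WindowConfusion:
    """Confusion matrix over the last k entries for a fixed threshold."""

    def __init__(self, k, threshold):
        self.k = k
        self.threshold = threshold
        self.window = deque()  # the cell each entry fell into, in arrival order
        self.counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}

    def _cell(self, score, label):
        predicted_positive = score >= self.threshold
        if label == 1:
            return "tp" if predicted_positive else "fn"
        return "fp" if predicted_positive else "tn"

    def add(self, score, label):
        cell = self._cell(score, label)
        self.window.append(cell)
        self.counts[cell] += 1
        if len(self.window) > self.k:
            # Forget the oldest entry by decrementing its cell.
            self.counts[self.window.popleft()] -= 1

    def accuracy(self):
        total = len(self.window)
        if total == 0:
            return None
        return (self.counts["tp"] + self.counts["tn"]) / total

    def recall(self):
        positives = self.counts["tp"] + self.counts["fn"]
        if positives == 0:
            return None
        return self.counts["tp"] / positives
```

Each update touches at most two counters, so no search structure is needed. The difficulty lies entirely in choosing the threshold.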

We based our AUC calculation on a sliding window, that is, we abruptly forget the data points after a certain period of time. The other option is to gradually forget the data points, for example using an exponential decay (see a survey by Gama et al. [10] for such examples). There is currently no methodology for efficiently estimating AUC under exponential decay, and this is a promising future line of work.

In a related line of work, training a classifier by optimizing AUC in a static setting has been proposed by Ataman et al. [1], Ferri et al. [7], Brefeld and Scheffer [4], Herschtal and Raskutti [12]. Here, AUC is used as an optimization criterion, and needs to be recomputed from scratch in $\mathit{\mathcal{O}}\mathopen{}\left({\left|D\right|}\log{\left|D\right|}\right)$ time. Naturally, this may be too expensive for large databases. Calders and Jaroszewicz [6] estimated AUC as a continuous function. This allowed them to view AUC as a smooth function, and to optimize the parameters of the underlying classifier efficiently using gradient descent techniques. While the underlying problem is the same as ours, that is, computing AUC from scratch is expensive, the maintenance procedures make the problems orthogonal: in our setting we are required to do updates when a single data point leaves or enters our window, whereas here AUC needs to be recomputed since the scores (and the order) of all existing data points have changed. However, it may be possible and fruitful to use similar tricks in order to speed up the AUC calculation when optimizing classifiers. We leave this as a future line of work.

Hand [11] proposed a fascinating alternative for AUC. Namely, the author views AUC as the optimal classification loss averaged (with weights) over the misclassification cost ratio. He then argues that AUC evaluates incoherently, namely the cost ratio weights depend on the ROC curve, and proposes a different, coherent alternative. The computation of the proposed metric, though more complex, shares some similarity with AUC, and it may be possible to use similar techniques as in this paper to approximate this measure efficiently in a stream.

6 Experimental evaluation

In this section we present our experimental evaluation. We have two goals: to demonstrate the relative error in practice as a function of the guaranteed error, and to demonstrate the trade-off between the computational cost and the error.

We implemented the calculation of AUC using C++, and conducted the experiments using a MacBook Air (1.6 GHz Intel Core i5 / 8 GB Memory); see https://bitbucket.org/orlyanalytics/streamauc for the implementation. As a classifier we used Python's scikit-learn implementation of logistic regression. Computing AUC was done in a separate job from training the classifier as well as scoring new data points; the reported running times measure only the computation of AUC over the whole test data.

We used 3 UCI datasets (https://archive.ics.uci.edu/) for our experiments, see Table 1: ($i$) Hepmass: a dataset containing features from simulated particle collisions, split into training and test datasets. We used the Hepmass-1000 variant. Due to the memory restrictions of Python, we only used a sample of $500\,000$ data points from the training data. We used the whole test dataset. ($ii$) Miniboone: a dataset used to distinguish electron neutrinos from muon neutrinos. Since the original data has data points ordered by label, we permuted the dataset and split it into training and test data. ($iii$) Tvads: a dataset containing features for identifying commercials from TV news channels. We used the BBC and CNN channels as training data, and the remaining channels as test data.

Actual error vs. guarantee: Proposition 1 states that the error cannot be more than $\epsilon/2$. First, we test the actual relative error, that is, ${\left|\widetilde{\mathit{auc}}-\mathit{auc}\right|}/\mathit{auc}$ as a function of $\epsilon$. Here we set the sliding window size to be $1000$.

The top row of Figure 1 shows the relative error, averaged over all sliding windows, and the bottom row of Figure 1 shows the relative error, maximized over all sliding windows. From the results we see that both the maximum and the average error are smaller than the guaranteed bound. In particular, the average error is typically several orders of magnitude smaller than the theoretical guarantee. As expected, both errors tend to increase as $\epsilon$ increases.

Computational cost vs. error: Next, we test the trade-off between the computational cost and the relative error. The top row of Figure 2 shows the running time as a function of the average error, while the bottom row of Figure 2 shows the size of the $(1+\epsilon)$-compressed list as a function of the average error. Here, we used a window size of $1000$.

From the results, we see the trade-off between the error and the running time: as the error increases, the running time drops. This is mainly due to the smaller number of elements in the compressed list, as demonstrated in the bottom row. The running time stabilizes for larger errors; this is due to the operations that do not depend on $\epsilon$, such as maintaining the binary tree $T$.

Computational cost vs. window size: Computing exact AUC requires $\mathit{\mathcal{O}}\mathopen{}\left(k\right)$ time while estimating AUC is $\mathit{\mathcal{O}}\mathopen{}\left(\log k/\epsilon\right)$ . Consequently, the speed-up should increase as the size of the sliding window increases. We demonstrate this effect in Figure 3 using the Miniboone dataset. We see that the speed-up increases as a function of window size: computing estimates using $\epsilon=0.1$ is 17 times faster for a window size of $10\,000$ .

7 Concluding remarks

In this paper we introduced an approximation scheme that allows us to maintain an estimate of AUC in a sliding window within the guaranteed relative error of $\epsilon/2$ in $\mathit{\mathcal{O}}\mathopen{}\left((\log k)/\epsilon\right)$ time. The key idea behind the estimator is to group the data points. The grouping has to be done cleverly so that the error stays small, the number of groups stays small, and the list can be updated quickly. We achieve this by maintaining groups, where the number of positive labels can only increase relatively by $(1+\epsilon)$ within one group, and must increase by at least $(1+\epsilon)$ within two groups. Our experimental evaluation suggests that the average error in practice is much smaller than the guaranteed approximation, and that we can achieve significant speed-up, especially as the window size grows.
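The effect of grouping on the estimate can be illustrated with a small sketch. It is not the estimator of this paper: the groups below are formed by merging a fixed number of neighbouring points, whereas the actual grouping is governed by Eq. 3 and Eq. 4. The function names are hypothetical, and we assume that every positive-negative pair inside a group contributes one half, as if the two scores were tied, and that a positive should outscore a negative.

```python
def auc_from_groups(groups):
    """AUC from (positives, negatives) counts per group, ordered by score.

    Pairs that fall inside the same group are scored 1/2.
    """
    pos = sum(p for p, _ in groups)
    neg = sum(n for _, n in groups)
    if pos == 0 or neg == 0:
        return None
    total, neg_below = 0.0, 0
    for p, n in groups:
        total += p * (neg_below + n / 2)
        neg_below += n
    return total / (pos * neg)


def merge_groups(groups, size):
    """Merge every `size` neighbouring groups into one (illustrative rule only)."""
    merged = []
    for start in range(0, len(groups), size):
        chunk = groups[start:start + size]
        merged.append((sum(p for p, _ in chunk), sum(n for _, n in chunk)))
    return merged


# Labels of eight points with distinct scores, sorted by increasing score.
labels = [0, 1, 0, 1, 1, 0, 1, 1]
singletons = [(label, 1 - label) for label in labels]

exact = auc_from_groups(singletons)                    # one point per group
approx = auc_from_groups(merge_groups(singletons, 2))  # four coarser groups
relative_error = abs(approx - exact) / exact
```

Here the exact value is $11/15$, the grouped estimate is $7/10$, and the relative error is $1/22$. Coarser groups shorten the list but blur the order inside each group, which is the trade-off that $\epsilon$ controls.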

Our algorithm relies on the fact that the data points have no weights; specifically, Lemma 1 relies on the fact that an update may change the counters only by 1. If the data points are weighted, a different approach is required: it is possible to construct a $(1+\epsilon)$-list from scratch. The key idea here is a new query, where, given a threshold $\sigma$, we look for a node $v$ that has the largest $\mathit{hp}\mathopen{}\left(v\right)$ such that $\mathit{hp}\mathopen{}\left(v\right)\leq\sigma$. This query can be done using the same trick as in HeadStats, and it requires $\mathit{\mathcal{O}}\mathopen{}\left(\log k\right)$ time. The list can then be constructed by calling this query with exponentially increasing thresholds $\mathit{\mathcal{O}}\mathopen{}\left((\log k)/\epsilon\right)$ times. This leads to a running time of $\mathit{\mathcal{O}}\mathopen{}\left((\log^{2}k)/\epsilon\right)$. An interesting direction for future work is to improve this complexity to, say, $\mathit{\mathcal{O}}\mathopen{}\left((\log k)/\epsilon\right)$.
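The construction from scratch can be sketched as below, under several simplifying assumptions. The head statistics $\mathit{hp}$ are given as a plain non-decreasing array indexed by nodes in score order, so a binary search with `bisect` replaces the tree-based query. The stopping rule "stay within a factor of $(1+\epsilon)$ of the current node" is a stand-in for Eq. 3 and Eq. 4, not their exact form. The names `query` and `build_list` are hypothetical.

```python
import bisect


def query(hp, sigma):
    """Index of the last node v with hp[v] <= sigma, or -1 if there is none."""
    return bisect.bisect_right(hp, sigma) - 1


def build_list(hp, eps):
    """Select node indices using exponentially increasing thresholds."""
    chosen = []
    i = 0
    while i < len(hp):
        chosen.append(i)
        # Furthest node whose head count is within (1 + eps) of the current one.
        j = query(hp, (1 + eps) * hp[i])
        # Always advance by at least one node so that the loop terminates.
        i = max(j, i + 1)
    return chosen


# Head counts of nine nodes; with eps = 1 the threshold doubles each step.
print(build_list([0, 1, 2, 3, 4, 5, 6, 7, 8], eps=1.0))  # [0, 1, 2, 4, 8]
```

In this sketch the head count grows by more than a factor of $(1+\epsilon)$ over any two consecutive steps, so the number of queries is logarithmic in the largest head count, and each query costs a logarithmic number of comparisons.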

Bibliography


[1] Ataman, K., Street, W., Zhang, Y.: Learning to rank by maximizing AUC with linear programming. In: Neural Networks, 2006. IJCNN'06. International Joint Conference on. pp. 123–129. IEEE (2006)
[2] Bifet, A., Frank, E.: Sentiment knowledge discovery in Twitter streaming data. In: Discovery Science. pp. 1–15. Springer (2010)
[3] Bouckaert, R.R.: Efficient AUC learning curve calculation. In: Australasian Joint Conference on Artificial Intelligence. pp. 181–191 (2006)
[4] Brefeld, U., Scheffer, T.: AUC maximizing support vector learning. In: Proceedings of the ICML 2005 workshop on ROC Analysis in Machine Learning (2005)
[5] Brzezinski, D., Stefanowski, J.: Prequential AUC: properties of the area under the ROC curve for data streams with concept drift. KAIS 52(2), 531–562 (2017)
[6] Calders, T., Jaroszewicz, S.: Efficient AUC optimization for classification. In: PKDD. pp. 42–53 (2007)
[7] Ferri, C., Flach, P., Hernández-Orallo, J.: Learning decision trees using the area under the ROC curve. In: ICML. vol. 2, pp. 139–146 (2002)
[8] Gama, J.: Knowledge discovery from data streams. CRC Press (2010)