Active Learning with Combinatorial Coverage

Sai Prathyush Katragadda; Tyler Cody; Peter Beling; Laura Freeman

arXiv:2302.14567·cs.LG·March 1, 2023


TL;DR

This paper introduces a data-centric active learning approach using combinatorial coverage, improving transferability and reducing bias compared to traditional model-centric methods.

Contribution

It proposes novel active learning methods based on combinatorial coverage that address transferability and sampling bias issues.

Findings

1. Coverage-based methods enhance data transferability to new models.

2. The proposed approach achieves competitive sampling bias.

3. Experimental results validate the effectiveness of combinatorial coverage in active learning.

Abstract

Active learning is a practical field of machine learning that automates the process of selecting which data to label. Current methods are effective in reducing the burden of data labeling but are heavily model-reliant. This has led to the inability of sampled data to be transferred to new models as well as issues with sampling bias. Both issues are of crucial concern in machine learning deployment. We propose active learning methods utilizing combinatorial coverage to overcome these issues. The proposed methods are data-centric, as opposed to model-centric, and through our experiments we show that the inclusion of coverage in active learning leads to sampling data that tends to be the best in transferring to better performing models and has a competitive sampling bias compared to benchmark methods.

Tables

Table 1. Data Set Information

Data Set	Datapoints	Features	Batch Size	Number of Batches
Tic-Tac-Toe	957	9	100	8
Balance Scale	624	4	25	21
Car Evaluation	1727	6	100	15
Chess	28066	6	100	246
Nursery	12959	8	100	113
Monk	414	6	25	13

Table 2. AUC of F1 vs. Query Points Added for Each Method, Model, and Dataset

Active Learning Method	Model	Monk	Balance Scale	Car Evaluation	Tic-Tac-Toe	Nursery	Chess
Random Sampling	Random Forest	221.60	462.63	1303.77	696.84	10499.40	9437.38
Random Sampling	Decision Tree	210.46	426.47	1509.68	764.03	11201.24	17051.97
Random Sampling	SVM	219.71	488.30	1321.41	690.35	10997.11	11464.67
Uncertainty Sampling	Random Forest	228.47	459.87	1332.55	714.21	10507.02	8972.98
Uncertainty Sampling	Decision Tree	213.20	433.73	1455.13	758.59	11126.62	15848.8
Uncertainty Sampling	SVM	221.67	477.82	1305.52	710.00	10800.42	10537.07
Query by Committee	Random Forest	224.90	477.55	1346.31	731.57	10430.06	9632.39
Query by Committee	Decision Tree	217.30	415.52	1491.37	761.22	11301.59	17196.40
Query by Committee	SVM	222.03	489.51	1395.44	717.71	11056.61	11031.15
Information Density Sampling	Random Forest	220.91	466.80	1374.51	712.80	10563.24	9511.56
Information Density Sampling	Decision Tree	213.28	419.28	1517.34	760.35	11299.12	17593.14
Information Density Sampling	SVM	218.88	488.57	1403.19	703.33	10971.20	11048.85
Coverage Density Sampling	Random Forest	219.68	453.42	1297.09	705.78	10390.65	9512.64
Coverage Density Sampling	Decision Tree	211.20	408.87	1507.07	775.08	11090.34	16704.74
Coverage Density Sampling	SVM	222.81	484.27	1304.94	681.57	10969.15	11381.05
Informative Coverage Density Sampling	Random Forest	222.02	445.56	1314.82	693.15	10376.79	9465.12
Informative Coverage Density Sampling	Decision Tree	209.71	409.81	1505.03	759.47	11065.89	16720.74
Informative Coverage Density Sampling	SVM	221.73	479.83	1317.73	686.84	10970.84	11369.16
USWCD	Random Forest	220.05	477.21	1388.95	716.84	10522.23	9485.49
USWCD	Decision Tree	209.74	425.33	1531.87	778.59	11245.97	17300.60
USWCD	SVM	218.84	493.54	1404.94	711.05	11006.75	11482.56

Table 3. Percent Difference in AUC Between Best Method and Each Method on Original Model by Dataset (%)

Active Learning Method	Monk	Balance Scale	Car Evaluation	Tic-Tac-Toe	Nursery	Chess
Random Sampling	-3.01	-3.13	-6.13	-4.75	-0.61	-2.03
Uncertainty Sampling	0.00	-3.70	-4.06	-2.37	-0.53	-6.85
Query by Committee	-1.56	0.00	-3.07	0.00	-1.26	0.00
Information Density Sampling	-3.31	-2.25	-1.04	-2.57	0.00	-1.25
Coverage Density Sampling	-3.85	-5.05	-6.61	-3.53	-1.63	-1.24
Informative Coverage Density Sampling	-2.82	-6.70	-5.34	-5.25	-1.77	-1.74
USWCD	-3.69	-0.07	0.00	-2.01	-0.39	-1.53

Table 4. Percent Difference in AUC from Random Sampling When Model Changes and Performance Increases (%)

Active Learning Method	Model	Balance Scale	Car Evaluation	Tic-Tac-Toe	Nursery	Chess
Uncertainty Sampling	Decision Tree	n/a	-3.61	-0.71	-0.67	-7.06
Uncertainty Sampling	SVM	-2.15	-1.20	n/a	-1.79	-8.09
Query by Committee	Decision Tree	n/a	-1.21	-0.37	0.9	0.85
Query by Committee	SVM	0.25	5.60	n/a	0.54	-3.78
Information Density Sampling	Decision Tree	n/a	0.51	-0.48	0.87	3.17
Information Density Sampling	SVM	0.06	6.19	n/a	-0.24	-3.63
Coverage Density Sampling	Decision Tree	n/a	-0.17	1.45	-0.99	-2.04
Coverage Density Sampling	SVM	-0.83	-1.25	n/a	-0.25	-0.73
Informative Coverage Density Sampling	Decision Tree	n/a	-0.31	-0.6	-1.21	-1.94
Informative Coverage Density Sampling	SVM	-1.74	-0.28	n/a	-0.24	-0.83
USWCD	Decision Tree	n/a	1.47	1.91	0.40	1.46
USWCD	SVM	1.07	6.32	n/a	0.09	0.16

Table 5. Median Normalized Area Under Curve of F1 vs. Query Points Added

Active Learning Method	Random Forest	Decision Tree	SVM
Random Sampling	.376	.632	.455
Uncertainty Sampling	.498	.129	.003
Query by Committee	.797	.622	.854
Information Density Sampling	.740	.641	.634
Coverage Density Sampling	.160	.344	.534
Informative Coverage Density Sampling	.097	.041	.406
USWCD	.779	.798	.908

Equations

$$SDCC^{t}(D_{U}, D_{L}) = \frac{|D_{U}^{t} \setminus D_{L}^{t}|}{|D_{U}^{t}|}$$

$$\underset{i}{\arg\max} \sum_{i} c_{i} x_{i} \quad \text{s.t.} \quad \sum_{i} x_{i} \leq b$$

$$I(x_{i}) = c_{i} \frac{1}{U} sim(\boldsymbol{x}, x_{i})$$

$$sim(\boldsymbol{x}, x_{i}) = \sum_{x \in \boldsymbol{x}} \frac{x \cdot x_{i}}{|x| |x_{i}|} \tag{1}$$

$$\underset{i}{\arg\max} \sum_{i} I(x_{i}) x_{i} \quad \text{s.t.} \quad \sum_{i} x_{i} \leq b$$

$$I(x_{i}) = H(x_{i}) c_{i}$$

$$H(x_{i}) = -\sum_{y \in Y} p(y_{i}) \log(p(y_{i}))$$

$$\underset{i}{\arg\max} \sum_{i} I(x_{i}) x_{i} \quad \text{s.t.} \quad \sum_{i} x_{i} \leq b$$

$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$

$$Precision = \frac{tp}{tp + fp}$$

$$Recall = \frac{tp}{tp + fn}$$

$$\text{Sampling Bias} = 1 - \frac{H_{D_{L}}}{H_{Balanced}} \tag{2}$$


Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Data Quality and Management

Full text

Active Learning with Combinatorial Coverage

Sai Prathyush Katragadda

Grado Department of Industrial and Systems Engineering

Virginia Tech

Blacksburg, Virginia, USA

Tyler Cody, Peter Beling, Laura Freeman

Virginia Tech National Security Institute

Virginia Tech

Arlington, Virginia, USA


Index Terms:

active learning, combinatorial coverage, combinatorial interaction testing

I Introduction

Data preprocessing, which involves data labeling, is often the most expensive aspect of deploying machine learning [1]. Active learning is a sub-field of machine learning that proposes a solution to this problem; it is concerned with selecting which unlabeled datapoints would be most beneficial to label and train on [2]. This is especially applicable when there is a large pool of unlabeled data, but there is a restriction on the number of datapoints that can be labeled due to budget or time constraints.

Active learning has found success in many applications including image segmentation [3], sequence labeling [4], medical image classification [5], cybersecurity [6, 7], and manufacturing [8]. Yet active learning methods are heavily model-dependent, so datapoints labeled for one model may not be effective for training other models [9, 10]. As pointed out by Paleyes and Urma, model selection is often an iterative process in machine learning deployment [11]. Therefore, for the use of active learning to be practical, the sampled datapoints should also be effective in training other models.

Current active learning methods are successful in particular applications; in other words, a specific method fares well for some combinations of dataset and model [9, 10]. The issue is that, given a dataset, identifying which active learning method is the most effective for a given model may defeat the purpose of applying active learning due to the required investment of resources. This is especially true in deployment scenarios where the model type is constantly updated, e.g., from decision tree to random forest to support vector machine, and so on.

The cause of this issue is the model dependency of active learning methods. Typically, points to be labeled are regarded as beneficial to a specific model. However, these labeled points may not be the ideal datapoints for training a different model. Additionally, these points may be from a particular area of the feature space, creating sampling bias. Several methods have been proposed in the literature to combat the sampling bias issue, few of which are generalizable to any model and none of which are model independent [4, 12, 13, 14, 15, 16, 17]. To our knowledge, no data-centric active learning methods have been proposed to sample data so that other models are applicable without resampling.

In short, active learning methods optimize data labeling for a given model, but struggle to sample data which is also effective for training other models. This issue is closely related to the sampling bias issue, which results from the model dependency of existing active learning methods. In this work we propose an active learning approach based on combinatorial coverage (CC) that is data-centric, can generalize to any model, samples data which can be effectively transferred to new models, and achieves improvements in sampling bias. We contribute three CC-based active learning methods:

• coverage density sampling,

• informative coverage density sampling, and

• uncertainty sampling weighted by coverage density.

While combinatorial interaction testing (CIT) and CC are not widespread in machine learning, several applications have proven successful [18, 19, 20, 21, 22, 23]. We leverage these ideas to develop the proposed methods and show that they are competitive in classification performance, both for the model used during sampling and for different models trained on the same data, and that they offer advantages in sampling bias.

The rest of this paper is organized as follows. Next, background is given on active learning and CIT. Then, three CC-based active learning methods are proposed. Subsequently, the experimental design is described and results are presented. The results cover 6 publicly available data sets. Then, before concluding with a synopsis, results and future work are discussed.

II Background

II-A Active Learning

Active learning methods can be divided into three groups: membership query synthesis, stream-based selective sampling, and pool-based sampling [2]. In membership query synthesis, the model can arbitrarily select datapoints to label. Stream-based selective sampling involves the model receiving a stream of datapoints and deciding, one at a time, whether each should be labeled. Pool-based sampling, which is the focus of this work, involves the model drawing a set of unlabeled samples to label from the entire pool of unlabeled samples available. Several popular and generalizable query strategies exist for active learning. Uncertainty sampling selects the data point the model is currently most uncertain about [24]. Query by committee selects the points on which a committee of classifiers most disagrees or is, on average, most uncertain [25].

The dependency of active learning on the model being used has led to issues in data transferability and sampling bias. The process of sampling data with respect to one model and using it for other models is shown in Figure 1, where data is sampled according to the initial model and that same pool of labeled data is used to train different models downstream in the model deployment life cycle. Solutions to transferability have not been proposed, but there has been work examining how well the data selected by popular methods transfers to other models [9, 26, 10, 27]. Lowell et al. find that, in generic classification problems, transferability of data sampled using uncertainty sampling is not guaranteed [9]. Baldridge and Miles share a similar finding: data generated by random sampling are more transferable to other models than data generated by uncertainty sampling [26]. Tomanek and Morik draw a similar conclusion, but they find that data sampled using uncertainty sampling is transferable to other models for some tasks [10]. Pardakhti et al. also reference the inability of active learning methods to sample data that is effective for multiple models, but their work focuses on finding the optimal hyperparameters for a model given a data set and active learning method [27].

The dependency of active learning on the model being used also leads to issues with sampling bias. Several methods have been proposed in the literature to counteract this. To improve the sampling bias of any baseline sampling method, Settles and Craven propose an information density method, which is computationally expensive for large pools of unlabeled data [4]. Several other successful methods have been proposed to combat the sampling bias and robustness issues, but all these methods are designed to work with specific model types, e.g., with convolutional neural networks [12, 13, 28, 15, 16]. A generalizable method for reducing sampling bias is proposed by Elhamifar et al. [17], but it involves an optimization problem with $n^{2}$ variables, where $n$ is the number of datapoints in the data set, so the method is not scalable to large data sets.

In this manuscript, the proposed CC-based active learning algorithms are compared to random sampling, uncertainty sampling [2], query by committee [25], and information density [4]. For pool based active learning, all methods must select some subset of datapoints from an (unlabeled) query set of data. We implement them as follows.

• For random sampling, we assume a uniform distribution over all datapoints and select datapoints from the query set with equal probability.

• For uncertainty sampling, we use entropy, defined as $-\sum_{y\in Y}p(y)\log(p(y))$ where $y$ is a class and $Y$ is the set of all classes. The model trained on currently labeled data is applied to the query set to determine the probability of each query datapoint belonging to each class. These probabilities are then used for the entropy calculation, and the datapoints for which the model has the highest entropy are selected from the query set.

• For query by committee, we also use entropy. Entropy is used to determine which datapoints the committee of classifiers is, on average, most uncertain about. The committee comprises three classifiers: random forest, k-nearest neighbors, and logistic regression. The datapoints with the highest average entropy are selected from the query set.

• For information density, as presented by Settles and Craven [4], we weight an informativeness measure by a similarity metric. We weight the entropy for a datapoint with its cosine similarity to the labeled data divided by the cardinality of the unlabeled set.
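The entropy-based selection used by the uncertainty sampling and query by committee benchmarks can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`entropy`, `uncertainty_scores`, `committee_scores`, `top_k`) and the example probabilities are hypothetical, and the class probabilities are assumed to come from whatever classifier is in the learning loop.

```python
import math

def entropy(probs):
    """Shannon entropy -sum(p * log(p)); zero-probability classes contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_scores(class_probs):
    """Entropy of one model's predicted class distribution for each query point."""
    return [entropy(p) for p in class_probs]

def committee_scores(committee_probs):
    """Average entropy per query point across committee members."""
    per_member = [uncertainty_scores(member) for member in committee_probs]
    n_members = len(per_member)
    return [sum(col) / n_members for col in zip(*per_member)]

def top_k(scores, k):
    """Indices of the k highest scores (ties keep their original order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical predicted probabilities for four query points and two classes.
model_probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3], [1.0, 0.0]]
selected = top_k(uncertainty_scores(model_probs), k=2)  # batch of two points
```

With these example probabilities, the two points whose predictions are closest to uniform are selected; a confidently classified point such as `[1.0, 0.0]` has zero entropy and is chosen last.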

We compare these benchmark methods to the proposed methods using 6 open-source data sets that are described in Section IV.

II-B Combinatorial Interaction Testing

CIT stems from covering arrays, ultimately derived from the statistical field of design of experiments, and is principally concerned with designing tests that guarantee coverage of all interactions up to a certain level. In CIT, an interaction level is the number of system components for which possible interactions should be included in the test set. For example, pairwise interaction testing, which is an interaction level of two, aims to design a test set with datapoints containing every possible interaction between the values of every two system components. A thorough review of CIT is provided by Nie and Leung [29].

CIT has been applied to several fields but has been particularly successful in software testing. The application of CIT to software testing has proven capable of fault detection while minimizing the test set size requirements, as a majority of failures can be attributed to the interaction among a small number of parameters [30].

The extension of CIT to machine learning involves treating the feature space being used to train and test the model as the system parameters. Values are then the specific values each feature can take. A $t$ -way interaction is the same as a $t$ -way value combination. This is defined as a $t$ -tuple of (feature, value) pairs. For example, a 3-way value combination for a car condition classification dataset could be a specific combination of values for mileage, age, and days since last inspection, e.g., ‘150,000 miles traveled’, ‘20 years old’, and ‘168 days since last inspection’.

An extension of CIT is CC, which is the proportion of possible t-way interactions which appear in a set [31]. ‘Covered’ combinations are those interactions which do appear in a set, and ‘not covered’ are those which do not appear in a set. As CC is concerned with the universe of all possible interactions, Lanus et al. extend CC to Set Difference Combinatorial Coverage (SDCC) [22]. SDCC is the proportion of interactions contained in one dataset but not in another, and is formally defined as follows.

Definition 1 ( $t$ -way Set Difference Combinatorial Coverage).

Let $D_{L}$ and $D_{U}$ be sets of data, and $D_{L}^{t}$ and $D_{U}^{t}$ be the corresponding $t$-way sets of data. The set difference $D_{U}^{t}\setminus D_{L}^{t}$ gives the value combinations that are in $D_{U}^{t}$ but that are not in $D_{L}^{t}$. The $t$-way set difference combinatorial coverage is

$$SDCC^{t}(D_{U}, D_{L}) = \frac{|D_{U}^{t} \setminus D_{L}^{t}|}{|D_{U}^{t}|}.$$
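Definition 1 can be sketched in a few lines of Python. This is an illustrative reading, assuming each datapoint is a tuple of categorical feature values and that a $t$-way value combination is represented as a tuple of (feature index, value) pairs; the helper names `t_way_combinations` and `sdcc` and the toy rows are hypothetical.

```python
from itertools import combinations

def t_way_combinations(rows, t):
    """Set of t-way value combinations: tuples of t (feature_index, value) pairs."""
    combos = set()
    for row in rows:
        for features in combinations(range(len(row)), t):
            combos.add(tuple((f, row[f]) for f in features))
    return combos

def sdcc(unlabeled, labeled, t):
    """Fraction of t-way combinations in the unlabeled set missing from the labeled set."""
    d_u = t_way_combinations(unlabeled, t)
    d_l = t_way_combinations(labeled, t)
    if not d_u:
        return 0.0  # assumption: an empty unlabeled set is treated as fully covered
    return len(d_u - d_l) / len(d_u)

# Toy example with three categorical features.
labeled = [("x", "o", "b")]
unlabeled = [("x", "o", "x"), ("o", "o", "b")]
pairwise_sdcc = sdcc(unlabeled, labeled, t=2)
```

In this toy example the unlabeled rows contain six distinct 2-way combinations, four of which do not occur in the labeled row, so the pairwise SDCC is 4/6.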

Kuhn et al. [21] show how combinatorial interactions can be used for explainable artificial intelligence. Combinatorial interactions have been used to better define the activity of hidden layers in deep learning [20], and CC has been used for the testing of deep learning models [18, 19]. CC has also been used as a holistic approach for training and testing of models [23]. Lanus et al. utilize SDCC for failure analysis of machine learning, and find that a dataset with greater coverage leads to a better performing model [22]. Cody et al. expand their experiments and apply SDCC to MNIST [23].

III Methods

The proposed query criterion relies heavily on SDCC, where the labeled dataset is considered $D_{L}$ and the unlabeled dataset $D_{U}$. The query strategy involves finding those datapoints in the unlabeled pool which contain interactions not included in the labeled set, up to an interaction level of 6, since a majority of software failures are found at or below this level [30]. Datapoints which contain a greater number of missing interactions receive a higher priority for labeling. Once the hierarchy of datapoints to label has been determined, selection according to this hierarchy is done in three ways: coverage density sampling, informative coverage density sampling, and uncertainty sampling weighted by coverage density. These data-centric methods should aid in sampling points which allow for data transferability to new models, as illustrated in Figure 1.

Algorithm 1 presents a method to determine coverage density given unlabeled and labeled datasets. As a data point from the query set can contain several missing interactions, the sum of the number of missing interactions it contains could be considered as the density of coverage at that point. Lower level interactions are expected to be associated with a greater number of classes than higher level interactions, so they should hold a greater weight. The proposed weighting scheme uses the decreasing function $\frac{1}{t}$ for $t=1,\ldots,6$, where $t$ is the interaction level. The coverage density of a point is then the weighted sum of all missing interactions contained in that data point. This density is used in determining which datapoints to query in the proposed methods. Coverage density is formally defined as follows.

Definition 2 (Coverage Density).

Let $D_{L}$ and $D_{U}$ be sets of data, and let $j\in D_{L}$ and $i\in D_{U}$. Also, let $j_{t}$ and $i_{t}$ be the corresponding $t$-way sets of data. Then the coverage density of $i$ at level $t$ is $c_{i_{t}} = \sum_{j\in D_{L}}{\frac{1}{t}\ \forall\ i_{t}\ \text{not in}\ j_{t}}$.

The coverage density of each $i\in D_{U}$ is $\sum_{t\in T}c_{i_{t}}$, where $T$, the highest interaction level, is user specified.
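One possible reading of Definition 2 is sketched below. It is an assumption, not the paper's Algorithm 1: for each interaction level $t$ from 1 to the user-specified maximum, the sketch counts the $t$-way value combinations of a query point that appear in no labeled point and weights each by $\frac{1}{t}$. The names `point_combinations` and `coverage_density` and the toy data are hypothetical.

```python
from itertools import combinations

def point_combinations(row, t):
    """t-way value combinations of one datapoint as tuples of (feature_index, value)."""
    return {
        tuple((f, row[f]) for f in features)
        for features in combinations(range(len(row)), t)
    }

def coverage_density(point, labeled, max_t):
    """Weighted count of the point's value combinations that are missing from the labeled set."""
    density = 0.0
    for t in range(1, max_t + 1):
        covered = set()
        for row in labeled:
            covered |= point_combinations(row, t)
        missing = point_combinations(point, t) - covered
        density += len(missing) / t  # each missing t-way interaction weighs 1/t
    return density

labeled = [("x", "o", "b")]
novel_point_density = coverage_density(("x", "o", "x"), labeled, max_t=2)
seen_point_density = coverage_density(("x", "o", "b"), labeled, max_t=2)
```

Here the point `("x", "o", "x")` contributes one missing 1-way interaction (weight 1) and two missing 2-way interactions (weight 1/2 each), for a density of 2.0, while a point identical to a labeled point has density 0.0.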

All proposed sampling methods use the same definition of variables. $x_{i}$ is a binary variable valued 1 if data point $i$ will be sent to the oracle for labeling and 0 otherwise. $c_{i}$ is the coverage density of data point i as previously defined, and $b$ is the budget or number of points we are allowed to select. The three proposed methods to sample points are presented in Definitions 3-5.

Definition 3 (Coverage Density Sampling).

Let $c_{i}$ and $b$ be given. The points selected are those with the highest coverage density:

$$\underset{i}{\arg\max} \sum_{i} c_{i} x_{i} \quad \text{s.t.} \quad \sum_{i} x_{i} \leq b$$
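Because the only constraint in Definition 3 is the budget $b$ and coverage densities are non-negative, picking the $b$ points with the largest $c_{i}$ satisfies the selection rule. The sketch below assumes the densities have already been computed; `coverage_density_sampling` and the example values are hypothetical.

```python
def coverage_density_sampling(densities, budget):
    """Return the binary selection vector x and the chosen indices for a given budget."""
    ranked = sorted(range(len(densities)), key=lambda i: densities[i], reverse=True)
    chosen = ranked[:budget]
    chosen_set = set(chosen)
    x = [1 if i in chosen_set else 0 for i in range(len(densities))]
    return x, chosen

# Hypothetical coverage densities for four query points, budget of two labels.
densities = [2.0, 0.0, 3.5, 1.0]
x, chosen = coverage_density_sampling(densities, budget=2)
```

The vector `x` plays the role of the binary variables $x_{i}$ in the definition, and its sum never exceeds the budget.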

The second method weights coverage density by similarity, which should protect against outliers. Cosine similarity is used as the measure of similarity between each query point and all other points; it is defined in Equation 1, where $\boldsymbol{x}$ represents all datapoints in both training and query sets.

Definition 4 (Informative Coverage Density Sampling).

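Let $c_{i}$ and $b$ be given. The informativeness of a datapoint $x_{i}$ is coverage density weighted by similarity, where $U$ is the cardinality of the unlabeled set:

$$I(x_{i}) = c_{i} \frac{1}{U} sim(\boldsymbol{x}, x_{i})$$

where similarity is cosine similarity:

$$sim(\boldsymbol{x}, x_{i}) = \sum_{x \in \boldsymbol{x}} \frac{x \cdot x_{i}}{|x| |x_{i}|} \tag{1}$$

The datapoints selected should maximize the sum of informativeness:

$$\underset{i}{\arg\max} \sum_{i} I(x_{i}) x_{i} \quad \text{s.t.} \quad \sum_{i} x_{i} \leq b$$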
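The informativeness score of Definition 4 can be sketched as below. The sketch assumes features are already numerically encoded (for example one-hot encoded categorical values, which the text does not specify) and that coverage densities are precomputed; `cosine`, `informativeness`, and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two numeric vectors; 0.0 if either has zero norm."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def informativeness(query, train, densities):
    """Coverage density times summed cosine similarity to all points, divided by |query|."""
    all_points = train + query
    scores = []
    for x_i, c_i in zip(query, densities):
        sim = sum(cosine(x, x_i) for x in all_points)
        scores.append(c_i * sim / len(query))
    return scores

# Toy one-hot style vectors; densities are hypothetical.
train = [[1, 0]]
query = [[1, 0], [0, 1]]
scores = informativeness(query, train, densities=[1.0, 3.0])
```

In this example the second query point has a higher coverage density but is similar to fewer points, so the similarity weighting narrows the gap between the two scores (1.0 versus 1.5) relative to the raw densities (1.0 versus 3.0).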

The final method involves weighting the common entropy-based formulation of uncertainty sampling by the coverage density of the data point as defined previously.

Definition 5 (Uncertainty Sampling Weighted by Coverage Density (USWCD)).

Let $c_{i}$ and $b$ be given. The informativeness of a data point $x_{i}$ is defined as follows:

$$I(x_{i}) = H(x_{i}) c_{i}$$

where $H(x_{i})$ is the entropy of the model's prediction at point $x_{i}$:

$$H(x_{i}) = -\sum_{y \in Y} p(y_{i}) \log(p(y_{i}))$$

That is, the entropy over all the classes that a specific data point may belong to. The datapoints selected should maximize the sum of informativeness:

$$\underset{i}{\arg\max} \sum_{i} I(x_{i}) x_{i} \quad \text{s.t.} \quad \sum_{i} x_{i} \leq b$$
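A minimal sketch of USWCD scoring, assuming predicted class probabilities from the model in the loop and precomputed coverage densities. The names `entropy` and `uswcd_select` and the example inputs are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uswcd_select(class_probs, densities, budget):
    """Rank query points by entropy times coverage density and keep the top `budget`."""
    scores = [entropy(p) * c for p, c in zip(class_probs, densities)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget], scores

# Hypothetical inputs: point 0 is the most uncertain but adds no new interactions.
class_probs = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
densities = [0.0, 4.0, 1.0]
chosen, scores = uswcd_select(class_probs, densities, budget=2)
```

The example illustrates the intended effect of the weighting: the most uncertain point has zero coverage density, so its score is zero and it is ranked behind points that bring unseen interactions into the labeled set.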

These three methods are reliant on the data, with USWCD being the only method which takes some model input. Data-centric methods should allow for the data to be better transferred between models; this is illustrated in the experiments.

IV Experimental Design

IV-A Data

All experiments are conducted on data sets from the UCI Machine Learning repository [32], and the benchmark methods used are the ones previously defined. Table 1 displays general information about each data set. Batch size is the number of datapoints queried at each active learning iteration; for larger data sets a batch size of 100 is used, while 25 points per batch is used for smaller data sets. All data sets, other than the Monk data set, are randomly split so that 10% is held out for testing. Of the remaining 90% of data, 2.5% is used as initial training data and 97.5% is used as the query set. The Monk data set is pre-partitioned into training and testing sets, so the training set (which is the size listed in Table 1) is split into 97.5% query points and 2.5% initial training set.

IV-B Performance Measures

F1 is used as the measure of performance for each of the classifiers, as F1 takes into account class imbalance as well as model performance, unlike model accuracy, which only looks at model performance. Also, F1 balances precision and recall, as opposed to other F-measures. F1 is calculated as

$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$

Precision and Recall are defined as follows, where tp is true positive, fp is false positive, and fn is false negative:

$$Precision = \frac{tp}{tp + fp}, \qquad Recall = \frac{tp}{tp + fn}$$
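As a worked check of the F1, precision, and recall definitions, the sketch below computes F1 from raw counts. The function name `f1_from_counts` and the counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption rather than something the text specifies.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
f1 = f1_from_counts(tp=8, fp=2, fn=4)
```

With these counts precision is 0.8 and recall is 8/12, giving an F1 of 16/22, which matches the equivalent closed form $\frac{2\,tp}{2\,tp + fp + fn}$.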

Experiments on each data set are conducted three times, each time using the same random partition of data. The average F1 of the three runs is used to determine performance.

To quantify the performance over all iterations of sampling, we take the area under the learning curve (AUC), for which the x-axis is the number of datapoints queried and the y-axis is F1. The area is determined using the trapezoidal rule as implemented in NumPy [33].
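The trapezoidal rule for the learning curve can be written out directly. The paper uses the NumPy implementation; the pure-Python function below is an equivalent sketch for illustration, and the learning-curve values are hypothetical.

```python
def trapezoid_auc(x, y):
    """Area under a piecewise-linear curve through the points (x[k], y[k])."""
    return sum(
        (x[k + 1] - x[k]) * (y[k] + y[k + 1]) / 2
        for k in range(len(x) - 1)
    )

# Hypothetical learning curve: F1 after 0, 100, 200, and 300 queried points.
queried = [0, 100, 200, 300]
f1_scores = [0.50, 0.70, 0.80, 0.85]
auc = trapezoid_auc(queried, f1_scores)
```

Because the x-axis is the raw number of queried points rather than a normalized fraction, the area grows with the size of the query set. This would be consistent with the AUC values in Table 2 being much larger for Chess and Nursery than for Monk.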

To determine the effect of the proposed methods on sampling bias, a sampling bias comparison method proposed by Krishan et al. [13] is used; it is presented in Equation 2. $H_{D_{L}}$ is the entropy of the sampled set and $H_{Balanced}$ is the entropy of a set with an equal number of datapoints from each class.

$$\text{Sampling Bias} = 1 - \frac{H_{D_{L}}}{H_{Balanced}} \tag{2}$$

$H_{D_{L}}$ is defined as $-\sum_{k=1}^{K}\frac{M_{k}}{M}\log(\frac{M_{k}}{M})$, where $M_{k}$ is the number of datapoints belonging to class $k$ and $M$ is the total number of datapoints in our sample. The average sampling bias over the three runs is used for comparison.
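The sampling bias measure can be sketched as follows. The sketch uses the fact that a set with an equal number of datapoints from each of $K$ classes has entropy $\log K$; the function names and label lists are hypothetical, and at least two classes are assumed.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy of the empirical class distribution of the sampled labels."""
    counts = Counter(labels)
    m = len(labels)
    return -sum((c / m) * math.log(c / m) for c in counts.values())

def sampling_bias(labels, num_classes):
    """1 - H(sample) / H(balanced); 0 for a balanced sample, 1 for a single-class sample."""
    h_balanced = math.log(num_classes)  # entropy of a perfectly balanced set
    return 1 - label_entropy(labels) / h_balanced

balanced_bias = sampling_bias(["a", "b", "a", "b"], num_classes=2)
skewed_bias = sampling_bias(["a", "a", "a", "b"], num_classes=2)
```

A perfectly balanced sample yields a bias of 0, and the value rises toward 1 as the sample concentrates on fewer classes, so lower values indicate less sampling bias.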

IV-C Experiment 1

The initial experiment involves sampling with respect to a certain model and using the same model to compare methods. This is the active learning scenario most frequently discussed and presented in the ‘initial labeling effort’ section of Figure 1. All data sets are sampled with respect to, trained, and tested utilizing a Random Forest Classifier with max depth constrained to five.

IV-D Experiment 2

To determine the effectiveness of the methods in sampling points that are beneficial to a model outside the learning loop, as depicted in the ‘Reuse of Labeled Data’ portion of Figure 1, another experiment is employed. The points sampled with respect to a Random Forest Classifier with max depth constrained to 5 are used to train a Decision Tree classifier and a Support Vector Machine (SVM); for the SVM, a Support Vector Classifier (SVC) implementation is used. Neither the Decision Tree nor the SVC has any hyperparameter tuning; both are used as is from Scikit-learn [34]. That is, the Decision Tree uses Gini impurity to determine quality of split and is not constrained to a max depth, and the SVC uses a radial basis function kernel and has the squared L2 norm as a regularization term. After each iteration of sampling, all models are tested and the average of the three runs is used for the results. These experiments should provide an understanding of how methods compare when sampling and training with respect to a particular model, and further will depict the effectiveness of these sampled points in transferring to different models. Both scenarios are crucial to machine learning and active learning deployment.

V Results

The learning curves for the chess and nursery datasets using Random Forest and SVM classifiers are in Figures 2 and 3; these plots are F1 score versus number of query points added. Dashed lines signify the benchmark methods while solid lines signify the methods incorporating coverage. AUC of F1 vs. query points added to the training set for each of the sampling methods is presented in Table 2. Query by Committee has the best performance in the largest number of trials, seven of 18, but USWCD is a close second with six best performances. To break down instances in which each method outperforms the others, we look at the performance on the original model as well as on the models that the data is transferred to.

We first focus on ‘Experiment 1’ as described in Section IV. The proposed methods should be competitive with the benchmark methods when sampling and testing with respect to a specific model, in our case the Random Forest model restricted to a depth of five. To examine this performance, we look at the percent difference in AUC of each method from the best performing method on each dataset; this is presented in Table 3. USWCD is the best performer once, where it has a 0.00% difference in AUC from the best method (itself). Uncertainty sampling and information density also perform best only once, but Query by Committee performs best three times. Though USWCD does not always perform the best, it achieves performance nearest the best performer in 60% of instances where it is not the best performer. So, USWCD does achieve competitive performance for the model in the active learning loop. Next, we study models outside the learning loop as described in ‘Experiment 2’.

In machine learning deployment, the model in use would likely not change unless the new model provides some benefit such as computational efficiency or performance. For this study we pay special attention to performance, and look at instances in which the use of a Decision Tree or SVM increases final model performance by 5% or more. As Lowell et al. [9] point out, active learning methods often do not outperform random sampling when the sampled data is transferred to a new model. Therefore, we treat random sampling as a baseline for data transfer comparison. Table 4 shows the percent difference in area under F1 versus queried points curve between each method and random sampling when model performance increases by 5% or more. In a majority of the presented scenarios USWCD outperforms the other methods. This is also the only method which does not perform worse than random sampling in any instance of model improvement. So, the proposed method is effective in sampling data which is transferable to new and more effective models.
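The percent differences reported against random sampling can be reproduced from the AUC values in Table 2. The sketch below assumes the difference is computed as (method minus baseline) divided by baseline; the function name `percent_difference` is hypothetical, and the numbers are the Balance Scale SVM entries from Table 2.

```python
def percent_difference(value, reference):
    """Signed percent difference of `value` relative to `reference`."""
    return (value - reference) / reference * 100

# AUC values for Balance Scale with the SVM, copied from Table 2.
random_sampling_auc = 488.30
uncertainty_auc = 477.82
uswcd_auc = 493.54

uncertainty_diff = round(percent_difference(uncertainty_auc, random_sampling_auc), 2)
uswcd_diff = round(percent_difference(uswcd_auc, random_sampling_auc), 2)
```

Under this assumption the results are -2.15 for uncertainty sampling and 1.07 for USWCD, matching the Balance Scale entries in Table 4.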

Bibliography

[1] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, “Data lifecycle challenges in production machine learning: a survey,” ACM SIGMOD Record, vol. 47, no. 2, pp. 17–28, 2018.
[2] B. Settles, “Active learning literature survey,” Science, vol. 10, no. 3, pp. 237–304, 1995.
[3] L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen, “Suggestive annotation: A deep active learning framework for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 399–407.
[4] B. Settles and M. Craven, “An analysis of active learning strategies for sequence labeling tasks,” in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 2008, pp. 1070–1079.
[5] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu, “Batch mode active learning and its application to medical image classification,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 417–424.
[6] P. Zhao and S. C. Hoi, “Cost-sensitive online active learning with application to malicious url detection,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 919–927.
[7] N. Nissim, A. Cohen, and Y. Elovici, “Boosting the detection of malicious documents using designated active learning methods,” in 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA). IEEE, 2015, pp. 760–765.
[8] S. K. Dasari, A. Cheddad, L. Lundberg, and J. Palmquist, “Active learning to support in-situ process monitoring in additive manufacturing,” in 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2021, pp. 1168–1173.
