MOBA: A multi-objective bounded-abstention model for two-class cost-sensitive problems

Hongjiao Guan

arXiv:1905.07297·cs.LG·May 20, 2019

TL;DR

The paper introduces MOBA, a multi-objective abstention model for two-class cost-sensitive problems that optimizes multiple metrics without relying on explicit cost information, using evolutionary algorithms to generate Pareto-optimal solutions.

Contribution

It proposes a novel multi-objective bounded-abstention model that does not require cost estimation and balances multiple performance metrics using evolutionary optimization.

Findings

1. MOBA achieves lower expected costs than state-of-the-art models.
2. It provides better trade-offs between performance and abstention.
3. The model is robust to variations in cost information and performance demands.

Tables

Table 1: Confusion matrix with rejection in binary classification problems

Real label	Predicted +	Predicted -	Predicted R
+	TP	FN	RP
-	FP	TN	RN

Table 2: Characteristics of datasets used in this study

Dataset	Instances	Positive	Negative	Attributes
ionosphere	351	126	225	34
pima	768	268	500	8
credit-g	1000	300	700	20
ecoli3	336	35	301	7
hepatitis	155	32	123	19
haberman	306	81	225	3
cmc	1473	333	1140	9
transfusion	748	178	570	4
Australian	690	307	383	14

Table 3: Cost models

Cost model	CTP/N	CFP	CFN	CRP	CRN
CM1	U[-10,0]	U[0,50]	U[0,50]	1	1
CM2	U[-10,0]	U[0,100]	U[0,50]	1	1
CM3	U[-10,0]	U[0,50]	U[0,100]	1	1
CM4	U[-10,0]	U[0,50]	U[0,50]	U[0,30]	U[0,30]

Table 4: Results of the Wilcoxon rank sum test for comparing MOBA and Tortorella's model

Dataset	CM1	CM2	CM3	CM4	Dataset	CM1	CM2	CM3	CM4
ionosphere	744	749	781	471	haberman	466	393	523	484
	158	203	182	56		436	559	440	43
	98	48	37	473		98	48	37	473
pima	459	597	524	412	cmc	93	45	135	338
	443	350	439	115		65	23	67	57
	98	53	37	473		842	932	798	605
credit-g	489	516	480	444	transfusion	621	619	529	484
	413	436	483	83		281	333	434	43
	98	48	37	473		98	48	37	473
ecoli3	738	531	522	438	Australian	659	711	664	473
	164	416	441	89		243	241	299	54
	98	53	37	473		98	48	37	473
hepatitis	510	520	514	435
	392	432	449	92
	98	48	37	473

Equations

$$C_{t_{1},t_{2}}(m) = \begin{cases} +, & \text{if } s(m) > t_{2}; \\ -, & \text{if } s(m) \leq t_{1}; \\ R, & \text{otherwise.} \end{cases} \qquad (1)$$

$$\min_{t_{1},t_{2}} \; cost(t_{1},t_{2}), \qquad (2)$$

$$\begin{aligned} cost(t_{1},t_{2}) = {} & p(+)\cdot CFN\cdot fnr(t_{1}) + p(-)\cdot CTN\cdot tnr(t_{1}) \\ & + p(+)\cdot CTP\cdot tpr(t_{2}) + p(-)\cdot CFP\cdot fpr(t_{2}) \\ & + p(+)\cdot CRP\cdot rpr(t_{1},t_{2}) + p(-)\cdot CRN\cdot rnr(t_{1},t_{2}), \end{aligned} \qquad (3)$$

$$\min_{t_{1},t_{2}} \; \frac{CFN\cdot FN(t_{1}) + CFP\cdot FP(t_{2})}{TN(t_{1}) + FP(t_{2}) + TP(t_{2}) + FN(t_{1})}, \quad \text{s.t. } rej(t_{1},t_{2}) \leq k_{max}, \qquad (4)$$

$$\min_{t_{1},t_{2}} \; \bm{F}(\bm{t}) = (F_{1}(\bm{t}), F_{2}(\bm{t})) = (fpr(t_{2}), fnr(t_{1})), \qquad (5)$$

$$\text{s.t. } \begin{cases} rpr(\bm{t}) \leq p_{max}, \\ rnr(\bm{t}) \leq n_{max}, \\ t_{1} < t_{2}, \end{cases} \qquad (6)$$

$$T = \{\bm{t} \in \mathbb{R}^{2} \mid rpr(\bm{t}) \leq p_{max},\; rnr(\bm{t}) \leq n_{max},\; t_{1} < t_{2}\}, \qquad (7)$$

$$\forall i \in \{1,2\},\; F_{i}(\bm{a}) \leq F_{i}(\bm{b}) \;\land\; \exists j \in \{1,2\},\; F_{j}(\bm{a}) < F_{j}(\bm{b}). \qquad (8)$$

$$POS = \{\bm{t} \in T \mid \neg\exists\, \bm{t}' \in T,\; \bm{t}' \prec \bm{t}\}. \qquad (9)$$

$$POF = \{\bm{F}(\bm{t}) \in [0,1]^{2} \mid \bm{t} \in POS\}. \qquad (10)$$

$$d_{1} = \frac{F_{1}^{k+1} - F_{1}^{k-1}}{F_{1}^{max} - F_{1}^{min}}, \qquad (11)$$

$$y_{1,m} = \frac{1}{2}\left[(1-\beta_{m})\,x_{1,m} + (1+\beta_{m})\,x_{2,m}\right], \qquad (12)$$

$$y_{2,m} = \frac{1}{2}\left[(1+\beta_{m})\,x_{1,m} + (1-\beta_{m})\,x_{2,m}\right], \qquad (13)$$

$$p(\beta) = \begin{cases} \frac{1}{2}(\eta_{c}+1)\,\beta^{\eta_{c}}, & \text{if } 0 \leq \beta \leq 1; \\ \frac{1}{2}(\eta_{c}+1)\,\dfrac{1}{\beta^{\eta_{c}+2}}, & \text{if } \beta > 1 \end{cases} \qquad (14)$$

$$\beta(u) = \begin{cases} (2u)^{\frac{1}{\eta_{c}+1}}, & \text{if } u \leq 0.5; \\ \dfrac{1}{(2-2u)^{\frac{1}{\eta_{c}+1}}}, & \text{if } u > 0.5. \end{cases} \qquad (15)$$

$$y_{m} = x_{m} + (x_{m}^{u} - x_{m}^{l})\,\delta_{m}, \qquad (16)$$

$$p(\delta) = \frac{1}{2}(\eta_{m}+1)\,(1-|\delta|)^{\eta_{m}}, \qquad (17)$$

$$\delta(u) = \begin{cases} (2u)^{\frac{1}{\eta_{m}+1}} - 1, & \text{if } u < 0.5; \\ 1 - (2-2u)^{\frac{1}{\eta_{m}+1}}, & \text{if } u \geq 0.5 \end{cases} \qquad (18)$$

$$\frac{CTN - CRN}{CFN - CRP} > \frac{CFP - CRN}{CTP - CRP} \qquad (19)$$

Taxonomy

Topics: Imbalanced Data Classification Techniques · Machine Learning and Data Classification · Data Stream Mining Techniques

Full text

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China

Abstract

Abstaining classifiers have been widely used in cost-sensitive applications to avoid ambiguous classification and reduce the cost of misclassification. Previous abstaining classification models rely on cost information, such as a cost matrix or cost ratio. However, it is difficult to obtain or estimate costs in practical applications. Furthermore, these abstention models are typically restricted to a single optimization metric, which may not be the expected indicator when evaluating classification performance. To overcome such problems, a multi-objective bounded-abstention (MOBA) model is proposed to optimize essential metrics. Specifically, the MOBA model minimizes the error rate of each class under class-dependent abstention constraints. The MOBA model is then solved using the non-dominated sorting genetic algorithm II, which is a popular evolutionary multi-objective optimization algorithm. A set of Pareto-optimal solutions is generated, and the best one can be selected according to the given conditions (whether costs are known) or performance demands (e.g., obtaining a high accuracy or F-measure). Hence, the MOBA model is robust to variations in the conditions and requirements. Compared to state-of-the-art abstention models, MOBA achieves lower expected costs when cost information is considered, and better performance-abstention trade-offs when it is not.

Keywords: Abstaining classification, Cost-sensitive problems, Multi-objective optimization (MOO), Evolutionary algorithm (EA)

1 Introduction

Forcing the classification of uncertain instances in safety-critical applications can lead to misclassification, which can result in economic losses or increased costs. In contrast, the cost of possible errors can be reduced by abstaining from ambiguous classification, which has been used in cost-sensitive fields [1, 2, 3, 4].

The general classification rule with reject option in binary classification is shown in Eq. (1) and can be explained as follows. If the confidence score $s$ belonging to the positive class is larger than $t_{2}$ , then example $m$ is classified as positive (+); if $s$ is not larger than $t_{1}$ , then $m$ is classified as negative (-); otherwise, the example is not labeled (rejected, R).

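$$C_{t_{1},t_{2}}(m) = \begin{cases} +, & \text{if } s(m) > t_{2}; \\ -, & \text{if } s(m) \leq t_{1}; \\ R, & \text{otherwise.} \end{cases} \qquad (1)$$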

When $t_{1}=t_{2}$ , Eq. (1) reverts to the traditional binary classification rule. The rejection thresholds $t_{1}$ and $t_{2}$ define the decision boundaries and the task of training the abstention classifier lies in determining the two rejection thresholds, which can be enforced by establishing different abstention models. Note that only binary classification is discussed in this paper.
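The rule in Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the function name `classify_with_reject` and the example thresholds are assumptions, not part of the paper.

```python
def classify_with_reject(score, t1, t2):
    """Apply the reject rule of Eq. (1) to one confidence score.

    Returns "+" (positive), "-" (negative), or "R" (rejected).
    """
    if t1 > t2:
        raise ValueError("expected t1 <= t2")
    if score > t2:
        return "+"
    if score <= t1:
        return "-"
    return "R"


# Hypothetical scores and thresholds, chosen only for illustration.
labels = [classify_with_reject(s, 0.3, 0.7) for s in (0.1, 0.3, 0.5, 0.7, 0.9)]
```

With equal thresholds the reject region is empty, so every score is labeled "+" or "-", which matches the remark that Eq. (1) then reverts to the traditional rule.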

In the context of abstaining classification, the ideal situation is to establish a cost minimization model, which requires the costs of correct classification, misclassification, and rejection to be known. However, it is often difficult to obtain or estimate cost information in many real-world classification problems. For instance, in the diagnosis of normal and cancerous tissues, the imaging characteristics of some tissues are ambiguous, in which case it is hard to make a definitive diagnosis. Both misdiagnosis and a missed diagnosis can lead to physical and mental pain, and their costs are usually unknown and unequal. Existing abstention models fall mainly into two groups: unconditional optimization of the expected cost, and conditional optimization under rejection or performance constraints.

Chow [5, 6] expanded classical Bayesian decision theory by proposing a generalized decision theory with reject option. The well-known Bayesian decision rules are based on a minimum error rate or a minimum risk function. For example, minimum risk Bayesian classification only considers class-dependent misclassification costs. In contrast, once the reject option is added, the generalized Bayesian decision theory minimizes the expected risk based on the rejection costs as well as the costs of correct and incorrect classification. Tortorella [7, 8] proposed the following implementable abstention model based on the generalized theory:

$$\min_{t_{1},t_{2}} \; cost(t_{1},t_{2}), \qquad (2)$$

where

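$$\begin{aligned} cost(t_{1},t_{2}) = {} & p(+)\cdot CFN\cdot fnr(t_{1}) + p(-)\cdot CTN\cdot tnr(t_{1}) \\ & + p(+)\cdot CTP\cdot tpr(t_{2}) + p(-)\cdot CFP\cdot fpr(t_{2}) \\ & + p(+)\cdot CRP\cdot rpr(t_{1},t_{2}) + p(-)\cdot CRN\cdot rnr(t_{1},t_{2}), \end{aligned} \qquad (3)$$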

where $p(+)$ and $p(-)$ are the prior probabilities of the positive and negative classes, respectively, CFN represents the cost of a false negative, and $fnr$ denotes the ratio of false negative examples among all positive examples. Likewise for CTN, CTP, CFP, $tnr$ , $tpr$ , and $fpr$ . CRP and CRN indicate the rejection costs for positive and negative classes, respectively, and $rpr$ and $rnr$ are the ratios of rejected examples with respect to the positive and negative classes, respectively. This abstention model requires the complete cost information including the costs of correct classification, misclassification, and rejection to be known.
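As a minimal sketch, the expected cost can be computed directly from the six rates and six costs defined above. The dictionary layout, the function name `expected_cost`, and all numeric values are assumptions made for illustration.

```python
def expected_cost(p_pos, rates, costs):
    """Expected cost of an abstaining classifier from class-wise rates.

    rates: dict with tpr, fnr, rpr (fractions of all positives) and
           tnr, fpr, rnr (fractions of all negatives).
    costs: dict with CTP, CFN, CRP, CTN, CFP, CRN.
    """
    p_neg = 1.0 - p_pos
    positive_part = (costs["CFN"] * rates["fnr"]
                     + costs["CTP"] * rates["tpr"]
                     + costs["CRP"] * rates["rpr"])
    negative_part = (costs["CTN"] * rates["tnr"]
                     + costs["CFP"] * rates["fpr"]
                     + costs["CRN"] * rates["rnr"])
    return p_pos * positive_part + p_neg * negative_part


# Hypothetical rates and costs, not taken from the paper's experiments.
example_rates = {"tpr": 0.7, "fnr": 0.1, "rpr": 0.2,
                 "tnr": 0.8, "fpr": 0.1, "rnr": 0.1}
example_costs = {"CTP": 0.0, "CTN": 0.0, "CFP": 10.0,
                 "CFN": 20.0, "CRP": 1.0, "CRN": 1.0}
cost = expected_cost(0.4, example_rates, example_costs)
```

For these made-up inputs the positive class contributes 0.4 x 2.2 and the negative class 0.6 x 1.1, giving a total of 1.54.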

Pietraszek [9] proposed a bounded-abstention (BA) model that adds an abstention constraint and only requires knowledge of the misclassification costs. The BA model can be represented as:

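$$\min_{t_{1},t_{2}} \; \frac{CFN\cdot FN(t_{1}) + CFP\cdot FP(t_{2})}{TN(t_{1}) + FP(t_{2}) + TP(t_{2}) + FN(t_{1})}, \quad \text{s.t. } rej(t_{1},t_{2}) \leq k_{max}, \qquad (4)$$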

where $FN$ ( $FP$ ) refers to the number of false negatives (positives); $TN$ ( $TP$ ) represents the number of true negatives (positives); and $rej$ denotes the overall reject rate, i.e., the number of rejected examples divided by the sample size. It is useful to define a cost ratio of CFN to CFP. When the values of CFN and CFP are the same, i.e., the cost ratio is 1, model (4) minimizes the error rate under an abstention constraint. In [10], the receiver operating characteristic (ROC) isometric model is employed to minimize the reject rate under class-dependent performance constraints, which take the class distribution and misclassification cost ratios into account. Although the BA and ROC isometric models avoid setting complete costs (CTP/N, CFP/N, and CRP/N), misclassification costs or their cost ratio are still required.

The abstention models mentioned above have the following shortcomings:

• They require cost information to be known. However, in practical problems: (a) such costs are hard to obtain or estimate; (b) they are usually dependent on classes and are asymmetric; and (c) in some cases, the cost information evolves over time, which prevents use of the trained model during the test phase.

• They only optimize a single performance metric, such as the expected cost, error rate, precision, or F-measure. The optimized metric may not be useful when practitioners evaluate the classification performance.

To overcome these drawbacks, a multi-objective bounded-abstention (MOBA) model is proposed that minimizes two essential metrics, namely, the false positive rate and false negative rate, under class-dependent abstention constraints. The MOBA model optimizes essential metrics, via which any simple or complicated metric can be calculated as long as the sample sizes of two classes are fixed. In addition, the MOBA model is applicable regardless of whether the costs are known or unknown. A popular evolutionary algorithm called the non-dominated sorting genetic algorithm II (NSGA-II) [11] is used to solve the multi-objective optimization (MOO) problem with constraints.

The remainder of the paper is organized as follows. Section 2 introduces the motivation and optimization method of the proposed MOBA model, discusses the methods of selecting the best abstaining classifier under different conditions, and summarizes the advantages of the MOBA model. The results of two experiments, one with costs and one without, are presented in Section 3, and the conclusions drawn from this research are summarized in Section 4.

2 Proposed MOBA model

In this section, the motivation for the MOBA model is first presented in Section 2.1, and several basic concepts related to MOO problems are then provided in Section 2.2. Next, the NSGA-II algorithm is introduced along with the implementation details of solving the MOBA model (Section 2.3). Selection of the best abstaining classifier and the advantages of the MOBA model are explained in Sections 2.4 and 2.5.

2.1 Motivation

Regardless of whether the expected cost, error rate, or F-measure is to be optimized, previous abstention models have always required certain essential metrics to be provided, such as the true positive/negative rate ( $tpr/tnr$ ), false positive/negative rate ( $fpr/fnr$ ), positive/negative predictive value ( $ppv/npv$ ), and rejected positive/negative rate ( $rpr/rnr$ ). For example, minimizing the expected cost in Eq. (3) requires six essential metrics ( $tpr/tnr$ , $fpr/fnr$ , and $rpr/rnr$ ), whereas maximizing the F-measure requires $ppv$ and $tpr$ to be known. All the essential metrics can be calculated from the confusion matrix with rejection (Table 1). For example, $rpr=RP\,/\,(TP+FN+RP)$ , i.e., the number of rejected positive examples (RP) divided by the number of all positive examples (TP+FN+RP).

In view of this, optimizing the essential metrics is a natural and reasonable idea. When the sample sizes of the two classes are fixed, the confusion matrix with rejection has four degrees of freedom: its six entries are subject to two row-sum constraints, one per class. Therefore, to obtain a definite rejection classifier, four essential metrics should be determined. The MOBA model optimizes the false positive and negative rates under the constraints of the rejected positive and negative rates. A formal description of the MOBA model is as follows:

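$$\min_{t_{1},t_{2}} \; \bm{F}(\bm{t}) = (F_{1}(\bm{t}), F_{2}(\bm{t})) = (fpr(t_{2}), fnr(t_{1})), \qquad (5)$$

$$\text{s.t. } \begin{cases} rpr(\bm{t}) \leq p_{max}, \\ rnr(\bm{t}) \leq n_{max}, \\ t_{1} < t_{2}, \end{cases} \qquad (6)$$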

where $\bm{t}=(t_{1},t_{2})$ is the decision vector, $\bm{F(t)}$ represents the objective function vector, which contains two objective functions $fpr$ (related to $t_{2}$ ) and $fnr$ (related to $t_{1}$ ), and $t_{1}$ and $t_{2}$ ( $t_{1}<t_{2}$ ) denote the rejection thresholds. Note that the definitions of $fpr$ and $fnr$ here differ from those in Eq. (3). Here, $fpr$ ( $fnr$ ) denotes the ratio of false positive (negative) examples among the classified, i.e., non-rejected, negative (positive) examples. After the two rejection thresholds have been determined, $fpr=FP\,/\,(TN+FP)$ and $fnr=FN\,/\,(TP+FN)$ can be calculated. The feasible solution set $T\subseteq\mathbb{R}^{2}$ is the set of decision vectors that satisfy the constraints:

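$$T = \{\bm{t} \in \mathbb{R}^{2} \mid rpr(\bm{t}) \leq p_{max},\; rnr(\bm{t}) \leq n_{max},\; t_{1} < t_{2}\}, \qquad (7)$$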

where $p_{max}$ and $n_{max}$ are the hyperparameters. Note that maximizing $tpr$ and $tnr$ is equivalent to minimizing $fpr$ and $fnr$ in Eq. (5).

2.2 Concepts associated with MOO problems

Since MOO problems involve multiple conflicting objectives, the total order used to compare solutions in single-objective optimization is not applicable. For a given decision vector, some objectives are optimal whereas others are not, and optimizing the suboptimal objectives may degrade the optimal objectives. A partial order, Pareto dominance, is therefore used to compare decision vectors in MOO problems. A decision vector $\bm{a}\in T$ is said to Pareto dominate another decision vector $\bm{b}\in T$ , denoted as $\bm{a}\prec\bm{b}$ , if and only if (iff):

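$$\forall i \in \{1,2\},\; F_{i}(\bm{a}) \leq F_{i}(\bm{b}) \;\land\; \exists j \in \{1,2\},\; F_{j}(\bm{a}) < F_{j}(\bm{b}). \qquad (8)$$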

A decision vector $\bm{t}$ is non-dominated with regard to $T$ iff there is no decision vector in $T$ that dominates $\bm{t}$ . Such non-dominated solutions are referred to as Pareto-optimal. The set of Pareto-optimal solutions related to $T$ is referred to as the Pareto-optimal set (POS):

$$POS = \{\bm{t} \in T \mid \neg\exists\, \bm{t}' \in T,\; \bm{t}' \prec \bm{t}\}. \qquad (9)$$

The set of objective vectors corresponding to the POS is referred to as the Pareto-optimal front (POF):

$$POF = \{\bm{F}(\bm{t}) \in [0,1]^{2} \mid \bm{t} \in POS\}. \qquad (10)$$
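The dominance relation and the resulting non-dominated set can be illustrated with a short Python sketch for minimization problems. The function names and the sample objective vectors are assumptions; the quadratic-time filter is chosen for clarity and is not the fast sorting procedure of NSGA-II.

```python
def dominates(fa, fb):
    """True if objective vector fa Pareto-dominates fb (minimization)."""
    no_worse = all(x <= y for x, y in zip(fa, fb))
    strictly_better = any(x < y for x, y in zip(fa, fb))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the objective vectors that no other vector dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (fpr, fnr) pairs.
points = [(0.10, 0.30), (0.20, 0.20), (0.30, 0.10), (0.25, 0.25)]
front = pareto_front(points)
```

Here (0.25, 0.25) is dominated by (0.20, 0.20), while the other three vectors are mutually non-dominated because each is better in one objective and worse in the other.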

Evolutionary algorithms (EAs) based on Pareto dominance exhibit excellent performance when solving MOO problems with few (two or three) objectives [12]. EAs search the set of Pareto-optimal solutions in parallel in a single run. Popular EAs include the NSGA-II [11], strength Pareto evolutionary algorithm 2 (SPEA2) [13], and Pareto envelope based selection algorithm II (PESA-II) [14]. The popular NSGA-II algorithm was adopted in this study to solve the proposed MOBA model.

2.3 Evolutionary MOO of MOBA

In this section, the NSGA-II algorithm is introduced along with the details required to optimize the proposed MOBA model. NSGA-II improves on the previous NSGA [15] by developing a fast non-dominated sorting and elitism strategy. The basic framework of NSGA-II is presented in Algorithm 1.

Pop-initialization generates the initial population, which includes $popsize$ individuals (chromosomes). Each decision vector $\bm{t}$ denotes an individual or a chromosome. In the MOBA problem, the two variables in each decision vector, i.e., the two rejection thresholds $t_{1}$ and $t_{2}$ , are encoded with real values. Specifically, the scores $s$ of the training examples are first determined via a scoring classifier. Traditional classification methods, such as support vector machine and k-nearest neighbor, can be used as the scoring classifier [16]. Let $s_{min}$ and $s_{max}$ denote the minimum and maximum scores of the training examples, respectively. Then, $t_{1}$ and $t_{2}$ are randomly generated in the range [ $s_{min},s_{max}$ ] subject to $t_{1}<t_{2}$ . In this study, $popsize$ and $gensize$ are set to 20 and 100, respectively.

Non-dominated-sort sorts the individuals in population $P^{t}$ based on non-domination and outputs the front set ${\mathcal{F}}^{t}=\{{\mathcal{F}}_{1}^{t},{\mathcal{F}}_{2}^{t},\cdots,{\mathcal{F}}_{n}^{t}\}$ . The individuals in each front ${\mathcal{F}}_{i}^{t}$ ( $i\in\{1,2,\cdots,n\}$ ) are non-dominated while the individuals belonging to ${\mathcal{F}}_{i}^{t}$ are dominated by the individuals in front ${\mathcal{F}}_{j}^{t}$ ( $j<i,j\in\{1,2,\cdots,n\}$ ) in the $t^{th}$ generation. The front sets are depicted in Figure 1, where $F_{1}$ and $F_{2}$ represent the values of two objective functions. For the detailed sort algorithm, please refer to [11]. The objective values $fpr$ and $fnr$ are computed using the validation set. Specifically, all examples are divided into three parts consisting of the training, validation, and test sets. The training set is employed to construct a scoring classifier, and the scores of the examples in the validation set are computed using the scoring classifier. Given $t_{1}$ and $t_{2}$ (corresponding to the variables in each individual), Eq. (1) can be applied to the validation examples, thereby allowing the basic metrics in Table 1 to be calculated. If the constraints (6) are not satisfied, $fpr$ and $fnr$ are assigned the maximum value of one, in which case the corresponding individual is not considered.
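The evaluation of one individual on the validation set can be sketched as follows. The function name, the 0/1 label encoding, and the toy scores are assumptions. Treating a class whose classified examples are all rejected as infeasible is also an assumption, added only to avoid division by zero.

```python
from collections import Counter


def evaluate_thresholds(scores, labels, t1, t2, p_max, n_max):
    """Return the objective pair (fpr, fnr) for thresholds (t1, t2).

    labels use 1 for positive and 0 for negative. Infeasible thresholds
    receive the worst value (1.0, 1.0), as described in the text.
    """
    counts = Counter()
    for s, y in zip(scores, labels):
        pred = "+" if s > t2 else ("-" if s <= t1 else "R")  # rule of Eq. (1)
        counts[(y, pred)] += 1
    tp, fn, rp = counts[(1, "+")], counts[(1, "-")], counts[(1, "R")]
    fp, tn, rn = counts[(0, "+")], counts[(0, "-")], counts[(0, "R")]
    n_pos, n_neg = tp + fn + rp, fp + tn + rn
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    feasible = t1 < t2 and rp / n_pos <= p_max and rn / n_neg <= n_max
    # Assumption: a class with no classified examples is treated as infeasible.
    if not feasible or tp + fn == 0 or fp + tn == 0:
        return (1.0, 1.0)
    return (fp / (tn + fp), fn / (tp + fn))


# Hypothetical validation scores: four negatives followed by four positives.
val_scores = [0.05, 0.20, 0.45, 0.80, 0.30, 0.55, 0.75, 0.95]
val_labels = [0, 0, 0, 0, 1, 1, 1, 1]
feasible_obj = evaluate_thresholds(val_scores, val_labels, 0.4, 0.6, p_max=0.3, n_max=0.3)
infeasible_obj = evaluate_thresholds(val_scores, val_labels, 0.4, 0.6, p_max=0.2, n_max=0.3)
```

In the toy data one example of each class falls in the reject region, so both reject rates are 0.25. The first call is feasible and yields one error among three classified examples per class; the second violates the positive-class bound and receives the penalty value.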

Crowding-distance-assignment computes the crowding distances of individuals in each front and outputs the spread set ${\mathcal{S}}^{t}=\{{\mathcal{S}}_{1}^{t},{\mathcal{S}}_{2}^{t},\cdots,{\mathcal{S}}_{n}^{t}\}$ , where ${\mathcal{S}}_{i}^{t}$ ( $i\in\{1,2,\cdots,n\}$ ) stores the crowding distances of the individuals in front ${\mathcal{F}}_{i}^{t}$ in the $t^{th}$ generation. Note that since the crowding distance is assigned within each front, it is meaningless to compare the crowding distances of two individuals from different fronts. The essential idea of calculating the crowding distance is to sort all individuals in the same front based on each objective function and to average the Euclidean distances between the nearest neighbors in each dimension of the objective function space. An illustration of the crowding distance is shown in Figure 2. For objective function $F_{1}$ , the distance $d_{1}$ of the $k^{th}$ individual is:

$$d_{1} = \frac{F_{1}^{k+1} - F_{1}^{k-1}}{F_{1}^{max} - F_{1}^{min}}, \qquad (11)$$

where $F_{1}^{k+1}$ and $F_{1}^{k-1}$ are the objective values of the $(k+1)^{th}$ and $(k-1)^{th}$ individuals in dimension $F_{1}$ , respectively, and $F_{1}^{max}$ and $F_{1}^{min}$ are the maximum and minimum values in the front, respectively. A similar approach can be used to obtain the distance of the $k^{th}$ individual in other dimensions of the objective function space. Finally, the crowding distance of the $k^{th}$ individual is the sum of the distances in all dimensions. Note that the crowding distance of each boundary individual is set to infinity. In each ${\mathcal{S}}_{i}^{t}$ , the crowding distances are sorted in descending order.
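A minimal Python sketch of the crowding-distance computation for one front is given below. The function name and the sample front are assumptions, and skipping a dimension in which all values coincide is an added safeguard that the text does not discuss.

```python
import math


def crowding_distances(front):
    """Crowding distance of each objective vector in one front.

    Distances are returned in the same order as the input vectors.
    """
    n = len(front)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        f_min, f_max = front[order[0]][m], front[order[-1]][m]
        # Boundary individuals get an infinite distance.
        dist[order[0]] = dist[order[-1]] = math.inf
        if f_max == f_min:
            continue  # assumption: a flat dimension contributes nothing
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (f_max - f_min)
    return dist


# Hypothetical front of (F1, F2) values.
sample_front = [(0.0, 1.0), (0.2, 0.6), (0.5, 0.3), (1.0, 0.0)]
sample_dist = crowding_distances(sample_front)
```

For this front the two interior points receive 0.5 + 0.7 = 1.2 and 0.8 + 0.6 = 1.4, and the two boundary points receive infinity.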

Tournament-selection uses a binary tournament strategy to select $popsize$ individuals from the population $P^{t}$ as follows. First, two individuals are randomly selected from $P^{t}$ . Then, one of them is chosen based on two criteria, namely, the front rank and crowding distance. If the two individuals are in different fronts, the individual with the lower front rank is selected. However, if the front ranks of the two individuals are the same, then the individual with the larger crowding distance is selected to maintain solution diversity. The binary tournament selection process is performed $popsize$ times to obtain $popsize$ individuals, which then constitute the $parent$ population.
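The binary tournament can be sketched as below. The function names are hypothetical, the two contestants are drawn without replacement (the text only says they are chosen randomly), and ties in crowding distance are resolved in favor of the first contestant; both choices are assumptions.

```python
import random


def binary_tournament(ranks, crowding, rng):
    """Return the index of the winner among two randomly drawn individuals."""
    a, b = rng.sample(range(len(ranks)), 2)
    if ranks[a] != ranks[b]:
        return a if ranks[a] < ranks[b] else b  # lower front rank wins
    return a if crowding[a] >= crowding[b] else b  # larger distance wins


def select_parents(ranks, crowding, rng):
    """Run one tournament per population slot and return the winners' indices."""
    return [binary_tournament(ranks, crowding, rng) for _ in range(len(ranks))]


# Hypothetical front ranks and crowding distances for four individuals.
rng = random.Random(0)
parents = select_parents([1, 1, 2, 2], [float("inf"), 0.5, float("inf"), 0.3], rng)
```

Because each tournament returns an index into the current population, the same individual can appear several times in the parent population.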

Genetic-operation performs the simulated binary crossover (SBX) [17] and polynomial mutation [18] operations in the real value coded evolutionary MOO algorithm. Specifically, in SBX, two children chromosomes are generated as follows [19]:

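$$y_{1,m} = \frac{1}{2}\left[(1-\beta_{m})\,x_{1,m} + (1+\beta_{m})\,x_{2,m}\right], \qquad (12)$$

$$y_{2,m} = \frac{1}{2}\left[(1+\beta_{m})\,x_{1,m} + (1-\beta_{m})\,x_{2,m}\right], \qquad (13)$$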

where $y_{1,m}$ and $y_{2,m}$ ( $m\in\{1,2\}$ ) are the $m^{th}$ variables in the two children $y_{1}$ and $y_{2}$ , respectively; $x_{1,m}$ and $x_{2,m}$ are the $m^{th}$ variables of the randomly selected parents, respectively, and $\beta_{m}(\geq 0)$ is a sample from a random number generator having the density:

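$$p(\beta) = \begin{cases} \frac{1}{2}(\eta_{c}+1)\,\beta^{\eta_{c}}, & \text{if } 0 \leq \beta \leq 1; \\ \frac{1}{2}(\eta_{c}+1)\,\dfrac{1}{\beta^{\eta_{c}+2}}, & \text{if } \beta > 1 \end{cases} \qquad (14)$$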

where $\eta_{c}$ is the distribution index of the crossover. The distribution can be obtained from a uniformly sampled random number $u$ between (0,1):

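$$\beta(u) = \begin{cases} (2u)^{\frac{1}{\eta_{c}+1}}, & \text{if } u \leq 0.5; \\ \dfrac{1}{(2-2u)^{\frac{1}{\eta_{c}+1}}}, & \text{if } u > 0.5. \end{cases} \qquad (15)$$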

When generating children individuals using the SBX operation, a random number $u$ between (0,1) is obtained, $\beta$ is computed as per Eq. (15), and the child variables are obtained via Eqs. (12) and (13). After two child chromosomes are obtained, the constraint $t_{1}<t_{2}$ is checked. If the constraint is not satisfied, a new random number between (0,1) is generated and the computation process is repeated.
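The SBX step with the resampling loop for $t_{1}<t_{2}$ can be sketched as follows. The function names are hypothetical. Drawing a separate $u$ for each variable, capping the number of retries, and falling back to the parents when the cap is reached are assumptions that the text does not specify.

```python
import random


def sbx_beta(u, eta_c):
    """Spread factor beta obtained from a uniform number u in [0, 1)."""
    if u <= 0.5:
        return (2.0 * u) ** (1.0 / (eta_c + 1.0))
    return 1.0 / (2.0 - 2.0 * u) ** (1.0 / (eta_c + 1.0))


def sbx_pair(x1, x2, eta_c, rng, max_tries=1000):
    """Create two children (t1, t2) from two parents, retrying until t1 < t2."""
    for _ in range(max_tries):
        y1, y2 = [], []
        for a, b in zip(x1, x2):
            beta = sbx_beta(rng.random(), eta_c)
            y1.append(0.5 * ((1.0 - beta) * a + (1.0 + beta) * b))
            y2.append(0.5 * ((1.0 + beta) * a + (1.0 - beta) * b))
        if y1[0] < y1[1] and y2[0] < y2[1]:
            return tuple(y1), tuple(y2)
    return tuple(x1), tuple(x2)  # assumption: keep the parents if no valid pair is found


# Hypothetical parents; eta_c = 20 as in the experimental settings.
child1, child2 = sbx_pair((0.2, 0.6), (0.3, 0.8), 20, random.Random(1))
```

A useful sanity check is that, for every variable, the two children sum to the same value as the two parents, since the $\beta_{m}$ terms cancel.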

In polynomial mutation, a child chromosome is generated as follows:

$$y_{m} = x_{m} + (x_{m}^{u} - x_{m}^{l})\,\delta_{m}, \qquad (16)$$

where $y_{m}$ ( $m\in\{1,2\}$ ) is the $m^{th}$ variable in the child chromosome $y$ , $x_{m}$ is the $m^{th}$ variable of the parent chromosome $x$ , and $x_{m}^{u}$ and $x_{m}^{l}$ are the upper and lower bounds, respectively. In the MOBA model, $x_{m}^{u}=1$ and $x_{m}^{l}=0$ , and $\delta_{m}$ follows the polynomial probability distribution [18]:

$$p(\delta) = \frac{1}{2}(\eta_{m}+1)\,(1-|\delta|)^{\eta_{m}}, \qquad (17)$$

where $\eta_{m}$ is the distribution index for mutation and $\delta$ can be calculated as:

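$$\delta(u) = \begin{cases} (2u)^{\frac{1}{\eta_{m}+1}} - 1, & \text{if } u < 0.5; \\ 1 - (2-2u)^{\frac{1}{\eta_{m}+1}}, & \text{if } u \geq 0.5 \end{cases} \qquad (18)$$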

where $u$ is a random number between (0,1). Similarly, to obtain a child chromosome, $u$ is generated randomly between (0,1) and $\delta$ is computed using Eq. (18). Finally, a child is generated as per Eq. (16) and the constraint $t_{1}<t_{2}$ is checked. New values of $u$ are generated until the constraint is satisfied. Finally, the offspring population $Q^{t}$ is produced from the chromosomes in the $parent$ population.
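Polynomial mutation with the same retry rule can be sketched as below. The function names are hypothetical. For brevity the sketch perturbs both variables on every call, omits any per-variable mutation probability and any clipping to the bounds, and falls back to the parent after a fixed number of retries; these simplifications are assumptions.

```python
import random


def mutation_delta(u, eta_m):
    """Perturbation delta obtained from a uniform number u in [0, 1)."""
    if u < 0.5:
        return (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
    return 1.0 - (2.0 - 2.0 * u) ** (1.0 / (eta_m + 1.0))


def mutate(x, eta_m, rng, lower=0.0, upper=1.0, max_tries=1000):
    """Mutate a parent (t1, t2), retrying until the child satisfies t1 < t2."""
    for _ in range(max_tries):
        y = tuple(v + (upper - lower) * mutation_delta(rng.random(), eta_m) for v in x)
        if y[0] < y[1]:
            return y
    return tuple(x)  # assumption: keep the parent if no valid child is found


# Hypothetical parent; eta_m = 20 as in the experimental settings.
mutant = mutate((0.3, 0.7), 20, random.Random(2))
```

With a large distribution index such as 20, most perturbations are small, so the child usually stays close to its parent.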

In the experiments, the commonly used hyperparameters in the NSGA-II algorithm are set as follows [11, 17]. The crossover probability is 0.9; the mutation probability is $1/v$ , where $v$ is the number of decision variables, and here, $v=2$ ; and the distribution indexes of both the crossover and mutation operators are set to 20.

Elite-preservation selects the best $popsize$ individuals from the combined population $R^{t}$ . All individuals in $R^{t}$ are sorted based on non-domination, and the individuals in lower-ranked fronts are better solutions than those in higher-ranked fronts. The fronts ${\mathcal{F}_{1}}$ , ${\mathcal{F}_{2}}$ , and so on are added to $P^{t+1}$ in order for as long as the size of $P^{t+1}$ does not exceed $popsize$ . Suppose that adding the $i^{th}$ front would make the number of individuals exceed $popsize$ . The crowding distances of the solutions in ${\mathcal{F}_{i}}$ are then assigned and sorted, and the individuals with the largest crowding distances are added to $P^{t+1}$ until its size is exactly equal to $popsize$ .

The MOBA model requires two hyperparameters to be set, namely, the maximum abstention rates $p_{max}$ and $n_{max}$ with respect to the two classes. A larger abstention rate generally results in better performance on the classified examples, but increases the cost of dealing with rejected instances. Note that this trade-off always arises in abstaining classification applications. The hyperparameters can be set depending on the requirements of the application (associated with the performance) and the resource limitations (associated with the reject rate). The hyperparameter settings are presented in Sections 3.1 and 3.2. It is important to note that the abstention rates of the two classes can be set to different values. This allows the MOBA model to be applied in situations where the class distribution is imbalanced and the error costs are asymmetric. Furthermore, another advantage of the MOBA model is that a set of classifiers is generated, instead of just one classifier. This provides the ability to select and change classifiers without incurring the expense of retraining.

2.4 Methods of selecting the best abstaining classifier

In the MOBA model, the NSGA-II algorithm outputs multiple Pareto-optimal vectors $(t_{1},t_{2})$ , each of which corresponds to an abstaining classifier. For a fixed abstaining classifier, test examples can be classified or rejected according to Eq. (1), and essential metrics can be calculated. With this in mind, the methods of selecting the best classifier depend on the cost information and the metrics used to evaluate performance.

• If the complete costs (CTP/N, CFP/N, and CRP/N) are known, then the solution having the minimum expected cost (Eq. (3)) is theoretically optimal. Specifically, at the end of the 100th generation, 20 Pareto-optimal vectors are obtained, each of which corresponds to a confusion matrix with rejection. Hence, $tpr/tnr$ , $fpr/fnr$ , and $rpr/rnr$ can be computed, and correspondingly, a set of 20 expected costs can be obtained as per Eq. (3), from which the smallest one is selected.

• If the cost information is unknown, practitioners can compare the performance-abstention trade-offs of the Pareto-optimal solutions and select the best solution for a particular circumstance. For example, it is possible to preset tolerable maximum reject rates and then choose the classifier that has the best performance in terms of the accuracy, area under the ROC curve (AUC), etc. Once 20 confusion matrices with rejection are obtained, the corresponding reject rates can be computed, and eligible Pareto-optimal vectors are selected according to the preset maximum value. Among the selected vectors, the best solution is, for example, the one that obtains the highest AUC. Note that AUC is the average of $tpr$ and $tnr$ [20].
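Both selection routes can be sketched over a list of candidate solutions. The dictionary layout, the function names, and the two candidates are hypothetical; the rates are taken as given for each candidate, and the AUC is computed as the average of $tpr$ and $tnr$ as stated above.

```python
def expected_cost(c, p_pos, costs):
    """Expected cost of candidate c from its class-wise rates."""
    pos = costs["CFN"] * c["fnr"] + costs["CTP"] * c["tpr"] + costs["CRP"] * c["rpr"]
    neg = costs["CTN"] * c["tnr"] + costs["CFP"] * c["fpr"] + costs["CRN"] * c["rnr"]
    return p_pos * pos + (1.0 - p_pos) * neg


def pick_min_cost(candidates, p_pos, costs):
    """Known costs: choose the candidate with the smallest expected cost."""
    return min(candidates, key=lambda c: expected_cost(c, p_pos, costs))


def pick_max_auc(candidates, max_rpr, max_rnr):
    """Unknown costs: among candidates within the reject limits, maximize AUC."""
    eligible = [c for c in candidates if c["rpr"] <= max_rpr and c["rnr"] <= max_rnr]
    if not eligible:
        return None
    return max(eligible, key=lambda c: 0.5 * (c["tpr"] + c["tnr"]))


# Two made-up Pareto-optimal candidates and one made-up cost setting.
candidates = [
    {"t": (0.3, 0.7), "tpr": 0.70, "fnr": 0.10, "rpr": 0.20,
     "tnr": 0.80, "fpr": 0.05, "rnr": 0.15},
    {"t": (0.4, 0.6), "tpr": 0.80, "fnr": 0.15, "rpr": 0.05,
     "tnr": 0.85, "fpr": 0.10, "rnr": 0.05},
]
costs = {"CTP": 0.0, "CTN": 0.0, "CFP": 10.0, "CFN": 20.0, "CRP": 1.0, "CRN": 1.0}
by_cost = pick_min_cost(candidates, 0.4, costs)
by_auc = pick_max_auc(candidates, max_rpr=0.2, max_rnr=0.2)
```

In this toy setting the two criteria disagree: the wider reject region gives the lower expected cost (1.27 versus 1.85), whereas the narrower one gives the higher AUC (0.825 versus 0.75). If the costs change, only `pick_min_cost` needs to be rerun on the stored candidates.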

2.5 Superiority of MOBA over previous abstaining models

The advantages of the MOBA model are summarized as follows:

(a) The MOBA model does not require costs to be set. Although the abstention constraints involve two hyperparameters, both lie in the range (0,1), unlike the costs in previous models, which can take unbounded real values.

In the models based on unconditional optimization of the expected cost (Eq. (2)) and on conditional optimization with constraints (such as Eq. (4)), cost information must be provided, but unfortunately it is usually unknown. Hence, empirical costs or cost ratios are commonly set in particular applications such as intrusion detection [21], or cost models are used to evaluate statistical results (Section 3.1). In such cases, costs can be set to any real value, which leads to cost-dependent models. When the costs change, the models must be retrained at additional computational expense. In contrast, the MOBA model, which does not rely on cost information, only requires two abstention parameters in the range (0,1) to be set.

(b) The MOBA model is robust to varying conditions and demands because a set of Pareto-optimal vectors (corresponding to a set of abstaining classifiers) is generated.

As explained in Section 2.4, the optimal abstaining classifier can be selected according to the demands (e.g., obtaining a maximum F-measure or a minimum error rate under fixed abstentions) or conditions (whether the costs are known). Furthermore, if the costs change over time, no retraining of the MOBA model is required, as a new optimal abstaining classifier can be determined simply by recalculating the expected costs.

(c) Compared to the BA model, the MOBA model can control the respective performance of the two classes via the class-dependent abstention constraints, so the MOBA model is more applicable to imbalanced datasets and cost-sensitive problems.

The BA model has an overall reject rate constraint. When datasets are imbalanced, the reject rates of the two classes may be imbalanced even though the overall reject rate constraint is satisfied. When the two abstention parameters are set to the same value, the MOBA model can avoid imbalanced reject rates because of its class-dependent abstention constraints. In addition, the two abstention parameters in the MOBA model can take different values when dealing with imbalanced datasets.

3 Experimental results

In the experiments, the proposed MOBA model was compared with two abstaining classification models: one that considers costs (Section 3.1) and one that does not (Section 3.2). Table 2 lists the datasets used in the experiments, which are available in the KEEL-dataset repository [22]. Several of these datasets are associated with cost-sensitive classification tasks, such as pima and credit-g. For pima, the task is to predict whether the patient is diabetic. The cost of missing a diabetic patient is higher than that of misdiagnosing a nondiabetic patient. Credit-g is a dataset from the financial domain, which classifies people as good or bad credit customers. It is worse to classify bad credit customers as good than the reverse. Each dataset was divided into three distinct subsets: the training set (containing 60% of the examples), used to train a scoring classifier that generates confidence scores, and the validation and test sets, each of which contained 20% of the examples. The validation set was used to determine the rejection thresholds, while the classification performance was evaluated on the test set. All the experimental results were obtained using MATLAB R2017a.
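The 60/20/20 partition can be sketched as follows. This is an illustration in Python rather than the MATLAB code used in the experiments; the helper `split_indices` is hypothetical, and a plain random shuffle is assumed because the text does not say whether the split was stratified.

```python
import random


def split_indices(n_examples, seed=0):
    """Shuffle indices and cut them 60/20/20 into train/validation/test."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding at the cut points.
    n_train = (n_examples * 60) // 100
    n_val = (n_examples * 20) // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
```

The scoring classifier would be fitted on `train_idx`, the rejection thresholds chosen on `val_idx`, and the reported metrics computed on `test_idx`.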

3.1 Comparison of the results when the costs are considered

In this section, the results of the MOBA model when costs are taken into account are compared with those of Tortorella’s model, in which an ROC convex hull (ROCCH) curve was constructed from the confidence scores and the rejection thresholds were determined from the tangents of the ROCCH curve [7, 8, 23]. In Tortorella’s model, when the condition

[TABLE]

was not satisfied, the reject option could not be activated. That is to say, the traditional classifier without rejection could provide the minimal cost. In this experiment, the twin support vector machine (TWSVM) [24] was used as the scoring classifier to generate confidence scores [25]. Four cost models [25] shown in Table 3 were used, where U[a,b] denotes a uniform distribution over the interval [a,b]. Note that while CRP and CRN had equal values in CM4 in [25], in this experiment, CRP and CRN had different values in CM4 as class-dependent reject costs were considered.

A Wilcoxon rank sum test [25, 7] was performed to compare the two cost-related abstaining models. The details of the Wilcoxon rank sum test can be found in [25]. In this test, 1000 cost matrices (CTP/N, CFP/N, and CRP/N) were generated for each cost model in Table 3. Then, for each cost matrix, the expected cost was computed as per Eq. (1). Finally, the numbers of cases in which the cost of the MOBA model was lower than, higher than, or identical to the cost of Tortorella’s model were counted. Two scenarios resulted in identical costs: 1) the costs of the compared methods were equal; or 2) for a certain cost matrix, the reject option in Tortorella’s model was not activated, in which case no MOBA model was constructed. The hyperparameters of the MOBA model were set as follows. First, Tortorella’s model was applied and the reject rates of the two classes were obtained. Then, these reject rates were assigned to $p_{max}$ and $n_{max}$ . The two reject rates obtained via Tortorella’s model can be regarded as good candidates that avoid setting the hyperparameters of the MOBA model blindly.
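The counting step of this procedure can be sketched as below. The sketch makes several assumptions: the expected cost is taken in a prior-weighted form that may differ from Eq. (1), the sampling ranges in `sample_costs` are placeholders and are not the cost models of Table 3, and the rates of the two models are invented. It covers only the lower/higher/identical tally, not the rank sum statistic, and it does not model the second identical-cost scenario (an inactive reject option).

```python
import math
import random


def expected_cost(rates, costs, prior_pos):
    """rates = (fnr, rej_pos, fpr, rej_neg); costs has keys ctp, cfn, crp, ctn, cfp, crn."""
    fnr, rej_pos, fpr, rej_neg = rates
    pos = costs["ctp"] * (1 - fnr - rej_pos) + costs["cfn"] * fnr + costs["crp"] * rej_pos
    neg = costs["ctn"] * (1 - fpr - rej_neg) + costs["cfp"] * fpr + costs["crn"] * rej_neg
    return prior_pos * pos + (1 - prior_pos) * neg


def sample_costs(rng):
    """Placeholder cost model; these ranges are NOT the ones in Table 3."""
    return {
        "ctp": rng.uniform(-1, 0), "ctn": rng.uniform(-1, 0),
        "cfn": rng.uniform(1, 10), "cfp": rng.uniform(1, 10),
        "crp": rng.uniform(0, 1), "crn": rng.uniform(0, 1),
    }


def count_outcomes(rates_a, rates_b, prior_pos, n_trials=1000, seed=0):
    """Count trials where model A costs less than, more than, or the same as model B."""
    rng = random.Random(seed)
    lower = higher = identical = 0
    for _ in range(n_trials):
        costs = sample_costs(rng)
        cost_a = expected_cost(rates_a, costs, prior_pos)
        cost_b = expected_cost(rates_b, costs, prior_pos)
        if math.isclose(cost_a, cost_b, rel_tol=1e-12, abs_tol=1e-12):
            identical += 1
        elif cost_a < cost_b:
            lower += 1
        else:
            higher += 1
    return lower, higher, identical


rates_a = (0.10, 0.05, 0.10, 0.05)  # hypothetical model A
rates_b = (0.15, 0.05, 0.15, 0.05)  # hypothetical model B: same rejection, more errors
counts = count_outcomes(rates_a, rates_b, prior_pos=0.35)
```

Each triple returned by `count_outcomes` corresponds to one cell of a table such as Table 4, with one call per dataset and cost model.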

The results of the Wilcoxon rank sum test are shown in Table 4. In each scenario (each dataset in each cost model), the three figures from top to bottom represent the numbers of lower, higher, and identical costs of the MOBA model relative to the costs obtained via Tortorella’s model, respectively. The table shows that the lower-cost case was the most frequent for the MOBA model in almost all of the scenarios in the CM1, CM2, and CM3 cost models, while for most scenarios in the CM4 cost model, the identical-cost case was the most frequent. This is because the number of inactivated reject options increased. Among the remaining lower- and higher-cost cases, the MOBA model provided a lower cost than Tortorella’s model in the vast majority of cases. In addition, for a certain cost matrix, the reject option may not be activated in Tortorella’s model, whereas rejection can still be enforced in the MOBA model. This is an advantage of the MOBA model over Tortorella’s model; that is, the MOBA model is not tied to a particular cost matrix.

3.2 Comparison of the results when the costs are not considered

In this section, the results of the MOBA model without cost information are compared with those of the BA model [9]. In the BA model, the cost ratio of CFN to CFP was set to one, which means that CFN and CFP were equal. For this reason, $p_{max}$ and $n_{max}$ in the MOBA model were set to the same value. When $p_{max}=n_{max}$ , the overall reject rate $k_{max}$ in Eq. (4) was equal to the reject rate of each class ( $p_{max}$ or $n_{max}$ ), which ensures the comparability of the BA and MOBA models. The abstention parameters were set from 0.01 to 0.3 in steps of 0.02 since larger rejection rates are usually of no significance in practical applications [26]. For models that do not consider the costs, the classification results are typically evaluated based on the corresponding performance-rejection curves. Here, the accuracy (ACC), AUC, and G-mean (G) were used as the evaluation metrics, where the G-mean is the geometric mean of the sensitivity and specificity. That is, ACC-Rej, AUC-Rej, and G-Rej curves were obtained for each dataset. In this paper, only the trade-off curves of the datasets pima, haberman, cmc, and transfusion are shown (Figures 3-6) to discuss the performance of the two compared models. Similar trade-off curves were obtained for the other datasets.
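The equality between the overall and class-wise reject rates follows from a short calculation. The notation here is an assumption introduced for illustration: $P$ and $N$ are the numbers of positive and negative examples, and $R_P$ and $R_N$ are the numbers of rejected positives and negatives. The overall reject rate is a weighted average of the two class-wise rates:

$$
\frac{R_P + R_N}{P + N} = \frac{P}{P+N}\cdot\frac{R_P}{P} + \frac{N}{P+N}\cdot\frac{R_N}{N}
$$

If both class-wise reject rates take the same value $r$, i.e. $R_P/P = R_N/N = r$, the right-hand side reduces to

$$
\frac{P}{P+N}\, r + \frac{N}{P+N}\, r = r ,
$$

regardless of the class proportions. The converse does not hold: an overall rate of $r$ can arise from very different class-wise rates when $P \neq N$.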

Overall, the MOBA model provided a better trade-off between the performance (accuracy, AUC, and G-mean) and rejection than the BA model. In other words, for a fixed reject rate, the MOBA model provided higher values for the accuracy, AUC, and G-mean than those of the BA model. For the MOBA model, the values of the accuracy, AUC, and G-mean grew with increasing reject rates, while for the BA model, only the ACC-Rej curves exhibited an increasing trend and the AUC and G-mean values decreased in Figures 4-6. This is because the BA model minimized the error rate under an overall reject rate and did not consider the class-dependent performance. According to the definitions, the values of the AUC and G-mean were large only when both sensitivity and specificity were large. Thus, the MOBA model was shown to provide more balanced sensitivity and specificity values. The fluctuations observed in the trade-off curves may be attributed to the small sample size and randomness when constructing the models.
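The contrast between accuracy and G-mean under a fixed reject rate can be reproduced with a small sketch. It assumes that accuracy, sensitivity, and specificity are computed on accepted (non-rejected) examples only, which may differ from the definitions used in the paper; the function name and all counts are hypothetical.

```python
import math


def metrics_with_rejection(tp, fn, rp, tn, fp, rn):
    """Accuracy, G-mean, and reject rate; accuracy and G-mean use accepted examples only."""
    accepted = tp + fn + tn + fp
    total = accepted + rp + rn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "acc": (tp + tn) / accepted,
        "gmean": math.sqrt(sensitivity * specificity),
        "reject_rate": (rp + rn) / total,
    }


# Hypothetical counts on 100 positives and 900 negatives, both with a 10% reject rate.
balanced = metrics_with_rejection(tp=80, fn=10, rp=10, tn=720, fp=90, rn=90)
skewed = metrics_with_rejection(tp=45, fn=45, rp=10, tn=790, fp=20, rn=90)
```

In this example the `skewed` classifier has the higher accuracy but the lower G-mean, because its sensitivity is only 0.5. This is the kind of behaviour that minimizing the overall error rate alone can produce on imbalanced data.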

4 Conclusions

In this paper, the MOBA model was proposed for cost-sensitive problems where the error costs are usually unknown and asymmetric. The MOBA model avoids setting cost information by optimizing essential metrics. A significant advantage of the MOBA model is its robustness to different conditions, such as known or unknown cost information, costs that evolve over time, and the use of different metrics to evaluate performance. The MOBA model can also accommodate imbalanced classification problems because it optimizes the respective performance of the two classes under class-dependent abstention constraints [27, 28]. The experimental results in this study showed that the MOBA model performed better than previous models. When the costs were known, the MOBA model obtained a greater number of lower-cost cases in the Wilcoxon rank sum test than did Tortorella’s model. When costs were not considered, the MOBA model achieved better trade-offs between performance (accuracy, AUC, and G-mean) and abstention than the BA model.

Abstaining classification has wide applications in safety-critical fields, such as medical diagnosis, fault detection, and credit assessment. A rejection means that it is difficult to make a definite decision given the current knowledge, and that more knowledge should be provided to reduce misclassification costs. The proposed MOBA model is promising owing to its advantages and better performance. In the future, we intend to extend the MOBA model to multi-class problems and apply it to specific applications.

Declarations of interest

None.

References

Chen et al. [2018]

X. Chen, P. Wang, Y. Hao, M. Zhao,

Evidential KNN-based condition monitoring and early warning method with applications in power plant,

Neurocomputing (2018).

Kang et al. [2017]

S. Kang, S. Cho, S.-j. Rhee, K.-S. Yu,

Reliable prediction of anti-diabetic drug failure using a reject option,

Pattern Analysis and Applications 20 (2017) 883–891.

Lin et al. [2018]

D. Lin, L. Sun, K.-A. Toh, J. B. Zhang, Z. Lin,

Biomedical image classification based on a cascade of an svm with a reject option and subspace analysis,

Computers in biology and medicine 96 (2018) 128–140.

Wang et al. [2017]

Z. Wang, Z. Wang, S. He, X. Gu, Z. F. Yan,

Fault detection and diagnosis of chillers using bayesian network merged distance rejection and multi-source non-sensor information,

Applied energy 188 (2017) 200–214.

Chow [1970]

C. Chow,

On optimum recognition error and reject tradeoff,

IEEE Transactions on information theory 16 (1970) 41–46.

Chow [1957]

C.-K. Chow,

An optimum character recognition system using decision functions,

IRE Transactions on Electronic Computers (1957) 247–254.

Tortorella [2004]

F. Tortorella,

Reducing the classification cost of support vector classifiers through an ROC-based reject rule,

Pattern Analysis and Applications 7 (2004) 128–143.

Tortorella [2000]

F. Tortorella,

An optimal reject rule for binary classifiers,

in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2000, pp. 611–620.

Pietraszek [2007]

T. Pietraszek,

On the use of ROC analysis for the optimization of abstaining classifiers,

Machine Learning 68 (2007) 137–169.

Vanderlooy et al. [2009]

S. Vanderlooy, I. G. Sprinkhuizen-Kuyper, E. N. Smirnov, H. J. van den Herik,

The ROC isometrics approach to construct reliable classifiers,

Intelligent Data Analysis 13 (2009) 3–37.

Deb et al. [2002]

K. Deb, A. Pratap, S. Agarwal, T. Meyarivan,

A fast and elitist multiobjective genetic algorithm: NSGA-II,

IEEE transactions on evolutionary computation 6 (2002) 182–197.

Zitzler et al. [2000]

E. Zitzler, K. Deb, L. Thiele,

Comparison of multiobjective evolutionary algorithms: Empirical results,

Evolutionary computation 8 (2000) 173–195.

Zitzler et al. [2001]

E. Zitzler, M. Laumanns, L. Thiele,

SPEA2: Improving the strength pareto evolutionary algorithm,

TIK-report 103 (2001).

Corne et al. [2001]

D. W. Corne, N. R. Jerram, J. D. Knowles, M. J. Oates,

PESA-II: Region-based selection in evolutionary multiobjective optimization,

in: Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation, Morgan Kaufmann Publishers Inc., 2001, pp. 283–290.

Srinivas and Deb [1994]

N. Srinivas, K. Deb,

Muiltiobjective optimization using nondominated sorting in genetic algorithms,

Evolutionary computation 2 (1994) 221–248.

Fawcett [2006]

T. Fawcett,

An introduction to roc analysis,

Pattern recognition letters 27 (2006) 861–874.

Agrawal et al. [1995]

R. B. Agrawal, K. Deb, R. Agrawal,

Simulated binary crossover for continuous search space,

Complex systems 9 (1995) 115–148.

Kakde [2004]

M. R. O. Kakde,

Survey on multiobjective evolutionary and real coded genetic algorithms,

in: Proceedings of the 8th Asia Pacific symposium on intelligent and evolutionary systems, Citeseer, 2004, pp. 150–161.

Beyer and Deb [2001]

H.-G. Beyer, K. Deb,

On self-adaptive features in real-parameter evolutionary algorithms,

IEEE Transactions on evolutionary computation 5 (2001) 250–270.

López et al. [2013]

V. López, A. Fernández, S. García, V. Palade, F. Herrera,

An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics,

Information Sciences 250 (2013) 113–141.

Pietraszek [2007]

T. Pietraszek,

Classification of intrusion detection alerts using abstaining classifiers,

Intelligent Data Analysis 11 (2007) 293–316.

Alcalá-Fdez et al. [2011]

J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera,

Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework.,

Journal of Multiple-Valued Logic & Soft Computing 17 (2011).

Santos-Pereira and Pires [2005]

C. M. Santos-Pereira, A. M. Pires,

On optimal reject rules and roc curves,

Pattern recognition letters 26 (2005) 943–952.

Khemchandani et al. [2007]

R. Khemchandani, S. Chandra, et al.,

Twin support vector machines for pattern classification,

IEEE Transactions on pattern analysis and machine intelligence 29 (2007) 905–910.

Lin et al. [2017]

D. Lin, L. Sun, K.-A. Toh, J. B. Zhang, Z. Lin,

Twin SVM with a reject option through roc curve,

Journal of the Franklin Institute (2017).

Simeone et al. [2012]

P. Simeone, C. Marrocco, F. Tortorella,

Design of reject rules for ecoc classification systems,

Pattern Recognition 45 (2012) 863–875.

He and Garcia [2008]

H. He, E. A. Garcia,

Learning from imbalanced data,

IEEE Transactions on Knowledge & Data Engineering (2008) 1263–1284.

Xiao et al. [2017]

W. Xiao, J. Zhang, Y. Li, S. Zhang, W. Yang,

Class-specific cost regulation extreme learning machine for imbalanced classification,

Neurocomputing 261 (2017) 70–82.
