Logitron: Perceptron-augmented classification model based on an extended logistic loss function
Hyenkyun Woo

TL;DR
Logitron introduces a novel classification framework combining extended logistic and Perceptron losses, connecting SVM and logistic regression, with flexible parameterization that improves classification accuracy.
Contribution
This work proposes the Logitron model, a new convex classification method that unifies SVM and logistic regression through a parameterized extended logistic loss.
Findings
Hinge-Logitron with k=4 outperforms logistic regression and SVM in accuracy.
Even with k=-1, Hinge-Logitron maintains classification calibration and efficiency.
The model demonstrates low computational cost and flexible loss function design.
Hyenkyun Woo, School of Liberal Arts, Korea University of Technology and Education
Abstract
Classification is the most important process in data analysis. However, due to the inherent non-convex and non-smooth structure of the zero-one loss function of the classification model, various convex surrogate loss functions such as hinge loss, squared hinge loss, logistic loss, and exponential loss have been introduced. These loss functions have been used for decades in diverse classification models, such as SVM (support vector machine) with hinge loss, logistic regression with logistic loss, and Adaboost with exponential loss. In this work, we present a Perceptron-augmented convex classification framework, Logitron. Its loss function is a smoothly stitched function of the extended logistic loss and the famous Perceptron loss function. The extended logistic loss function is a parameterized function built on the extended logarithmic function and the extended exponential function. The main advantage of the proposed Logitron classification model is that it shows the connection between SVM and logistic regression via polynomial parameterization of the loss function. In more detail, depending on the choice of parameters, we have the Hinge-Logitron, which has the generalized $k$-th order hinge loss with an additional $k$-th root stabilization function, and the Logistic-Logitron, which has a logistic-like loss function with relatively large $|k|$. Interestingly, even with $k=-1$, Hinge-Logitron satisfies the classification-calibration condition and shows reasonable classification performance with low computational cost. The numerical experiment in the linear classifier framework demonstrates that Hinge-Logitron with $k=4$ (the fourth-order SVM with the fourth root stabilization function) outperforms logistic regression, SVM, and other Logitron models in terms of classification accuracy.
Index Terms:
Extended exponential function, extended logarithmic function, logistic regression, extended logistic regression, sigmoid, extended sigmoid function, hinge loss, higher-order hinge loss, support vector machine, Perceptron
I Introduction
Learning a decision boundary for the classification of data observed in the real world is a fundamental process in machine learning [31, 36], and various classification models have been introduced over the last several decades, for instance, logistic regression [14], SVM (support vector machine) [39], decision trees [8], random forests [9], neural networks [35, 5], and boosting [19, 21, 12]. Among these classification models, logistic regression is a popular probability-based model [37]. In this work, we are mainly interested in a convex classification model, Logitron, built from the classic Perceptron loss function and the extended logistic loss function, which is not a specific loss function but a polynomially parameterized loss function based on the extended logarithmic function [42] and the extended exponential function [43]. Note that the extended logistic loss function includes many surrogate loss functions appearing in various margin-based classification models, for instance, the unhinged loss [34], exponential loss [19], logistic loss [14, 21], and the sigmoid function [30] and its variant, the Savage loss [28]. Among them, the non-convex loss functions and the unbounded convex loss function, e.g., sigmoid, Savage loss, and unhinged loss, are mainly used for robust boosting classification models. Last but not least, [16] has introduced $t$-logistic regression based on the $t$-exponential family for robustness of the classification model.
Let us start with the standard binary classification model [27, 36, 39]. A formal binary classifier is simply defined as where if and otherwise. Here is a predictor (or score function) and is a feature space and is a constant. Note that where is a function space defined based on a category of classification models. For instance, when we learn a hyper-plane of the feature space, we set with and is a constant. For more advanced classification models such as ensemble learning models and (deep) neural networks, a sophisticated function space is required. For ensemble learning models [36], i.e., boosting and bagging, where is a function space of so-called base (or weak) classifiers. For (deep) neural network [44, 23, 17, 13, 5], with and . Here , which is known as an activation function, is the only nonlinear function in neural network. A typical example is the sigmoid function [13]. Recently, function-based rectified linear unit (i.e., ReLU) is used as an activation function for deep neural network [23]. For kernel-based learning model, which is a straightforward extension of the linear classifier, we can set . For more details on various classification model and the corresponding function space, see [31, 27, 44, 7] and references therein. Unless otherwise stated, in this work, we assume that is a linear function space
[TABLE]
Now, the question is: from the collected training data with , how can we find the right prediction function minimizing ? A simple approach is to directly minimize the misclassification error (i.e., the zero-one loss function [32]), where and is an indicator function that equals one if its argument holds and zero otherwise. Although the zero-one loss function is simple and easy to understand, it is non-differentiable and non-convex, and finding its global optimum is NP-hard [32]. Instead of using the binary-valued zero-one loss function, we can consider convex relaxations of it. For instance, we have the classic Perceptron loss function and the corresponding minimization problem (i.e., Perceptron [35]):
[TABLE]
where is linearly penalized with respect to only if . Actually, it is easy to find a solution of the Perceptron model (2) with the subgradient-based method, known as the Perceptron algorithm. The main concern of (2) is that it is sensitive to the noise (or data) near the decision boundary, i.e., In fact, (2) does not have sufficient margin. As a solution of the insufficiency of margin, we can consider higher-order SVM [39, 4, 24]:
[TABLE]
where and is the higher-order hinge-loss function. In particular, when $k=1$, (3) is the classic SVM, known as the max-margin classifier, with the first-order hinge loss function [39], and when $k=2$, it is known as L2SVM (or squared SVM) [18]. Recently, the third-order hinge loss function was introduced as an activation function for deep neural networks [24]. To the best of the author's knowledge, the $k$-th order hinge loss function with has not been introduced in the literature. In this work, we study the stabilized $k$-th order SVM, which has arbitrary , within the proposed Logitron framework.
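To make the comparison between these surrogate losses concrete, the following minimal Python sketch evaluates the Perceptron loss and the higher-order hinge loss as functions of the margin value $y f(x)$. The function names, the `order` argument, and the unit margin are illustrative assumptions; any normalization constant used in (3) is omitted.

```python
def perceptron_loss(margin: float) -> float:
    """Perceptron loss: a linear penalty only when the margin y*f(x) is negative."""
    return max(0.0, -margin)


def hinge_loss(margin: float, order: int = 1) -> float:
    """Hinge term max(0, 1 - margin) raised to the given order (assumed form)."""
    return max(0.0, 1.0 - margin) ** order


# Compare the losses at a misclassified point, a point inside the margin,
# and a point classified with a comfortable margin.
for m in (-1.0, 0.5, 2.0):
    print(m, perceptron_loss(m), hinge_loss(m, 1), hinge_loss(m, 2))
```

At a margin of 0.5 the Perceptron loss is already zero while every hinge loss is still positive, which illustrates the lack of margin discussed above.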
As observed in [31], the misclassification error can also be formulated with the sigmoid probability function and the corresponding classifier . In fact, by using the negative log-likelihood of the Bernoulli distribution which has the sigmoid function as the probability density function, we get the famous logistic loss function and the corresponding logistic regression formulation:
[TABLE]
where . The main advantage of this model is that the logistic loss function is sufficiently smooth and its gradient is the sigmoid probability function. That is, let then we have . Although logistic regression is a typical example of the margin-based classification model, since for all , it is unclear how to connect this model to the SVM, the max-margin classifier.
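The relation between the logistic loss and the sigmoid can be checked numerically. The sketch below assumes the loss is written as a function of the margin value and uses hypothetical helper names; under this convention the derivative is the negative of the sigmoid evaluated at the negated margin, so its magnitude never exceeds one.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic sigmoid 1 / (1 + exp(-t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_loss(margin: float) -> float:
    """Stable evaluation of log(1 + exp(-margin))."""
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def logistic_loss_grad(margin: float) -> float:
    """Derivative of the logistic loss with respect to the margin."""
    return -sigmoid(-margin)


# Central finite difference as an independent check of the derivative.
x, h = 0.3, 1e-6
numeric = (logistic_loss(x + h) - logistic_loss(x - h)) / (2.0 * h)
print(numeric, logistic_loss_grad(x))
```

The branch on the sign of the argument avoids overflow in `math.exp` for margins of large magnitude.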
The proposed Logitron, having the Perceptron-augmented extended convex logistic loss function, is inherently similar to logistic regression with an additional margin control parameter. Roughly, we can say that the Logitron is the generalized $k$-th order SVM with an additional stabilizing $k$-th root function (). Depending on the choice of parameters, we have the Hinge-Logitron, with a hinge-like loss function for relatively small values of , and the Logistic-Logitron, with a logistic-like loss function for relatively large values of . In terms of the logistic regression framework, when is relatively large, the generalized $k$-th order SVM corresponds to the exponential function and the stabilizing $k$-th root function corresponds to the logarithmic function. Interestingly, even , we have a classification model which satisfies the classification-calibration condition [4]. In fact, when , the Hinge-Logitron is implementable with simple elementary mathematical operations such as division and shows reasonable classification performance. Note that the margin of the Logitron loss function is defined as the intersection point of the closure of the domain of the extended exponential function and the Perceptron loss function. When the intersection point is located on the positive real line (), it corresponds to the classic margin. Interestingly, the Logitron loss function is sufficiently smooth on its entire domain under a mild restriction on the parameter and, therefore, we can easily use conventional gradient-based optimization methods to find a solution of the Logitron model.
As regards the numerical experiments, for multi-class classification problems we have used the OVA (one-vs-all) framework. The Hinge-Logitron H-4 (i.e., the fourth-order SVM with the fourth-root stabilization function) shows the best performance in learning hyperplanes (1). Compared to the conventional second-order SVM, known as L2SVM [18], the proposed Hinge-Logitron H-2 (i.e., the second-order SVM with a root stabilizer function) shows better classification accuracy. The Logistic-Logitron L- (i.e., a group of the Logitron model with ) shows the best performance with respect to the Friedman ranking [15]. As a by-product of the generalization to the negative region of , we obtain a new classification-calibrated classification model. This new model also shows better classification accuracy than the conventional logistic regression and SVM.
I-A Notation
We briefly review a convex function and related useful notations such as extended-valued function. See [33, 26, 10] for more details.
Let be a convex, lower semicontinuous, and proper function on its convex domain
[TABLE]
As observed in [26], the convexity of can be extended to the whole real line by using the extended-valued function :
[TABLE]
where and . Depending on applications [42], can be any convex set in . Unless otherwise stated, as suggested in [26], a convex function in this work is an extended-valued convex function (6) and, for simplicity, we will drop the superscript "e" in the extended-valued function . In (an extended-valued real number system), we introduce several arithmetic operations with which are useful later. That is, for all , (it means ), (it means ), and (it means with and ).
Let be any convex set in . Then is the interior of and is the boundary of . Here is the closure of . We also set , , , and . The corresponding negative intervals are also defined in the same way. Note that is the set of rational numbers, is the set of integers, and is the set of natural numbers. Additionally, is always assumed to be a convex set, irrespective of convexity of .
I-B Overview
The paper is organized as follows. In Section II, we review the extended exponential and logarithmic functions studied in [42, 43]. In Section III, we introduce the extended logistic loss function, defined with the extended exp and log functions, and the corresponding general classification framework, Logitron. Its loss function is a smooth stitching of the Perceptron loss and a restricted version of the extended logistic loss. In Section IV, we reinterpret the Logitron as the generalized $k$-th order SVM with the $k$-th root stabilization function. Here . Actually, L2SVM, known as the SVM with squared hinge loss, can be reformulated into the Hinge-Logitron H-2 with an additional root stabilization function. In Section V, we evaluate the performance of the proposed Logitron on more than one hundred datasets [15]. The conclusions are given in Section VI.
II Extended exponential function and extended logarithmic function
In this Section, we review the extended exponential function [43] and its inverse function, the extended logarithmic function [42]. These extended elementary functions are fundamental ingredients of the extended logistic loss function and the Logitron classification model.
Firstly, let us start with the definition of an extended logarithmic function [42]. It is a generalized logarithmic function [2, 1, 38] with an additional scaling parameter. Later, we will explain the role of this additional parameter in detail in terms of the margin of the Logitron classification model.
Definition II.1.
Let Then the extended logarithmic function is defined as
[TABLE]
where . After integration, we have a simplified version of it by
[TABLE]
The convexity of depends on parameters and . See [42] for a more detailed characterization of the domain of . In fact, is rather complicated. As observed in [42], the domain of should be determined to meet the requirements of applications, such as the $\beta$-divergence [42] and the statistical Tweedie distribution [43]. If we set , the extended log function becomes the generalized log function [2, 38].
Secondly, we introduce an extended exponential function [43], the scaled version of the generalized exponential function [2, 38, 16]. Note that the scaling parameter of the extended exponential function is very important in the Logitron loss function, since it controls the margin of the classification model unlike the generalized exp function.
Definition II.2.
Let and
[TABLE]
where is defined to satisfy the following relation:
[TABLE]
where After integration, we get a simplified version of it by
[TABLE]
If we set , the extended exponential function becomes the generalized exponential function in [2, 16]. The convexity of the extended exp function depends on parameters and thus the structure of is complicated [43]. Even worse, the extended exponential function defined in Definition II.2 does not have an inverse relation with the extended log function defined in Definition II.1. Additionally, as observed in [42, 43], their domains should be carefully selected to meet various conditions related to the high-level structures. A typical example is the condition of a convex function of Legendre type [33]. With restricted domains satisfying this condition, it is possible to obtain a rather complicated dual relation between the $\beta$-divergence and the Tweedie distribution [43, 42, 25, 3].
In this work, we are going to use extended exp and log functions for classification purpose only. Hence, we significantly reduce domains of them. See Table I for more details.
Now, and with domains in Table I are convex and extended-valued functions. We summarize various properties of them below.
Proposition II.3.
Let . Then the extended exp function and the extended log function have the following properties with the domains in Table I. Here and .
1. , is strictly increasing and, , is strictly increasing.
2. and are convex functions on their domains.
Proof.
1) Under domains in Table I, it is easy to see is strictly increasing. In case of , since for all , is strictly increasing when .
2) Since is convex for all , is a convex function on its domain in Table I. For all , we have and convexity can be easily extended to the boundary of the domain in Table I.
Now, we will show that the extended exponential function (11) and the extended logarithmic function (8) are well-defined (i.e., and ). Actually, we show the isomorphic inverse relation between (8) and (11) below under the restricted domains in Table I.
Lemma II.4.
Let and . Then we have the bijective mapping between the extended log (8) and the extended exp (11) functions with the restricted domains in Table I:
[TABLE]
with the corresponding inverse map
[TABLE]
Note that the proof of Lemma II.4 can be easily derived from Table I and the definitions of the extended exp (11) and extended log (8). The following lemma is useful when we define the loss function for the classification model. In fact, the range of the extended exponential function always equals the domain of the extended logarithmic function, irrespective of the choice of parameters and .
Lemma II.5.
For any , , and domains in Table I, we have
[TABLE]
Proof.
Due to the isomorphic mapping in Lemma II.4 (i.e., and ) on domains defined in Table I, we have
[TABLE]
As observed in Table I, the domain of does not depend on the choice of and . Hence, we have
[TABLE]
for any choice of and .
The independence from the parameters and established in Lemma II.5 is very useful when we characterize the structure of the extended logistic loss function in the coming Section.
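As a small numerical companion to this section, the sketch below implements the standard unscaled generalized (deformed) logarithm and exponential, which the text cites as the special case of the extended functions, and checks that they invert each other on a restricted domain. The parameter name `alpha` and the function names are assumptions, and the way the additional scaling parameter enters the extended versions is deliberately not reproduced here.

```python
import math


def gen_log(u: float, alpha: float) -> float:
    """Generalized logarithm (u^(1-alpha) - 1) / (1 - alpha); log(u) when alpha = 1."""
    if u <= 0.0:
        raise ValueError("gen_log is only evaluated for u > 0 in this sketch")
    if alpha == 1.0:
        return math.log(u)
    return (u ** (1.0 - alpha) - 1.0) / (1.0 - alpha)


def gen_exp(x: float, alpha: float) -> float:
    """Generalized exponential (1 + (1-alpha) x)^(1/(1-alpha)); exp(x) when alpha = 1."""
    if alpha == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - alpha) * x
    if base <= 0.0:
        raise ValueError("x lies outside the restricted domain used in this sketch")
    return base ** (1.0 / (1.0 - alpha))


# Round trip u -> gen_log -> gen_exp for several parameter values.
for alpha in (0.5, 1.0, 2.0):
    for u in (0.5, 2.0):
        print(alpha, u, gen_exp(gen_log(u, alpha), alpha))
```

Raising an error outside the restricted domain mirrors the point made above: the inverse relation is only claimed after the domains have been reduced.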
III Logitron: An extended Logistic regression classification model augmented with the Perceptron
This Section introduces a general classification framework: the Logitron classification model with the Perceptron-augmented extended logistic loss function.
Let us start with the extended logistic loss function, which is a simple combination of and in the logistic regression style. In fact, it covers many loss functions appearing in classification such as exponential loss, (extended) sigmoid function, the Savage loss function and so on.
Definition III.1.
Let and . Then the extended logistic loss function is defined as
[TABLE]
where . Note that is the restricted domain in Table I.
By virtue of Lemma II.5, the extended logistic loss in (13) is well defined with the restricted domain in Table I, irrespective of the choices of and . The classic logistic loss (4) is recovered when we set , irrespective of the choice of the auxiliary parameter . Since we do not put any constraints on and , it is unclear when the extended logistic loss (13) acts like the conventional logistic loss function (4). The following theorem gives a partial answer in terms of the convexity of (13).
Theorem III.2.
Let with and . Then the extended logistic loss function  (13) is convex on .
Proof.
Let us assume that . Then and . For all , we have
[TABLE]
Now, we only need to extend convexity to the boundary From Table I, we have when . In fact, from the convexity of , we have
[TABLE]
where , and By sending , we can easily extend convexity up to the .
Although the nonconvex extended logistic loss () is not the main concern of this work, it is worth mentioning. As observed in [20], a nonconvex loss has some advantages in terms of robustness against label noise. Actually, various nonconvex loss functions have been proposed in boosting [30, 20, 32, 28], and most of them are a subclass of the extended logistic loss. In the following example, we demonstrate the higher-order sigmoid function, which is a typical example of the nonconvex extended logistic loss function (13). Such functions are known as robust loss functions in boosting [28] or activation functions [13, 35] in (multilayer Perceptron) neural networks.
Example III.3 (higher-order sigmoid function).
Let us consider the extended logistic loss function with () and (higher-order sigmoid function):
[TABLE]
where is a sigmoid function and . Note that for all . In fact, the Savage loss function [28] is the second-order sigmoid function () and the activation function in multilayer Perceptron neural network [13, 35] is the first-order sigmoid function ().
- First-order sigmoid (): where .
- Second-order sigmoid (): where , and for all . In [28], the authors introduced as the Savage loss function in boosting. This model is known to be more robust to label noise than other boosting models having convex loss functions, such as Adaboost [19] and LogitBoost [21]. However, among convex loss functions, LogitBoost with the logistic loss is more robust than Adaboost with the exponential loss [21].
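The bounded, nonconvex shape of such sigmoid-type losses can be seen with a short numerical check. The sketch below assumes the order-$n$ loss is the sigmoid of the negated margin raised to the power $n$; the function names and the exact scaling of the argument are illustrative assumptions and may differ from the example above and from [28].

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic sigmoid."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sigmoid_power_loss(margin: float, order: int = 1) -> float:
    """Assumed higher-order sigmoid loss: sigmoid(-margin) ** order."""
    return sigmoid(-margin) ** order


def violates_midpoint_convexity(loss, a: float, b: float) -> bool:
    """A convex function must satisfy loss(mid) <= average of the endpoint values."""
    return loss(0.5 * (a + b)) > 0.5 * (loss(a) + loss(b))


for order in (1, 2):
    loss = lambda m, n=order: sigmoid_power_loss(m, n)
    print(order, violates_midpoint_convexity(loss, -6.0, 0.0))
```

Because the loss stays below one for every margin, a single badly mislabeled point cannot dominate the objective, which is the intuition behind the robustness to label noise mentioned above.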
Since we are mainly interested in convex loss function, having similar features of the loss functions used in logistic regression and SVM, we restrict the extended logistic loss function (13) by the following condition.
[TABLE]
Now, let us simplify the notation of the extended logistic regression function with the extended-valued function by
[TABLE]
Here is a model parameter and is a margin parameter. As observed in Figure 1 and 2, the search space of the two parameters is significantly reduced and thus they are not a big burden while running the cross-validation. The only concern with (16) is that the domain depends on (see Table I). This is definitely a barrier for various applications in machine learning. Interestingly, however, the domain dependency problem of (16) can be easily avoided by using the Perceptron loss function. We call the Perceptron-augmented loss function of (16) the Logitron loss function and the corresponding minimization model for classification the Logitron. The details are as follows.
Definition III.4 (Logitron).
Let be the given training dataset. Here , , and . Also, we set . Then we have the Logitron model:
[TABLE]
where is an appropriate function space such as (1) and is the Logitron loss function defined by
[TABLE]
where is the Perceptron loss function in (2) and
Since the Perceptron loss is added to the extended logistic loss function, we have . That is, the domain of the Logitron is the entire real line. Moreover, the Logitron loss is twice continuously differentiable on its entire domain under a mild condition. See also Figure 1 and 2 for the graph of the Logitron loss and its gradient.
Theorem III.5.
Let . Then the Logitron loss function (18) is convex and continuous for all . When , it is continuously differentiable. Moreover, if then it is twice continuously differentiable.
Proof.
When , we get the logistic loss in (4). Thus, for all and is infinitely differentiable on . Let us consider .
Firstly, for the continuity of , we only need to show that
[TABLE]
where and .
- : We have and . Thus, . Therefore, we get Additionally, since is strictly increasing and convex (Proposition II.3 (1) and (2)), we have for all . Therefore, since is strictly increasing and , we have . Additionally, for all , we have from .
- : We have and . Thus . Since , we need to be cautious on the boundary point. From the extended-valued real number system, we have and thus
[TABLE]
Note that it is easy to check that for all and for all .
Secondly, we show continuous differentiability of the Logitron loss on its entire domain .
- : and . By simple calculation, we have
[TABLE]
where Since , as , we get and . Therefore, is well defined for all .
- : and . Since , we have
[TABLE]
where for all . On the other hand, since at , we have .
Thirdly, for twice continuous differentiability of the Logitron loss, let us take the second derivative of . Then, and
[TABLE]
Let us consider the case and .
- : From , we get
[TABLE]
- : From and , we have Thus, .
Additionally, it is trivial that . Finally, and , we get the continuous second derivative of the Logitron loss function
[TABLE]
By Theorem III.5, the Logitron loss function can be used as a classification loss function for all and . In fact, it is classification-calibrated [4].
Corollary III.6.
For all and , the Logitron loss function is classification-calibrated [4].
Proof.
From Theorem III.5, is convex and differentiable at for all .
[TABLE]
Therefore, we have for all . Hence is classification-calibrated, irrespective of the choice of .
Additionally, the Logitron loss function (18) is sufficiently smooth. That is, the gradient of it is continuous on its entire domain and bounded by one. Therefore, we could use any gradient-based optimization method such as L-BFGS [29].
Corollary III.7.
For all and , the Logitron loss function is Lipschitz continuous with Lipschitz constant one for all . That is, we have
[TABLE]
for all .
Proof.
Let . From (19) and (20), we have . Moreover, when , we get . Here is a subgradient of .
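A loss that is convex, differentiable, and has a slope bounded by one is convenient for first-order optimization. The sketch below illustrates this on a toy linear classifier, with two explicit substitutions: the classic logistic loss (the special case recovered by the extended logistic loss) stands in for the general Logitron loss, and plain gradient descent stands in for L-BFGS. The data, the function names, the step size, and the regularization weight are all illustrative assumptions.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic sigmoid."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_loss(margin: float) -> float:
    """Stable evaluation of log(1 + exp(-margin))."""
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def score(w, b, x):
    """Linear predictor <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def objective(w, b, xs, ys, lam):
    """Mean logistic loss plus (lam / 2) * ||w||^2."""
    data = sum(logistic_loss(y * score(w, b, x)) for x, y in zip(xs, ys)) / len(xs)
    return data + 0.5 * lam * sum(wi * wi for wi in w)


def fit(xs, ys, lam=0.01, step=0.5, iters=200):
    """Plain gradient descent on the regularized objective."""
    w, b, n = [0.0] * len(xs[0]), 0.0, len(xs)
    for _ in range(iters):
        gw, gb = [lam * wi for wi in w], 0.0
        for x, y in zip(xs, ys):
            slope = -sigmoid(-y * score(w, b, x))  # loss derivative, magnitude below one
            gw = [g + slope * y * xi / n for g, xi in zip(gw, x)]
            gb += slope * y / n
        w = [wi - step * g for wi, g in zip(w, gw)]
        b -= step * gb
    return w, b


XS = [(2.0, 1.0), (1.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
YS = [1, 1, -1, -1]
W, B = fit(XS, YS)
print(W, B, objective(W, B, XS, YS, 0.01))
```

Since each per-sample slope has magnitude below one, the gradient norm is controlled by the feature norms and the regularization weight, which is what allows a fixed step size in this toy setting.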
Before we go further, it is worth mentioning the (un)hinged loss function. The extended logistic loss function with has an unconventional hinge loss function, known as the unhinged loss function [34]. In fact, the extended logistic loss under the domains in Definition II.1 and II.2 becomes where . The main advantage of this unhinged loss function is that it is robust to symmetric label noise. In fact, as observed in [34], if a convex function is lower bounded, then it is not robust to symmetric label noise. However, the Logitron with is the hinge loss function, which can be reformulated with the first-order hinge loss function in (3):
[TABLE]
The Logitron with can be regarded as the smoothed hinge loss when we set . Actually, the $k$-th order hinge loss in (3) with an additional $k$-th root stabilizer function is a special case of the Logitron with and . Also, as observed in Figure 1 and 2, the Logitron loss function with behaves like the logistic loss function when . Therefore, it is natural to separate the Logitron into two categories: one is the hinge-like Logitron loss function and the other is the logistic-like Logitron loss function. In the coming Section IV, we analyze the Logitron model from these two points of view.
IV The Low complexity Logitron with
In the previous Section, we found that the Logitron has many useful properties such as smoothness and classification-calibration. However, to be more practical in terms of computation, we need to reduce the spaces of model parameter and of margin parameter . In this Section, we introduce the low complexity Logitron loss function (18) with based on higher-order hinge loss in (3). With additional restriction of and , we have two different categories of the Logitron; one is the hinge loss-like Logitron (Hinge-Logitron) and the other is the logistic loss-like Logitron (Logistic-Logitron).
Let us start with the generalized $k$-th order hinge loss function. As stated in [22], it corresponds to a basis function of the generalized $k$-th order spline.
[TABLE]
where . Interestingly, the low complexity Logitron with can be easily reformulated with the generalized -th order hinge loss in (23). In fact, let us modify the extended exponential function with :
[TABLE]
where . Then we have a connection between (23) and (24):
[TABLE]
If we set and then becomes the generalized $k$-th order hinge loss function with an additional margin control parameter . Now, let us reformulate the Logitron with the modified extended exponential function (24).
Theorem IV.1.
Let and . Then the low complexity Logitron can be reformulated with  (24):
[TABLE]
where
[TABLE]
Proof.
When , we get the first-order hinge loss in (3). Now, let us assume that , then we have . In this region, it is easy to see . Also, when , we get and, from the definition of the Logitron loss in (18), . Now let us assume that . For all , it is easy to check . On the other hand, let , then and thus we have and . Here, with an additional function, we have Note that if then . Therefore, when (i.e., ), we have
Note that, in Theorem IV.1, though we have restricted the range of for practical reasons, it can be extended to . When we set , the Logitron loss is similar to the generalized $k$-th order hinge loss function (23). However, if we set then the extended exp and log approximate the conventional exp and log functions. In particular, when , the Logitron loss function is almost equal to the logistic loss function. See Figure 1 and 2. As a consequence, we have four different categories of the Logitron loss function based on the model parameter and the margin parameter .
- Hinge-Logitron (): H-Logitron ( and ) and H+Logitron ( and )
- Logistic-Logitron (): L-Logitron ( and ) and L+Logitron ( and )
Since the parameter of the generalized $k$-th order hinge-loss function (23) can be negative, the classic margin concept also needs to be generalized. We call the classic margin the positive margin if the loss function touches the Perceptron loss on the positive axis. On the other hand, if the loss function touches the Perceptron loss on the negative axis, then we call that touch point the negative margin. Actually, the positive margin () and negative margin () equal the value of (i.e., ). Therefore, since the logistic regression does not touch the Perceptron loss, it does not have a margin. In Hinge-Logitron, the H-Logitron loss function has a positive margin like the higher-order hinge loss (3). Figure 1 (c) compares the H-Logitron with the first-order hinge loss (SVM) and the second-order hinge loss (L2SVM). However, the H+Logitron loss function has a negative margin through the Perceptron line (i.e., ). See Figure 2 (c) for the shape of the H+Logitron with various choices of and . As regards the Logistic-Logitron model, we have the L-Logitron loss function approximating the logistic loss with positive margin and the L+Logitron loss function approximating the logistic loss with negative margin. See Figure 1 (d) and Figure 2 (d), respectively.
It is useful to see the direct connection between the higher-order hinge loss in (3) and the Logitron loss function with . Here . Then (25) is simplified as with and . Now, when (i.e., ), we have H-Logitron ()
[TABLE]
It actually means that the H-Logitron with and is a higher-order SVM with an additional $k$-th root stabilizer function. As observed in Figure 1 (c), the second-order hinge loss (L2SVM) heavily penalizes misclassified data. On the other hand, the penalty of the H-Logitron on misclassified data is stabilized, irrespective of the choice of .
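The stabilizing effect of a root applied on top of a power of the hinge term can be illustrated numerically. The sketch below assumes a unit margin and the hypothetical form $(1 + h^k)^{1/k} - 1$ with $h = \max(0, 1 - x)$; the margin parameter and any constants of the H-Logitron are omitted, so this shows the idea of root stabilization rather than the exact loss derived above.

```python
def hinge_power(margin: float, k: int) -> float:
    """k-th power of the hinge term max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin) ** k


def stabilized_hinge(margin: float, k: int) -> float:
    """Assumed stabilized form: k-th root applied to 1 + hinge^k, shifted to vanish at zero hinge."""
    return (1.0 + hinge_power(margin, k)) ** (1.0 / k) - 1.0


# For badly misclassified points the squared hinge grows quadratically,
# while the stabilized version grows roughly linearly in the hinge term.
for m in (-9.0, -1.0, 0.5, 1.5):
    print(m, hinge_power(m, 2), stabilized_hinge(m, 2), stabilized_hinge(m, 4))
```

In this sketch the stabilized loss is zero beyond the margin, like the hinge loss, but a point at margin $-9$ costs about 9 instead of 100 under the squared hinge.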
When (i.e., ), we get a totally new classification model, H+Logitron.
[TABLE]
where and
[TABLE]
In this instance, we do not have a positive margin. That is, for all and for all By controlling (i.e., the negative margin), we control the closeness of the H+Logitron to the Perceptron loss function . Although the H+Logitron does not have the classic margin, i.e., the positive margin, the simple structure of the model makes it worth investigating in more detail. For instance, let then we have
[TABLE]
where . Interestingly, this removes the singularity that exists on the boundary of the domain of the extended exponential function. Moreover, as noted in Theorem III.5, the H+Logitron with has a continuous derivative, . The main advantage of (30) is that evaluating the loss function and its gradient requires only division and multiplication. As observed in Section V, its performance is comparable to that of logistic regression and SVM. Note that, when , the H+Logitron can be reformulated as
[TABLE]
This model is more complicated. However, it is still smooth on the entire domain and classification-calibrated. In fact, when , and, when , . It thus behaves like a conventional margin-based loss function.
V Experiments with various -regularized Logitron models
This section compares the performance of the proposed Logitron with that of logistic regression and SVM within the linear classification framework.
Let us define the Logitron minimization problem with the linear function space in (1). For simplicity, we use the -regularizer, but it could be replaced with a more sophisticated regularization model.
[TABLE]
where is the -regularizer, and . Although the loss function is rather complicated, it has many properties that are useful for gradient-based optimization. Indeed, the loss function is convex and differentiable on , irrespective of the choice of . For simplicity, we use the L-BFGS algorithm in minFunc [29], which is implemented in MATLAB. We use the well-known LIBLINEAR package [18] as the benchmark for the proposed Logitron model. Among the various linear classification models in LIBLINEAR, we select three typical ones: logistic regression (4) and the higher-order SVM (3), i.e., the first-order SVM and the second-order SVM (L2SVM). For logistic regression, we use the primal formulation (). For SVM, we use the dual formulation (). For L2SVM, we use the primal formulation (). We also use the bias term in LIBLINEAR (). Note that all models have an -regularization term. For the regularization parameter , we simply use the following parameter selection strategy for , as recommended in LIBSVM [11].
[TABLE]
In the LIBLINEAR models, the regularization parameter is placed on the loss function, and thus we use of (33) as their regularization parameter.
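Because the objective is convex and differentiable, any gradient-based solver applies. The sketch below illustrates the overall shape of such a regularized minimization under several loudly stated assumptions: the ordinary logistic loss stands in for the Logitron loss, plain gradient descent stands in for the L-BFGS routine of minFunc, the regularizer is taken to be (lam / 2) * ||w||^2 with an unregularized bias, and the four training points are made up. None of the function names below come from the paper or from minFunc.

```python
import math

def logistic_loss(z):
    # Stable evaluation of log(1 + exp(-z)); stand-in for the Logitron loss.
    return max(0.0, -z) + math.log1p(math.exp(-abs(z)))

def logistic_grad(z):
    # Derivative of log(1 + exp(-z)) with respect to z, i.e. -1 / (1 + exp(z)).
    if z >= 0:
        e = math.exp(-z)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(z))

def signed_score(w, b, x, y):
    # y * (w . x + b): positive when the point is classified correctly.
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def objective(w, b, X, Y, lam):
    data_term = sum(logistic_loss(signed_score(w, b, x, y)) for x, y in zip(X, Y)) / len(X)
    return 0.5 * lam * sum(wj * wj for wj in w) + data_term

def fit(X, Y, lam=0.1, step=0.5, iters=500):
    # Plain gradient descent (an assumption; the paper uses L-BFGS).
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0                      # the bias is not regularized here
        for x, y in zip(X, Y):
            g = logistic_grad(signed_score(w, b, x, y)) * y / n
            grad_w = [gw + g * xj for gw, xj in zip(grad_w, x)]
            grad_b += g
        w = [wj - step * gw for wj, gw in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

# Made-up, linearly separable toy data with labels in {+1, -1}.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
Y = [1, 1, -1, -1]
w, b = fit(X, Y)
print(objective(w, b, X, Y, 0.1), [signed_score(w, b, x, y) > 0 for x, y in zip(X, Y)])
```

Swapping in another convex, differentiable loss only requires replacing `logistic_loss` and `logistic_grad`; the rest of the loop is unchanged.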
For the Logitron, we need to select not only the regularization parameter but also the model parameter and the margin parameter . From the analysis in Section IV, we know that the Logitron has four different submodels (H-Logitron, H+Logitron, L-Logitron, and L+Logitron). The H-Logitron is the higher-order SVM with an additional stabilization function (28). For simplicity, we only consider the th - th order SVM with the corresponding -th root function. In the category of the H+Logitron, we have two sub-models: the H+Logitron with  (30) and the H+Logitron with  (31). The minimization problem of the H+Logitron with can be solved using only elementary arithmetic such as division and multiplication. In total, we have nine sub-models: H-1 ( with i.e., ), H-2 (, i.e., ), H-3 (, i.e., ), H-4 (, i.e., ), H+1 ( with ), H+2 (), H+3 (), L- (), and L+ (). Based on the analysis in Section IV, except for H-1 and H+1, the model parameter of every sub-Logitron model is in the category . We summarize the parameter space of each sub-Logitron model in Table II. Four-fold cross-validation [15] is used to select the optimal parameters of the nine sub-Logitron models and the three LIBLINEAR models. Because the cross-validation runs are independent of one another, they are easy to implement on parallel processing machines.
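The parameter selection described above can be sketched as a generic k-fold grid search. This is an illustration under assumptions rather than the authors' MATLAB code: the function names, the contiguous fold layout (which presumes the training data has already been shuffled), and the toy scoring function are all hypothetical.

```python
def kfold_indices(n, k=4):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(params, n, score_fn, k=4):
    """Average validation score of one parameter setting over the k folds."""
    folds = kfold_indices(n, k)
    scores = []
    for f, val_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        scores.append(score_fn(params, train_idx, val_idx))
    return sum(scores) / k

def grid_search(grid, n, score_fn, k=4):
    """Return the parameter setting with the best average validation score."""
    return max(grid, key=lambda p: cross_validate(p, n, score_fn, k))

# Toy stand-in for "train on train_idx, return accuracy on val_idx".
def toy_score(params, train_idx, val_idx):
    return 1.0 - abs(params["lam"] - 0.1)

grid = [{"lam": lam} for lam in (0.001, 0.01, 0.1, 1.0)]
best = grid_search(grid, n=10, score_fn=toy_score)
print(best)
```

Each call to `cross_validate` is independent of the others, which is what makes the search straightforward to distribute over several workers.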
As benchmark data, we use the well-organized datasets in [15] to report the performance of the nine sub-Logitron models. They are pre-processed and normalized in each feature dimension to mean zero and variance one. The raw data mostly come from the UCI machine learning repository. Note that, as commented in [40], we reorganize the datasets in [15]. First, each dataset is separated into non-overlapping training and testing sets. Each training set is randomly shuffled for -fold cross-validation. Among the datasets in [15], we use datasets after removing those that are ambiguous in terms of the data splitting strategy. In the Appendix, we list all information about the datasets, such as the number of instances, number of training data, number of test data, feature dimension, and number of classes. See Table V for more details. Last but not least, for multi-class datasets, we use the one-vs-all strategy, the most commonly used strategy for multi-class classification based on a binary classifier. This strategy is also used in LIBLINEAR [18].
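The one-vs-all strategy mentioned above can be sketched as follows. This is a minimal illustration, assuming one linear model (w, b) per class and prediction by the largest decision value; the function names and the hand-picked weights are hypothetical and are not taken from the paper or from LIBLINEAR.

```python
def one_vs_all_labels(labels, positive_class):
    """Relabel a multi-class problem as +1 (positive_class) versus -1 (all others)."""
    return [1 if c == positive_class else -1 for c in labels]

def decision_value(model, x):
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def predict_one_vs_all(models, x):
    """Pick the class whose binary classifier gives the largest decision value."""
    return max(models, key=lambda c: decision_value(models[c], x))

labels = ["a", "b", "c", "a"]
print(one_vs_all_labels(labels, "a"))

# Hand-picked (w, b) pairs, standing in for three trained binary classifiers.
models = {
    "a": ([1.0, 0.0], 0.0),
    "b": ([0.0, 1.0], 0.0),
    "c": ([-1.0, -1.0], 0.0),
}
print(predict_one_vs_all(models, [2.0, 0.5]))
print(predict_one_vs_all(models, [-1.0, -3.0]))
```

In practice each (w, b) pair would come from training a binary classifier on the relabeled data returned by `one_vs_all_labels`.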
All experiments are run five times, and the averaged test score for each dataset is reported in Table VI and Table VII in the Appendix. In each experiment, the best parameters are chosen through -fold cross-validation. With the chosen parameters, we minimize (32) over the whole training data in Table V to find the hyperplane, i.e., . We then evaluate the performance of each classification model on the test data in Table V. For more details on CV-based minimization, see [11]. All numerical results are summarized in Table III. In terms of classification accuracy, the H-Logitron H-4 is the best classification model, and the L-Logitron L- obtains the best Friedman ranking [15]. The H-Logitron submodels (H-2, H-3, and H-4) are -th order SVMs () with the corresponding -th root stabilization functions. In this category, performance improves as the order of the model increases. Interestingly, H-2 (the second-order SVM with a root stabilization function) outperforms the classic second-order SVM, i.e., L2SVM [18]. An H+Logitron submodel, namely the cheapest classification model H+2 with (30), also shows performance comparable to classic logistic regression.
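That one model can lead in mean accuracy while another leads in Friedman ranking follows from how the ranking is computed. The sketch below assumes the common definition, in which classifiers are ranked within each dataset (rank 1 for the highest accuracy, tied classifiers sharing the average rank) and the ranks are then averaged over datasets. The accuracies are made-up numbers, not results from the paper.

```python
def ranks_with_ties(scores):
    """Rank scores so the highest gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = average_rank
        pos = end + 1
    return ranks

def friedman_average_ranks(acc):
    """acc maps classifier name -> list of accuracies, one entry per dataset."""
    names = list(acc)
    n_datasets = len(acc[names[0]])
    totals = {m: 0.0 for m in names}
    for d in range(n_datasets):
        for m, r in zip(names, ranks_with_ties([acc[m][d] for m in names])):
            totals[m] += r
    return {m: totals[m] / n_datasets for m in names}

# Made-up accuracies on three datasets.
acc = {"A": [0.99, 0.70, 0.70], "B": [0.75, 0.75, 0.75]}
mean_acc = {m: sum(v) / len(v) for m, v in acc.items()}
avg_rank = friedman_average_ranks(acc)
print(mean_acc, avg_rank)
```

Here classifier A has the higher mean accuracy because of one very good dataset, while B wins on two of the three datasets and therefore has the better (lower) average rank.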
Table IV presents the datasets in the Best- set of each classifier in terms of the relative classification accuracy (racc). Here, racc is the accuracy of each classifier minus the accuracy of the virtual DWN in [15]. The virtual DWN classifier is, for each individual dataset, the most accurate of the classifiers, which include boosting, neural networks, and random forests. That is, it is not a specific classifier that exists in the real world but an idealized virtual classifier. Although the function space of the Logitron is linear, the proposed Logitron model interestingly outperforms the optimal DWN classifier on some datasets, such as "hill-valley", "acute-inflammation", "acute-nephritis", "heart-hungarian", and "credit-approval".
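The racc computation is simple enough to state in a few lines. The following sketch assumes that the virtual classifier's accuracy on a dataset is the maximum accuracy over a pool of reference classifiers; the classifier names, dataset names, and accuracies (in percent) are invented for illustration.

```python
def virtual_best(reference_acc):
    """Per-dataset best accuracy over all reference classifiers."""
    datasets = next(iter(reference_acc.values())).keys()
    return {d: max(acc[d] for acc in reference_acc.values()) for d in datasets}

def relative_accuracy(model_acc, reference_acc):
    """racc = model accuracy minus the virtual best accuracy, per dataset."""
    best = virtual_best(reference_acc)
    return {d: model_acc[d] - best[d] for d in model_acc}

# Invented accuracies (percent) for two reference classifiers and one model.
reference_acc = {
    "forest":  {"data1": 90.0, "data2": 70.0},
    "network": {"data1": 85.0, "data2": 75.0},
}
model_acc = {"data1": 88.0, "data2": 80.0}

racc = relative_accuracy(model_acc, reference_acc)
print(racc)
```

A positive racc, as on the second invented dataset, means the model beats even the per-dataset best of the reference pool.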
In Figures 4, 5, and 6, we summarize statistical information on the parameters , , and (or ), which are selected via -fold cross-validation on the training data in Table V. Since we ran all experiments five times, the histograms are generated from samples. They are normalized so that they can be interpreted as probabilities. For each model, we plot histograms for two groups of datasets: Best- (left) and the remainder (right). Figure 4 shows the normalized histogram of the with respect to . The regularization parameters of all Logitron sub-models for the Best- set are mainly located near or . Note that the Logitron is not jointly convex with respect to , , and . Thus, there are many local minima in the cross-validation-based selection of the regularization parameter. Because of this inherent ambiguity, there are many candidates for the best regularization parameter. Therefore, when the training accuracies are tied, we simply select the smaller regularization parameter. As a result of this selection process, we have a relatively high frequency at . Figure 5 visualizes for the various Logitron sub-models. The Logitron with (i.e., H-1 and L-) in Best- prefers a smaller value of than in the remainder set. Figure 6 shows the preference of the margin parameter in the Logitron submodels H-2, H-3, H-4, H+2, and H+3. Overall, H-3, H-4, and H+3 in Best- prefer to the remainder set.
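The tie-breaking rule and the normalized histograms described above can be sketched as follows. The function names and the numbers are hypothetical; the only behavior taken from the text is that ties in accuracy are resolved in favor of the smaller regularization parameter and that histogram counts are divided by the number of samples.

```python
from collections import Counter

def pick_regularizer(cv_scores):
    """cv_scores maps a regularization value to its score; ties go to the smaller value."""
    best = max(cv_scores.values())
    return min(lam for lam, score in cv_scores.items() if score == best)

def normalized_histogram(selected):
    """Relative frequency of each selected value, so that the frequencies sum to one."""
    counts = Counter(selected)
    return {value: counts[value] / len(selected) for value in sorted(counts)}

# Invented scores: the two smallest values tie for the best accuracy.
chosen = pick_regularizer({0.01: 0.9, 0.1: 0.9, 1.0: 0.8})
print(chosen)

# Invented selections collected over repeated runs.
hist = normalized_histogram([0.01, 0.01, 0.1, 1.0])
print(hist)
```

With this rule, repeated ties push the selected values toward the lower end of the grid, which is consistent with a histogram peak at a small regularization value.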
VI Conclusion
In this article, we have introduced a general convex classification framework, the Logitron, whose loss function smoothly stitches an extended logistic loss function to the classic Perceptron loss function. The proposed Logitron has several useful features. A typical one is that it is differentiable on the whole real line for all . Therefore, conventional optimization algorithms are easy to apply. Depending on the choice of the parameters, we have two different categories of models: the Hinge-Logitron model () and the Logistic-Logitron model (). The Hinge-Logitron model H-4 (the fourth-order SVM with an additional fourth root function) outperforms the other sub-Logitron models and the models in LIBLINEAR [18] in terms of classification accuracy. Additionally, the simple classification model H+2 shows reasonable performance compared to classic logistic regression. The Logistic-Logitron model L- shows the best performance in terms of Friedman ranking.
Acknowledgments
This work is supported by the Basic Science Program through the NRF of Korea funded by the Ministry of Education (NRF-2015R101A1A01061261). The Logitron is built on the machine learning MATLAB package available at https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html.
References
- [1] S. Amari and H. Nagaoka, Methods of Information Geometry, AMS, 2000.
- [2] S. Amari, Information Geometry and Its Applications, Springer, 2016.
- [3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences", J. of Mach. Learn. Res., 6 (2005), pp. 1705-1749.
- [4] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds", J. of the American Stat. Association: Theory and Meth., 101 (2006), pp. 138-156.
- [5] Y. Bengio, Y. LeCun, and G. Hinton, "Deep learning", Nature, 521 (2015), pp. 436-444.
- [6] L. Bottou and O. Bousquet, "The tradeoffs of large scale learning", NIPS (2008).
- [7] S. Boucheron, O. Bousquet, and G. Lugosi, "Theory of classification: A survey of some recent advances", ESAIM: Prob. and Stat., 9 (2005), pp. 323-375.
- [8] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and regression trees", Wadsworth and Brooks/Cole Advanced Books and Software.
