Logitron: Perceptron-augmented classification model based on an extended logistic loss function

Hyenkyun Woo

arXiv:1904.02958·cs.LG·April 8, 2019


TL;DR

Logitron introduces a novel classification framework combining extended logistic and Perceptron losses, connecting SVM and logistic regression, with flexible parameterization that improves classification accuracy.

Contribution

This work proposes the Logitron model, a new convex classification method that unifies SVM and logistic regression through a parameterized extended logistic loss.

Findings

01. Hinge-Logitron with k=4 outperforms logistic regression and SVM in accuracy.

02. Even with k=-1, Hinge-Logitron maintains classification calibration and efficiency.

03. The model demonstrates low computational cost and flexible loss function design.


Equations

$${\cal F}=\{\langle w,\,x\rangle+b\;|\;w\in{\cal W}(n),\;b\in{\mathbb{R}},\hbox{ and }x\in{\cal X}\}$$

$$\min_{f\in{\cal F}}\sum_{i=1}^{N}\ell_{P}(y_{i}f(x_{i}))$$

$$\min_{f\in{\cal F}}\sum_{i=1}^{N}\ell_{H,k}(y_{i}f(x_{i}))$$

$$\min_{f\in{\cal F}}\sum_{i=1}^{N}\ell_{L}(y_{i}f(x_{i}))$$

$$\hbox{dom}(h)=\{z\in{\mathbb{R}}\;|\;h(z)\not=\emptyset\}.$$

$$h^{e}(z)=\left\{\begin{array}{ll}h(z) & z\in\Omega\\ +\infty & z\not\in\Omega\end{array}\right.$$

$$\ln_{\alpha,c}(u)=\int_{c}^{u}x^{-\alpha}dx$$

$$\ln_{\alpha,c}(u)=\left\{\begin{array}{ll}\ln\left(\frac{u}{c}\right), & \hbox{if }\alpha=1\\ \frac{1}{1-\alpha}(u^{1-\alpha}-c^{1-\alpha}), & \hbox{otherwise}\end{array}\right.$$

$$\exp_{\alpha,c}(v)=y$$

$$v=\int_{c}^{y}x^{-\alpha}dx$$

$$\exp_{\alpha,c}(v)=\left\{\begin{array}{ll}c\exp(v), & \hbox{if }\alpha=1\\ (c^{1-\alpha}+(1-\alpha)v)^{1/(1-\alpha)}, & \hbox{otherwise}\end{array}\right.$$

$$\ln_{\alpha,c}:\hbox{dom}(\ln_{\alpha,c})\rightarrow\hbox{dom}(\exp_{\alpha,c})$$

$$\exp_{\alpha,c}:\hbox{dom}(\exp_{\alpha,c})\rightarrow\hbox{dom}(\ln_{\alpha,c})$$

$$\hbox{ran}(\exp_{\beta,c_{2}})=\hbox{dom}(\ln_{\alpha,c_{1}})$$

$$\hbox{ran}(\exp_{\alpha,c})=\hbox{dom}(\ln_{\alpha,c})$$

$$\hbox{ran}(\exp_{\beta,c_{2}})=\hbox{dom}(\ln_{\alpha,c_{1}})$$

$$\ell_{\alpha,\beta,c}(x)=\ln_{\alpha,c}(c+\exp_{\beta,c}(-x))$$

$$\frac{d^{2}\ell_{\alpha,\beta,c}}{dp^{2}}=[\beta(c+h(p))-\alpha h(p)]\frac{h(p)^{2\beta-1}}{(c+h(p))^{\alpha+1}}\geq 0$$

$$\ell_{\alpha,\beta,c}(\lambda a+(1-\lambda)b)\leq\lambda\ell_{\alpha,\beta,c}(a)+(1-\lambda)\ell_{\alpha,\beta,c}(b)$$

$$\ell_{m+1,1,c}(z)=c_{m+1}\left(1-\frac{1}{[1+\exp(-z)]^{m}}\right)=c_{m+1}(1-\sigma^{m}(z))$$

$$\alpha=\beta\geq 0$$

$$\boxed{\ell_{\alpha,c}(x)=\left\{\begin{array}{ll}\ln_{\alpha,c}(c+\exp_{\alpha,c}(-x)) & \hbox{if }x\in\hbox{dom}(\ell_{\alpha,c})\\ +\infty & \hbox{otherwise}\end{array}\right.}$$

$$\min_{f\in{\cal F}}\sum_{i=1}^{N}L_{\alpha,c}(y_{i}f(x_{i}))$$

$$L_{\alpha,c}(z)=\left\{\begin{array}{ll}\ell_{\alpha,c}(z) & \hbox{if }z\in\hbox{dom}(\ell_{\alpha,c})\\ \ell_{P}(z) & \hbox{otherwise}\end{array}\right.$$

$$\ell_{\alpha,c}(z)\cap\ell_{P}(z)=\left\{\begin{array}{ll}\emptyset & \hbox{if }z\in int(\Omega)\\ -z\;\hbox{ or }\;0 & \hbox{if }z\in bd(\Omega)\end{array}\right.$$

$$\ell_{\alpha,c}(-c_{\alpha})=\ln_{\alpha,c}(c+\exp_{\alpha,c}(c_{\alpha}))=\ln_{\alpha,c}(c+\infty)=c_{\alpha}$$

$$L^{\prime}_{\alpha,c}(z)=\left\{\begin{array}{ll}\ell^{\prime}_{\alpha,c}(z) & \hbox{if }z\in int(\hbox{dom}(\ell_{\alpha,c}))\\ 0 & \hbox{if }z\in{\mathbb{R}}\setminus\hbox{dom}(\ell_{\alpha,c})\end{array}\right.$$

$$L^{\prime}_{\alpha,c}(z)=\left\{\begin{array}{ll}\ell^{\prime}_{\alpha,c}(z) & \hbox{if }z\in\hbox{dom}(\ell_{\alpha,c})\\ -1 & \hbox{otherwise}\end{array}\right.$$

$$\ell^{\prime\prime}_{\alpha,c}(z)=c\alpha\exp_{\alpha,c}(-z)^{\alpha-2}\left(\frac{\exp_{\alpha,c}(-z)}{c+\exp_{\alpha,c}(-z)}\right)^{\alpha+1}>0.$$

$$\ell^{\prime\prime}_{\alpha,c}(-c_{\alpha})=\frac{c\alpha\exp_{\alpha,c}^{2\alpha-1}(c_{\alpha})}{(c+\exp_{\alpha,c}(c_{\alpha}))^{\alpha+1}}=0$$


Taxonomy

Topics: Face and Expression Recognition · Computational Drug Discovery Methods · Spectroscopy and Chemometric Analyses

Methods: Support Vector Machine

Full text

Hyenkyun Woo, School of Liberal Arts, Korea University of Technology and Education

Abstract

Classification is the most important process in data analysis. However, due to the inherent non-convex and non-smooth structure of the zero-one loss function of the classification model, various convex surrogate loss functions such as the hinge loss, squared hinge loss, logistic loss, and exponential loss have been introduced. These loss functions have been used for decades in diverse classification models, such as SVM (support vector machine) with the hinge loss, logistic regression with the logistic loss, and Adaboost with the exponential loss. In this work, we present a Perceptron-augmented convex classification framework, Logitron. Its loss function is a smooth stitching of the extended logistic loss and the well-known Perceptron loss function. The extended logistic loss function is a parameterized function built on the extended logarithmic function and the extended exponential function. The main advantage of the proposed Logitron classification model is that it shows the connection between SVM and logistic regression via a polynomial parameterization of the loss function. In more detail, depending on the choice of parameters, we have the Hinge-Logitron, which has the generalized $k$-th order hinge loss with an additional $k$-th root stabilization function, and the Logistic-Logitron, which has a logistic-like loss function with relatively large $|k|$. Interestingly, even with $k=-1$, the Hinge-Logitron satisfies the classification-calibration condition and shows reasonable classification performance at low computational cost. Numerical experiments in the linear classifier framework demonstrate that the Hinge-Logitron with $k=4$ (the fourth-order SVM with the fourth-root stabilization function) outperforms logistic regression, SVM, and the other Logitron models in terms of classification accuracy.

Index Terms:

Extended exponential function, extended logarithmic function, logistic regression, extended logistic regression, sigmoid, extended sigmoid function, hinge loss, higher-order hinge loss, support vector machine, Perceptron

I Introduction

Learning a decision boundary for the classification of data observed in the real world is a fundamental and important process in machine learning [31, 36], and thus various classification models have been introduced during the last several decades; for instance, logistic regression [14], SVM (support vector machine) [39], decision trees [8], random forests [9], neural networks [35, 5], and boosting [19, 21, 12]. Among these diverse classification models, logistic regression is a popular probability-based model [37]. In this work, we are mainly interested in a convex classification model, Logitron, built from the classic Perceptron loss function and the extended logistic loss function, which is not a specific loss function but a polynomially parameterized loss function based on the extended logarithmic function [42] and the extended exponential function [43]. Note that the extended logistic loss function includes many surrogate loss functions appearing in various margin-based classification models, for instance, the unhinge loss [34], exponential loss [19], logistic loss [14, 21], and sigmoid function [30] and its variant, the Savage loss [28]. Among them, the non-convex loss functions or unbounded convex loss functions, e.g., the sigmoid, Savage loss, and unhinge loss, are mainly used for robust boosting classification models. Last but not least, [16] introduced $t$-logistic regression based on the $t$-exponential family for robustness of the classification model.

Let us start with the standard binary classification model [27, 36, 39]. A formal binary classifier $g_{f}(x)$ is simply defined as $g_{f}(x)=sign(f(x))$ where $sign(f(x))=+1$ if $f(x)>0$ and $-1$ otherwise. Here $f(x):{\cal X}\rightarrow{\mathbb{R}}$ is a predictor (or score function), ${\cal X}=\{x\in{\mathbb{R}}^{n}\;|\;\|x\|_{\infty}\leq R_{\cal X}\}$ is a feature space, and $R_{\cal X}$ is a constant. Note that $f\in{\cal F}$ where ${\cal F}$ is a function space defined based on a category of classification models. For instance, when we learn a hyperplane of the feature space, we set ${\cal F}=\{\langle w,\,x\rangle+b\;|\;w\in{\cal W}(n),b\in{\mathbb{R}},x\in{\cal X}\}$ with ${\cal W}(n)=\{w\in{\mathbb{R}}^{n}\;|\;\|w\|_{\infty}<R_{\cal W}\}$, where $R_{\cal W}$ is a constant. For more advanced classification models such as ensemble learning models and (deep) neural networks, a more sophisticated function space is required. For ensemble learning models [36], i.e., boosting and bagging, ${\cal F}=\{\langle w,\,g(x)\rangle\;|\;w\in{\cal W}(N),g=(g_{1},...,g_{N})\in{\cal B}^{N}\}$ where ${\cal B}$ is a function space of so-called base (or weak) classifiers. For (deep) neural networks [44, 23, 17, 13, 5], ${\cal F}={\cal N}_{r}$ with ${\cal N}_{r}=\{\langle w,\,\sigma(f(x))\rangle\;|\;f\in{\cal N}^{N}_{r-1},w\in{\cal W}(N)\}$ and ${\cal N}_{1}=\{\langle w,\,x\rangle\;|\;w\in{\cal W}(N)\}$. Here $\sigma(x)$, which is known as an activation function, is the only nonlinear function in the neural network. A typical example is the sigmoid function [13]. Recently, the $\max$ function-based rectified linear unit (i.e., ReLU) has been used as an activation function for deep neural networks [23]. For a kernel-based learning model, which is a straightforward extension of the linear classifier, we can set ${\cal F}=\{\sum_{i=1}^{N}w_{i}k(x_{i},x)\;|\;\langle w,\,Kw\rangle\in[0,R^{2}_{\cal W}],K=[k(x_{i},x_{j})]\in{\mathbb{R}}^{N\times N},\;x_{i}\hbox{ is an observed data}\}$.
For more details on various classification models and the corresponding function spaces, see [31, 27, 44, 7] and references therein. Unless otherwise stated, in this work, we assume that ${\cal F}$ is a linear function space

$${\cal F}=\{\langle w,\,x\rangle+b\;|\;w\in{\cal W}(n),\;b\in{\mathbb{R}},\hbox{ and }x\in{\cal X}\}. \tag{1}$$

Now, the question is: from the collected training data $(x_{1},y_{1}),...,(x_{N},y_{N})\in{\cal X}\times{\cal Y}$ with ${\cal Y}=\{-1,+1\}$, how can we find the right prediction function $f$ minimizing $\hbox{Prob}(g_{f}(x)\not=y)$? A simple approach is to directly minimize the misclassification error (i.e., the zero-one loss function [32]), $\hbox{Prob}(g_{f}(x)\not=y)=\frac{1}{N}\sum_{i=1}^{N}\ell_{0/1}(y_{i}f(x_{i}))$ where $\ell_{0/1}(z)={\bf 1}(-z)$ and ${\bf 1}(\cdot)$ is an indicator function, i.e., ${\bf 1}(a)=1$ if $a>0$ and $0$ otherwise. Although the zero-one loss function $\ell_{0/1}$ is simple and easy to understand, it is non-differentiable and non-convex. Finding a global optimum of it is a typical NP-hard problem [32]. Instead of using the bilevel zero-one loss function, we can consider convex relaxations of it. For instance, we have the classic Perceptron loss function $\ell_{P}(z)=\max(0,-z)$ and the corresponding minimization problem (i.e., Perceptron [35]):

$$\min_{f\in{\cal F}}\sum_{i=1}^{N}\ell_{P}(y_{i}f(x_{i})) \tag{2}$$

where $\ell_{P}(yf(x))=|f(x)|$ is linearly penalized with respect to $f(x)$ only if $g_{f}(x)\not=y$. In fact, it is easy to find a solution of the Perceptron model (2) with the subgradient-based method known as the Perceptron algorithm. The main concern with (2) is that it is sensitive to noise (or data) near the decision boundary, i.e., $(x,y)\in D(\varepsilon)=\{(x,y)\in{\cal X}\times{\cal Y}\;|\;|yf(x)|\leq\varepsilon\}.$ In other words, (2) does not have sufficient margin. To remedy this insufficient margin, we can consider the higher-order SVM [39, 4, 24]:

$$\min_{f\in{\cal F}}\sum_{i=1}^{N}\ell_{H,k}(y_{i}f(x_{i})) \tag{3}$$

where $k\in{\mathbb{N}}$ and $\ell_{H,k}(y_{i}f(x_{i}))=(\max(0,1-y_{i}f(x_{i})))^{k}$ is the higher-order hinge-loss function. In particular, when $k=1$, (3) is the classic SVM, known as the max-margin classifier, with the first-order hinge loss function [39], and when $k=2$, it is known as L2SVM (or squared SVM) [18]. Recently, the third-order hinge loss function $\ell_{H,3}(z)$ was introduced as an activation function for deep neural networks [24]. To the best of the author's knowledge, the $k$-th order hinge loss function $\ell_{H,k}(z)$ with $k\geq 4$ has not been introduced in the literature. In this work, we study the stabilized $k$-th order SVM with arbitrary $k\in\mathbb{N}$ within the proposed Logitron framework.
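The three losses discussed so far can be compared on a few margin values $z=yf(x)$. The following is a minimal sketch, not code from the paper; the function names `zero_one_loss`, `perceptron_loss`, and `hinge_loss` are chosen here for illustration only.

```python
def zero_one_loss(z):
    """Zero-one loss 1(-z): 1 if the margin z = y * f(x) is negative, else 0."""
    return 1.0 if -z > 0 else 0.0


def perceptron_loss(z):
    """Perceptron loss max(0, -z)."""
    return max(0.0, -z)


def hinge_loss(z, k=1):
    """k-th order hinge loss max(0, 1 - z) ** k for a natural number k."""
    return max(0.0, 1.0 - z) ** k


if __name__ == "__main__":
    # Columns: margin, zero-one, Perceptron, hinge with k = 1, 2, 4.
    for z in (-1.0, 0.0, 0.5, 1.0, 2.0):
        print(z, zero_one_loss(z), perceptron_loss(z),
              hinge_loss(z, 1), hinge_loss(z, 2), hinge_loss(z, 4))
```

For a correctly classified point with a small margin such as $z=0.5$, the Perceptron loss is already zero while every hinge loss is still positive, which is the margin effect described above.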

As observed in [31], the misclassification error $\hbox{Prob}(g_{f}(x)\not=y)$ can also be formulated with the sigmoid probability function $p_{f}(x)=\frac{1}{1+\exp(-f(x))}$ and the corresponding classifier $g_{f}(x)=sign(p_{f}(x)-0.5)$ . In fact, by using the negative log-likelihood of the Bernoulli distribution which has the sigmoid function $p_{f}(x)$ as the probability density function, we get the famous logistic loss function $\ell_{L}(yf(x))=\log(1+\exp(-yf(x)))$ and the corresponding logistic regression formulation:

$$\min_{f\in{\cal F}}\sum_{i=1}^{N}\ell_{L}(y_{i}f(x_{i})) \tag{4}$$

where $(x_{i},y_{i})\in{\cal X}\times\{-1,+1\}$. The main advantage of this model is that the logistic loss function $\ell_{L}$ is sufficiently smooth and its gradient is the sigmoid probability function. That is, if $y=-1$, then $\frac{d\ell_{L}(-f)}{df}=\frac{1}{1+\exp(-f(x))}=p_{f}(x)$. Although logistic regression is a typical example of a margin-based classification model, since $\ell_{L}(z)>0$ for all $z\in{\mathbb{R}}$, it is unclear how to connect this model to the SVM, the max-margin classifier.
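The gradient identity for $y=-1$ can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names and the central-difference step are chosen here and do not come from the paper.

```python
import math


def logistic_loss(z):
    """Logistic loss log(1 + exp(-z)), evaluated in an overflow-safe way."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def sigmoid(t):
    """Sigmoid probability 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def loss_slope_for_negative_label(f, step=1e-6):
    """Central-difference estimate of d/df logistic_loss(-f), the case y = -1."""
    return (logistic_loss(-(f + step)) - logistic_loss(-(f - step))) / (2.0 * step)


if __name__ == "__main__":
    for f in (-3.0, 0.0, 2.0):
        # The two printed numbers should agree up to finite-difference error.
        print(f, loss_slope_for_negative_label(f), sigmoid(f))
```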

The proposed Logitron, which has the Perceptron-augmented extended convex logistic loss function, is inherently similar to logistic regression with an additional margin control parameter. Roughly speaking, the Logitron is the generalized $q$-th order SVM with an additional stabilization $q$-th root function ($q\in{\mathbb{R}}\setminus[0,1)$). Depending on the choice of parameters, we have the Hinge-Logitron, with a hinge-like loss function and a relatively small value of $|q|$, and the Logistic-Logitron, with a logistic-like loss function and a relatively large value of $|q|$. In terms of the logistic regression framework, when $|q|$ is relatively large, the generalized $q$-th order SVM corresponds to the exponential function and the stabilization $q$-th root function corresponds to the logarithmic function. Interestingly, even for $q<0$, we obtain a classification model that satisfies the classification-calibrated condition [4]. In fact, when $q=-1$, the Hinge-Logitron is implementable with simple elementary mathematical operations such as division and shows reasonable classification performance. Note that the margin of the Logitron loss function is defined as the intersection point of the closure of the domain of the extended exponential function and the Perceptron loss function. When the intersection point is located on the positive real line ($q>0$), it corresponds to the classic margin. Interestingly, the Logitron loss function is sufficiently smooth on its entire domain ${\mathbb{R}}$ under a mild restriction on the parameter, and therefore we can easily use conventional gradient-based optimization methods to find a solution of the Logitron model.

As regards the numerical experiments, for the multi-class classification problem, we have used the OVA (one-vs-all) framework. The Hinge-Logitron H-4 (i.e., the fourth-order SVM with the fourth-root stabilization function) shows the best performance in learning hyperplanes (1). Compared to the conventional second-order SVM, known as L2SVM [18], the proposed Hinge-Logitron H-2 (i.e., the second-order SVM with the root stabilizer function) shows better performance in terms of classification accuracy. The Logistic-Logitron L- (i.e., a group of the Logitron models with $q=5,6,8,12$) shows the best performance with respect to the Friedman ranking [15]. As a by-product of the generalization to the negative region of $q$, we obtain a new classification-calibrated classification model. This new classification model also shows better performance than the conventional logistic regression and SVM in terms of classification accuracy.

I-A Notation

We briefly review convex functions and related useful notation such as the extended-valued function. See [33, 26, 10] for more details.

Let $h:\hbox{dom}(h)\rightarrow{\mathbb{R}}$ be a convex, lower semicontinuous, and proper function on its convex domain

$$\hbox{dom}(h)=\{z\in{\mathbb{R}}\;|\;h(z)\not=\emptyset\}. \tag{5}$$

As observed in [26], the convexity of $h$ can be extended to the whole real line ${\mathbb{R}}$ by using the extended-valued function $h^{e}:{\mathbb{R}}\rightarrow{\mathbb{R}}_{\infty}$ :

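$$h^{e}(z)=\left\{\begin{array}{ll}h(z) & z\in\Omega\\ +\infty & z\not\in\Omega\end{array}\right. \tag{6}$$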

where ${\mathbb{R}}_{\infty}={\mathbb{R}}\cup\{+\infty\}$ and $\Omega=\hbox{dom}(h)$ . Depending on applications [42], $\Omega$ can be any convex set in ${\mathbb{R}}$ . Unless otherwise stated, as suggested in [26], a convex function in this work is an extended-valued convex function (6) and, for simplicity, we will drop the superscript ’e’ in the extended-valued function $h^{e}$ . In ${\mathbb{R}}_{\infty}$ (an extended-valued real number system), we introduce several arithmetic operations with $+\infty$ which are useful later. That is, $a+\infty=\infty$ for all $a\in{\mathbb{R}}$ , $1/\infty=0$ (it means $\lim_{n\rightarrow+\infty}1/n=0$ ), $1/0=+\infty$ (it means $\lim_{\epsilon\rightarrow 0_{+}}1/\epsilon=+\infty$ ), and $1=\frac{\infty+a}{\infty+b}$ (it means $\lim_{z\rightarrow c_{\alpha}}\frac{h(z)+a}{h(z)+b}=1$ with $h(c_{\alpha})=+\infty$ and $a,b\in{\mathbb{R}}$ ).

Let $\Omega$ be any convex set in ${\mathbb{R}}$. Then $int(\Omega)$ is the interior of $\Omega$ and $bd(\Omega)=cl(\Omega)\setminus int(\Omega)$ is the boundary of $\Omega$. Here $cl(\Omega)$ is the closure of $\Omega$. We also set ${\mathbb{R}}_{++}=\{z\in{\mathbb{R}}\;|\;z>0\}$, ${\mathbb{R}}_{+}=\{z\in{\mathbb{R}}\;|\;z\geq 0\}$, ${\mathbb{R}}_{\geq c_{\alpha}}=\{z\in{\mathbb{R}}\;|\;z\geq c_{\alpha}\}$, and ${\mathbb{R}}_{>c_{\alpha}}=\{z\in{\mathbb{R}}\;|\;z>c_{\alpha}\}$. The corresponding negative intervals are also defined in the same way. Note that $\mathbb{Q}$ is the set of rational numbers, $\mathbb{Z}$ is the set of integers, and $\mathbb{N}$ is the set of natural numbers. Additionally, $\hbox{dom}(h)$ is always assumed to be a convex set, irrespective of the convexity of $h$.

I-B Overview

The paper is organized as follows. In Section II, we review the extended exponential and logarithmic functions studied in [42, 43]. In Section III, we introduce the extended logistic loss function defined with the extended exp and log functions and the corresponding general classification framework, Logitron. Its loss function is a smooth stitching of the Perceptron loss and the restricted version of the extended logistic loss. In Section IV, we reinterpret the Logitron as the generalized $q$-th order SVM with the $q$-th root stabilization function. Here $q\in\mathbb{Q}\cap{\mathbb{R}}\setminus[0,1)$. In fact, L2SVM, known as the SVM with squared hinge loss, can be reformulated into the Hinge-Logitron H-2 with an additional root stabilization function. In Section V, we evaluate the performance of the proposed Logitron with more than one hundred datasets [15]. The conclusions are given in Section VI.

II Extended exponential function and extended logarithmic function

In this Section, we review the extended exponential function [43] and its inverse function, the extended logarithmic function [42]. These extended elementary functions are fundamental ingredients of the extended logistic loss function and the Logitron classification model.

Firstly, let us start with the definition of the extended logarithmic function [42]. It is a generalized logarithmic function [2, 1, 38] with an additional scaling parameter. Later, we will explain the role of this additional parameter in detail in terms of the margin of the Logitron classification model.

Definition II.1.

Let $\alpha\in{\mathbb{R}}_{+},$ $u\in\hbox{dom}(\ln_{\alpha,c})\subseteq{\mathbb{R}}.$ Then the extended logarithmic function is defined as

$$\ln_{\alpha,c}(u)=\int_{c}^{u}x^{-\alpha}dx \tag{7}$$

where $c\in{\mathbb{R}}_{c,u}=\{c\in{\mathbb{R}}\setminus\{0\}\;|\;\ln_{\alpha,c}(u)\in{\mathbb{R}}\;\hbox{ and }\;\hbox{sign}(u)=\hbox{sign}(c)\}$ . After integration, we have a simplified version of it by

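$$\ln_{\alpha,c}(u)=\left\{\begin{array}{ll}\ln\left(\frac{u}{c}\right), & \hbox{if }\alpha=1\\ \frac{1}{1-\alpha}(u^{1-\alpha}-c^{1-\alpha}), & \hbox{otherwise}\end{array}\right. \tag{8}$$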

The convexity of $\ln_{\alpha,c}$ depends on the parameters $\alpha$ and $c$. See [42] for a more detailed characterization of the domain of $\ln_{\alpha,c}$. In fact, $\hbox{dom}(\ln_{\alpha,c})$ is rather complicated. As observed in [42], the domain of $\ln_{\alpha,c}$ should be determined to meet the requirements of applications, such as the $\beta$-divergence [42] and the statistical Tweedie distribution [43]. If we set $c=1$, the extended log function $\ln_{\alpha,1}(u)=\int_{1}^{u}x^{-\alpha}dx$ becomes the generalized log function [2, 38].
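As a sanity check on Definition II.1, the closed form of $\ln_{\alpha,c}$ can be compared with a direct numerical evaluation of the integral $\int_{c}^{u}x^{-\alpha}dx$. This is a minimal sketch that assumes $u>0$ and $c>0$, which is narrower than the general definition; the names `ext_log` and `ext_log_by_quadrature` and the midpoint rule are choices made here for illustration.

```python
import math


def ext_log(u, alpha, c):
    """Closed form of the extended logarithm (assumes u > 0 and c > 0)."""
    if u <= 0 or c <= 0:
        raise ValueError("this sketch only covers u > 0 and c > 0")
    if alpha == 1.0:
        return math.log(u / c)
    return (u ** (1.0 - alpha) - c ** (1.0 - alpha)) / (1.0 - alpha)


def ext_log_by_quadrature(u, alpha, c, steps=20000):
    """Midpoint-rule approximation of the integral of x**(-alpha) from c to u."""
    width = (u - c) / steps
    total = 0.0
    for i in range(steps):
        midpoint = c + (i + 0.5) * width
        total += midpoint ** (-alpha)
    return total * width


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        # The two columns should agree up to quadrature error.
        print(alpha, ext_log(3.0, alpha, 0.5), ext_log_by_quadrature(3.0, alpha, 0.5))
```

With $c=1$ and $\alpha=1$ the sketch reduces to the ordinary natural logarithm.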

Secondly, we introduce an extended exponential function [43], the scaled version of the generalized exponential function [2, 38, 16]. Note that the scaling parameter $c$ of the extended exponential function is very important in the Logitron loss function, since it controls the margin of the classification model unlike the generalized exp function.

Definition II.2.

Let $\alpha\in{\mathbb{R}}_{+},$ $v\in\hbox{dom}(\exp_{\alpha,c})\subseteq{\mathbb{R}},$ and

$$\exp_{\alpha,c}(v)=y \tag{9}$$

where $\exp_{\alpha,c}(v)$ is defined to satisfy the following relation:

$$v=\int_{c}^{y}x^{-\alpha}dx \tag{10}$$

where $c\in{\mathbb{R}}_{c,y}=\{c\in{\mathbb{R}}\setminus\{0\}\;|\;sign(y)=sign(c)\}.$ After integration, we get a simplified version of it by

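$$\exp_{\alpha,c}(v)=\left\{\begin{array}{ll}c\exp(v), & \hbox{if }\alpha=1\\ (c^{1-\alpha}+(1-\alpha)v)^{1/(1-\alpha)}, & \hbox{otherwise}\end{array}\right. \tag{11}$$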

If we set $c=1$, the extended exponential function $\exp_{\alpha,1}(v)$ becomes the generalized exponential function in [2, 16]. The convexity of the extended exp function $\exp_{\alpha,c}$ depends on the parameters $\alpha,c$ and thus the structure of $\hbox{dom}(\exp_{\alpha,c})$ is complicated [43]. Even worse, the extended exponential function defined in Definition II.2 does not have an inverse relation with the extended log function defined in Definition II.1. Additionally, as observed in [42, 43], their domains should be carefully selected to meet various conditions related to the high-level structures. A typical example is the condition of a convex function of Legendre type [33]. With restricted domains satisfying the condition of a convex function of Legendre type, it is possible to obtain a rather complicated dual relation between the $\beta$-divergence and the Tweedie distribution [43, 42, 25, 3].

In this work, we are going to use extended exp and log functions for classification purpose only. Hence, we significantly reduce domains of them. See Table I for more details.

Now, $\exp_{\alpha,c}$ and $-\ln_{\alpha,c}$ with domains in Table I are convex and extended-valued functions. We summarize various properties of them below.

Proposition II.3.

Let $(\alpha,c)\in({\mathbb{R}}_{+}\setminus\{1\})\times{\mathbb{R}}_{++}$ . Then the extended exp function $\exp_{\alpha,c}:{\mathbb{R}}\rightarrow{\mathbb{R}}_{\infty}$ and the extended log function $-\ln_{\alpha,c}:{\mathbb{R}}\rightarrow{\mathbb{R}}_{\infty}$ have the following properties with the domains in Table I. Here $x_{\alpha}=\frac{x^{1-\alpha}}{\alpha-1}$ and $0_{+}=\lim_{\varepsilon\rightarrow 0^{+}}0+\varepsilon$ .

1. $\forall\alpha\in(0,1)$, $\exp_{\alpha,c}$ is strictly increasing and, $\forall\alpha\in{\mathbb{R}}_{+}\setminus\{1\}$, $\ln_{\alpha,c}$ is strictly increasing.

2. $\exp_{\alpha,c}$ and $-\ln_{\alpha,c}$ are convex functions on their domains.

Proof.

1) Under the domains in Table I, it is easy to see that $\ln_{\alpha,c}(x)=c_{\alpha}-x_{\alpha}$ is strictly increasing. In the case of $\exp_{\alpha,c}(x)$, since $1-\frac{x}{c_{\alpha}}>0$ for all $x\in\hbox{dom}(\exp_{\alpha,c})$, $\exp_{\alpha,c}(x)=c(1-\frac{x}{c_{\alpha}})^{\frac{1}{(1-\alpha)}}$ is strictly increasing when $\alpha\in(0,1)$.

2) Since $x_{\alpha}$ is convex for all $x\in\hbox{dom}(x_{\alpha})\cap{\mathbb{R}}_{+}$, $-\ln_{\alpha,c}(x)=x_{\alpha}-c_{\alpha}$ is a convex function on its domain in Table I. For all $x\in int(\hbox{dom}(\exp_{\alpha,c}))$, we have $\exp^{\prime\prime}_{\alpha,c}(x)>0$ and convexity can be easily extended to the boundary of the domain in Table I.

Now, we will show that the extended exponential function (11) and the extended logarithmic function (8) are well-defined (i.e., $\exp_{\alpha,c}=\ln_{\alpha,c}^{-1}$ and $\ln_{\alpha,c}=\exp_{\alpha,c}^{-1}$ ). Actually, we show the isomorphic inverse relation between (8) and (11) below under the restricted domains in Table I.

Lemma II.4.

Let $\alpha\in{\mathbb{R}}_{+}$ and $c\in{\mathbb{R}}_{++}$. Then we have the bijective mapping between the extended log (8) and the extended exp (11) functions with the restricted domains in Table I:

$$\ln_{\alpha,c}:\hbox{dom}(\ln_{\alpha,c})\rightarrow\hbox{dom}(\exp_{\alpha,c})$$

with the corresponding inverse map $\ln_{\alpha,c}^{-1}=\exp_{\alpha,c}$

$$\exp_{\alpha,c}:\hbox{dom}(\exp_{\alpha,c})\rightarrow\hbox{dom}(\ln_{\alpha,c}).$$

Note that the proof of Lemma II.4 can be easily derived from Table I and the definitions of the extended exp (11) and extended log (8). The following lemma is useful when we define the loss function for the classification model. In fact, the range of the extended exponential function always equals the domain of the extended logarithmic function, irrespective of the choice of parameters $\alpha$ and $c$.
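The inverse relation of Lemma II.4 can be illustrated with a round trip through the closed forms (8) and (11). The sketch below is hedged: it only covers positive arguments of the extended log and the region where $c^{1-\alpha}+(1-\alpha)v>0$ for the extended exp, which is a simplifying assumption made here and not necessarily the exact restricted domains of Table I. The function names are illustrative.

```python
import math


def ext_log(u, alpha, c):
    """Closed form (8) of the extended logarithm (assumes u > 0 and c > 0)."""
    if u <= 0 or c <= 0:
        raise ValueError("this sketch only covers u > 0 and c > 0")
    if alpha == 1.0:
        return math.log(u / c)
    return (u ** (1.0 - alpha) - c ** (1.0 - alpha)) / (1.0 - alpha)


def ext_exp(v, alpha, c):
    """Closed form (11) of the extended exponential (assumes c > 0)."""
    if c <= 0:
        raise ValueError("this sketch only covers c > 0")
    if alpha == 1.0:
        return c * math.exp(v)
    base = c ** (1.0 - alpha) + (1.0 - alpha) * v
    if base <= 0:
        # Assumption of this sketch: boundary and exterior points are excluded.
        raise ValueError("v lies outside the region covered by this sketch")
    return base ** (1.0 / (1.0 - alpha))


if __name__ == "__main__":
    c = 0.7
    for alpha in (0.5, 1.0, 2.0, 3.0):
        for u in (0.2, 1.0, 5.0):
            # exp(ln(u)) should return u up to rounding error.
            print(alpha, u, ext_exp(ext_log(u, alpha, c), alpha, c))
```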

Lemma II.5.

For any $\alpha,\beta\in{\mathbb{R}}_{+}$ , $c_{1},c_{2}\in{\mathbb{R}}_{++}$ , and domains in Table I, we have

$$\hbox{ran}(\exp_{\beta,c_{2}})=\hbox{dom}(\ln_{\alpha,c_{1}})$$

Proof.

Due to the isomorphic mapping in Lemma II.4 (i.e., $\ln^{-1}_{\alpha,c}=\exp_{\alpha,c}$ and $\ln_{\alpha,c}=\exp^{-1}_{\alpha,c}$ ) on domains defined in Table I, we have

$$\hbox{ran}(\exp_{\alpha,c})=\hbox{dom}(\ln_{\alpha,c})$$

As observed in Table I, the domain of $\ln_{\alpha,c}$ does not depend on the choice of $\alpha$ and $c$ . Hence, we have

$$\hbox{ran}(\exp_{\beta,c_{2}})=\hbox{dom}(\ln_{\alpha,c_{1}})$$

for any choice of $\alpha,\beta\in{\mathbb{R}}$ and $c_{1},c_{2}\in{\mathbb{R}}_{++}$ .

The independence from the parameters $\alpha$ and $\beta$ established in Lemma II.5 is very useful when we characterize the structure of the extended logistic loss function in the coming Section.

III Logitron: An extended Logistic regression classification model augmented with the Perceptron

This Section introduces a general classification framework, namely the Logitron classification model with the Perceptron-augmented extended logistic loss function.

Let us start with the extended logistic loss function, which is a simple combination of $\exp_{\alpha,c}$ and $\ln_{\alpha,c}$ in the logistic regression style. In fact, it covers many loss functions appearing in classification such as exponential loss, (extended) sigmoid function, the Savage loss function and so on.

Definition III.1.

Let $\alpha,\beta\in{\mathbb{R}}_{+}$ and $c\in{\mathbb{R}}_{++}$ . Then the extended logistic loss function $\ell_{\alpha,\beta,c}:\hbox{dom}(\ell_{\alpha,\beta,c})\rightarrow{\mathbb{R}}$ is defined as

$$\ell_{\alpha,\beta,c}(x)=\ln_{\alpha,c}(c+\exp_{\beta,c}(-x)) \tag{13}$$

where $\hbox{dom}(\ell_{\alpha,\beta,c})=\{x\in{\mathbb{R}}\;|\;-x\in\hbox{dom}(\exp_{\beta,c})\;\}$ . Note that $\hbox{dom}(\exp_{\beta,c})$ is the restricted domain in Table I.

By virtue of Lemma II.5, the extended logistic loss in (13) is well defined with the restricted domain in Table I, irrespective of the choices of $\alpha,\beta\in{\mathbb{R}}_{+}$ and $c\in{\mathbb{R}}_{++}$. The classic logistic loss (4) is recovered when we set $\alpha=\beta=1$, irrespective of the choice of the auxiliary parameter $c$. Since we do not put any constraints on $\alpha$ and $\beta$, it is unclear when the extended logistic loss $\ell_{\alpha,\beta,c}(x)$ (13) behaves like the conventional logistic loss function (4). Theorem III.2 gives a partial answer in terms of the convexity of $\ell_{\alpha,\beta,c}$ (13).
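The recovery of the classic logistic loss at $\alpha=\beta=1$ can be checked directly from the closed forms. This is a minimal sketch, with the same simplifying domain assumptions as before (positive log arguments and $c^{1-\beta}+(1-\beta)v>0$), which may differ from the exact domains in Table I; `ext_logistic_loss` is a name chosen here.

```python
import math


def ext_log(u, alpha, c):
    """Closed form (8) of the extended logarithm (assumes u > 0 and c > 0)."""
    if alpha == 1.0:
        return math.log(u / c)
    return (u ** (1.0 - alpha) - c ** (1.0 - alpha)) / (1.0 - alpha)


def ext_exp(v, beta, c):
    """Closed form (11) of the extended exponential (interior points only)."""
    if beta == 1.0:
        return c * math.exp(v)
    base = c ** (1.0 - beta) + (1.0 - beta) * v
    if base <= 0:
        raise ValueError("v lies outside the region covered by this sketch")
    return base ** (1.0 / (1.0 - beta))


def ext_logistic_loss(x, alpha, beta, c):
    """Extended logistic loss (13): ln_{alpha,c}(c + exp_{beta,c}(-x))."""
    return ext_log(c + ext_exp(-x, beta, c), alpha, c)


def logistic_loss(x):
    """Classic logistic loss log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))


if __name__ == "__main__":
    for c in (0.3, 1.0, 4.0):
        for x in (-2.0, 0.0, 3.0):
            # With alpha = beta = 1 the value should not depend on c.
            print(c, x, ext_logistic_loss(x, 1.0, 1.0, c), logistic_loss(x))
```

As a further worked case, substituting $\alpha=\beta=2$ and $c=1$ into (8) and (11) gives $\ell_{2,2,1}(x)=\frac{1}{x+2}$ for $x>-1$, so in that case the sketch only needs a division.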

Theorem III.2.

Let $\alpha,\beta\in{\mathbb{R}}_{+}$ with $\beta\geq\alpha$ and $c\in{\mathbb{R}}_{++}$ . Then the extended logistic loss function $\ell_{\alpha,\beta,c}$ (13) is convex on $\hbox{dom}(\ell_{\alpha,\beta,c})=\{x\in{\mathbb{R}}\;|\;-x\in\hbox{dom}(\exp_{\beta,c})\}$ .

Proof.

Let us assume that $p\in int(\hbox{dom}(\ell_{\alpha,\beta,c}))$ . Then $h(p)=\exp_{\beta,c}(-p)>0$ and $\frac{h(p)}{c+h(p)}\in(0,1)$ . For all $\beta\geq\alpha\geq 0$ , we have

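$$\frac{d^{2}\ell_{\alpha,\beta,c}}{dp^{2}}=[\beta(c+h(p))-\alpha h(p)]\frac{h(p)^{2\beta-1}}{(c+h(p))^{\alpha+1}}\geq 0$$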

Now, we only need to extend convexity to the boundary $\hbox{dom}(\ell_{\alpha,\beta,c})\cap bd(\hbox{dom}(\ell_{\alpha,\beta,c})).$ From Table I, we have $-c_{\beta}\in\hbox{dom}(\ell_{\alpha,\beta,c})\cap bd(\hbox{dom}(\ell_{\alpha,\beta,c}))$ when $0\leq\beta<1$ . In fact, from the convexity of $\ell_{\alpha,\beta,c}$ , we have

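$$\ell_{\alpha,\beta,c}(\lambda a+(1-\lambda)b)\leq\lambda\,\ell_{\alpha,\beta,c}(a)+(1-\lambda)\,\ell_{\alpha,\beta,c}(b),$$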

where $\lambda\in[0,1]$ , $a<b$ and $a,b\in int(\hbox{dom}(\ell_{\alpha,\beta,c})).$ By sending $b\rightarrow-c_{\beta}$ , we can easily extend convexity up to the $\hbox{dom}(\ell_{\alpha,\beta,c})$ .
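For the interior part of the proof, the sign of the second derivative can be worked out directly. The following is a sketch of that computation, assuming the derivative rules $\frac{d}{dx}\ln_{\alpha,c}(x)=x^{-\alpha}$ and $\frac{d}{dz}\exp_{\beta,c}(z)=\exp_{\beta,c}(z)^{\beta}$ (the second is consistent with the expression for $\ell^{\prime}_{\alpha,c}$ used in the proof of Theorem III.5); it is not necessarily the form in which the author writes it.

```latex
\begin{aligned}
h(p) &= \exp_{\beta,c}(-p), \qquad h'(p) = -h(p)^{\beta},\\
\ell'_{\alpha,\beta,c}(p) &= -\frac{h(p)^{\beta}}{\left(c+h(p)\right)^{\alpha}},\\
\ell''_{\alpha,\beta,c}(p) &= \frac{h(p)^{2\beta-1}}{\left(c+h(p)\right)^{\alpha}}
  \left(\beta-\alpha\,\frac{h(p)}{c+h(p)}\right)\;\geq\;0 .
\end{aligned}
```

The last inequality holds because $h(p)>0$ , $\frac{h(p)}{c+h(p)}\in(0,1)$ and $\beta\geq\alpha\geq 0$ , so that $\beta-\alpha\frac{h(p)}{c+h(p)}\geq\beta-\alpha\geq 0$ .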

Although the nonconvex extended logistic loss ( $\beta<\alpha$ ) is not the main concern of this work, it is worth mentioning. As observed in [20], a nonconvex loss has some advantages in terms of robustness against label noise. Various nonconvex loss functions have been proposed in boosting [30, 20, 32, 28], and most of them are subclasses of the extended logistic loss. In the following example, we present the higher-order sigmoid function, which is a typical example of the nonconvex extended logistic loss function (13). Such functions are known as robust loss functions in boosting [28] or as activation functions [13, 35] in (multilayer Perceptron) neural networks.

Example III.3 (higher-order sigmoid function).

Let us consider the extended logistic loss function with $\alpha=m+1$ ( $m\in\mathbb{N}$ ) and $\beta=1$ (higher-order sigmoid function):

$$\ell_{m+1,1,c}(z)=\ln_{m+1,c}\left(c+\exp_{1,c}(-z)\right)=c_{m+1}\left(1-\sigma^{m}(z)\right),$$

where $\sigma(z)=\frac{1}{1+\exp(-z)}$ is the sigmoid function and $c_{m+1}=\frac{c^{-m}}{m}$ . Note that $\ell_{m+1,1,c}(z)\in(0,c_{m+1})$ for all $z\in{\mathbb{R}}$ . In fact, the Savage loss function [28] is the second-order sigmoid function ( $\alpha=3$ ) and the activation function in the multilayer Perceptron neural network [13, 35] is the first-order sigmoid function ( $\alpha=2$ ).

•

First-order sigmoid ( $\alpha=2,\beta=1$ ): $\ell_{2,1,c}(z)=\sigma(-z)$ where $c=1$ .

•

Second-order sigmoid ( $\alpha=3,\beta=1$ ): $\ell_{3,1,c}(z)=1-\sigma^{2}(z)$ where $c=0.5^{0.5}$ , so that $\ell_{3,1,c}(z)\in(0,1)$ for all $z\in{\mathbb{R}}$ . In [28], the authors introduced $\sigma^{2}(-z)=1-\ell_{3,1,c}(-z)$ as the Savage loss function in boosting. This model is known to be more robust to label noise than other boosting models with convex loss functions such as Adaboost [19] and LogitBoost [21]. However, among the convex loss functions, LogitBoost with the logistic loss is more robust than Adaboost with the exponential loss [21].
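The two special cases listed above can be verified numerically. The sketch below assumes $\ln_{\alpha,c}(x)=\frac{x^{1-\alpha}-c^{1-\alpha}}{1-\alpha}$ and $\exp_{1,c}(z)=c\exp(z)$ (inferred forms, not quoted from the paper), under which the loss reduces to $c_{m+1}(1-\sigma^{m}(z))$ . The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def higher_order_sigmoid(z, m, c):
    """Closed form c_{m+1} * (1 - sigma(z)**m) with c_{m+1} = c**(-m) / m."""
    return (c ** (-m) / m) * (1.0 - sigmoid(z) ** m)

def ext_logistic_m(z, m, c):
    """ln_{m+1,c}(c + c * exp(-z)), i.e. alpha = m + 1 and beta = 1 (assumed forms)."""
    alpha = m + 1
    x = c + c * math.exp(-z)
    return (x ** (1 - alpha) - c ** (1 - alpha)) / (1 - alpha)

# First-order case (m = 1, c = 1) gives sigma(-z);
# second-order case (m = 2, c = sqrt(0.5)) gives 1 - sigma(z)**2.
z = 0.8
first_order = ext_logistic_m(z, 1, 1.0)
second_order = ext_logistic_m(z, 2, math.sqrt(0.5))
```

Both forms stay inside the interval $(0,c_{m+1})$ , so the loss is bounded, which is the property usually associated with robustness to label noise.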

Since we are mainly interested in convex loss function, having similar features of the loss functions used in logistic regression and SVM, we restrict the extended logistic loss function (13) by the following condition.

$$\alpha=\beta.$$

Now, let us simplify the notation of the extended logistic regression function with the extended-valued function by

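$$\ell_{\alpha,c}(x)=\begin{cases}\ln_{\alpha,c}\left(c+\exp_{\alpha,c}(-x)\right)&\hbox{if }x\in\hbox{dom}(\ell_{\alpha,c}),\\+\infty&\hbox{otherwise}.\end{cases}\qquad(16)$$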

Here $\alpha\in{\mathbb{R}}_{+}$ is a model parameter and $c\in{\mathbb{R}}_{++}$ is a margin parameter. As observed in Figure 1 and 2, the search space of the two parameters is significantly reduced, and thus they are not a big burden when running cross-validation. The only concern of (16) is that the domain $\hbox{dom}(\ell_{\alpha,c})=\{x\in{\mathbb{R}}\;|\;-x\in\hbox{dom}(\exp_{\alpha,c})\}$ depends on $c_{\alpha}$ (see Table I). This is a barrier for many applications in machine learning. Interestingly, however, the domain dependency of (16) can be easily avoided by using the Perceptron loss function. We call the Perceptron-augmented version of (16) the Logitron loss function, and the corresponding minimization model for classification the Logitron. The details are as follows.

Definition III.4 (Logitron).

Let $(x_{1},y_{1}),...,(x_{N},y_{N})\in{\cal X}\times{\cal Y}$ be the given training dataset. Here ${\cal X}=\{x\in{\mathbb{R}}^{n}\;|\;\|x\|_{\infty}\leq R_{\cal X}\}$ , ${\cal Y}=\{-1,+1\}$ , and $R_{\cal X}\in{\mathbb{R}}_{++}$ . Also, we set $(\alpha,c)\in{\mathbb{R}}_{+}\times{\mathbb{R}}_{++}$ . Then we have the Logitron model:

$$\min_{f\in{\cal F}}\;\sum_{i=1}^{N}L_{\alpha,c}(y_{i}f(x_{i})),\qquad(17)$$

where ${\cal F}$ is an appropriate function space such as (1) and $L_{\alpha,c}:{\mathbb{R}}\rightarrow{\mathbb{R}}_{+}$ is the Logitron loss function defined by

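$$L_{\alpha,c}(z)=\begin{cases}\ell_{\alpha,c}(z)&\hbox{if }z\in\hbox{dom}(\ell_{\alpha,c}),\\\ell_{P}(z)&\hbox{otherwise},\end{cases}\qquad(18)$$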

where $\ell_{P}(z)=\max(0,-z)$ is the Perceptron loss function in (2) and $\hbox{dom}(\ell_{\alpha,c})=\{z\in{\mathbb{R}}\;|\;-z\in\hbox{dom}(\exp_{\alpha,c})\}.$

Since the Perceptron loss is added to the extended logistic loss function, we have $\emptyset=\{z\in{\mathbb{R}}\;|\;L_{\alpha,c}(z)=+\infty\;\}$ . That is, the domain of the Logitron loss is the entire real line. Moreover, the Logitron loss is twice continuously differentiable on its entire domain ${\mathbb{R}}$ under a mild condition on $\alpha$ (Theorem III.5). See also Figure 1 and 2 for the graph of the Logitron loss and its gradient.
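To make the stitching in Definition III.4 concrete, the following sketch evaluates the Logitron loss piecewise: the extended logistic loss inside $\hbox{dom}(\ell_{\alpha,c})$ and the Perceptron loss outside. It assumes the closed forms $\exp_{\alpha,c}(z)=c\left(1-\frac{z}{c_{\alpha}}\right)^{1/(1-\alpha)}$ , $\ln_{\alpha,c}(x)=\frac{x^{1-\alpha}-c^{1-\alpha}}{1-\alpha}$ and $c_{\alpha}=\frac{c^{1-\alpha}}{\alpha-1}$ ; the function name `logitron_loss` is hypothetical.

```python
import math

def logitron_loss(z, alpha, c):
    """Logitron loss L_{alpha,c}(z): extended logistic loss stitched with max(0, -z)."""
    if alpha == 1.0:
        # Classic logistic loss; its domain is already the whole real line.
        return math.log1p(math.exp(-z))
    c_alpha = c ** (1.0 - alpha) / (alpha - 1.0)   # assumed form of c_alpha
    base = 1.0 + z / c_alpha                        # equals 1 - (-z) / c_alpha
    # -z lies in dom(exp_{alpha,c}) iff base >= 0 (alpha < 1) or base > 0 (alpha > 1).
    inside = base >= 0.0 if alpha < 1.0 else base > 0.0
    if not inside:
        return max(0.0, -z)                         # Perceptron branch
    h = c * base ** (1.0 / (1.0 - alpha))           # exp_{alpha,c}(-z)
    x = c + h
    return (x ** (1.0 - alpha) - c ** (1.0 - alpha)) / (1.0 - alpha)

# alpha = 0 gives the hinge loss max(0, c - z); alpha = 2, c = 1 touches -z at z = -1.
hinge_at_zero = logitron_loss(0.0, 0.0, 1.0)
near_boundary = logitron_loss(-1.0 + 1e-9, 2.0, 1.0)
```

In this sketch the two branches meet continuously at $z=-c_{\alpha}$ , which is the behavior Theorem III.5 establishes.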

Theorem III.5.

Let $c\in{\mathbb{R}}_{++}$ . Then the Logitron loss function $L_{\alpha,c}:{\mathbb{R}}\rightarrow{\mathbb{R}}_{+}$ (18) is convex and continuous for all $\alpha\in{\mathbb{R}}_{+}$ . When $\alpha\in{\mathbb{R}}_{++}$ , it is continuously differentiable. Moreover, if $\alpha\in(0.5,2)$ then it is continuously twice differentiable.

Proof.

When $\alpha=1$ , we get $\ell_{\alpha,c}=\ell_{L}$ the logistic loss in (4). Thus, $\ell_{L}(z)\cap\ell_{P}(z)=\emptyset$ for all $z\in{\mathbb{R}}$ and $\ell_{L}(z)$ is infinitely differentiable on ${\mathbb{R}}$ . Let us consider $\alpha\in{\mathbb{R}}_{+}\setminus\{+1\}$ .

Firstly, for the continuity of $L_{\alpha,c}$ , we only need to show that

[TABLE]

where $-z\in{\mathbb{R}}_{++}$ and $\Omega=dom(\ell_{\alpha,c})$ .

•

$0\leq\alpha<1$ : We have $-c_{\alpha}>0$ and $dom(\ell_{\alpha,c})={\mathbb{R}}_{\leq-c_{\alpha}}$ . Thus, $bd(\Omega)=\{-c_{\alpha}\}$ . Therefore, we get $\ell_{\alpha,c}(-c_{\alpha})=\ln_{\alpha,c}(c+\exp_{\alpha,c}(c_{\alpha}))=0=\ell_{P}(-c_{\alpha}).$ Additionally, since $\exp_{\alpha,c}$ is strictly increasing and convex (Proposition II.3 (1) and (2)), we have $\varepsilon=\exp_{\alpha,c}(-z)>0$ for all $z\in int(\Omega)\cap{\mathbb{R}}_{+}$ . Therefore, since $\ln_{\alpha,c}$ is strictly increasing and $\ln_{\alpha,c}(c)=0$ , we have $\ell_{\alpha,c}(z)=\ln_{\alpha,c}(c+\varepsilon)>0$ . Additionally, for all $z\in int(\Omega)\cap{\mathbb{R}}_{-}$ , we have $-z=\ln_{\alpha,c}(\exp_{\alpha,c}(-z))<\ln_{\alpha,c}(c+\exp_{\alpha,c}(-z))=\ell_{\alpha,c}(z)$ from $\exp_{\alpha,c}(-z)<c+\exp_{\alpha,c}(-z)$ and the strict monotonicity of $\ln_{\alpha,c}$ .

•

$\alpha>1$ : We have $c_{\alpha}>0$ and $dom(\ell_{\alpha,c})={\mathbb{R}}_{>-c_{\alpha}}$ . Thus $bd(\Omega)=\{-c_{\alpha}\}$ . Since $bd(\Omega)\cap\Omega=\emptyset$ , we need to be cautious on the boundary point. From the extended-valued real number system, we have $(c+\infty)_{\alpha}=0$ and thus

[TABLE]

Note that it is easy to check that $\exp_{\alpha,c}(-z)>0$ for all $z\in int(\Omega)\cap{\mathbb{R}}_{+}$ and $-z<\ln_{\alpha,c}(c+\exp_{\alpha,c}(-z))=\ell_{\alpha,c}(z)$ for all $z\in int(\Omega)\cap{\mathbb{R}}_{-}$ .

Secondly, we show the continuous differentiability of the Logitron loss $L_{\alpha,c}(z)$ on its entire domain ${\mathbb{R}}$ .

•

$0<\alpha<1$ : $dom(\ell_{\alpha,c})={\mathbb{R}}_{\leq-c_{\alpha}}$ and $bd(\hbox{dom}(\ell_{\alpha,c}))=\{-c_{\alpha}\}$ . By simple calculation, we have

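$$L^{\prime}_{\alpha,c}(z)=\begin{cases}\ell^{\prime}_{\alpha,c}(z)&\hbox{if }z<-c_{\alpha},\\0&\hbox{otherwise},\end{cases}$$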

where $\ell^{\prime}_{\alpha,c}(z)=-\left(\frac{\exp_{\alpha,c}(-z)}{c+\exp_{\alpha,c}(-z)}\right)^{\alpha}<0,\;\forall z\in int(\hbox{dom}(\ell_{\alpha,c})).$ Since $\alpha\in(0,1)$ , as $z\rightarrow-c_{\alpha}\in bd(\hbox{dom}(\ell_{\alpha,c}))$ , we get $\ell^{\prime}_{\alpha,c}(z)\rightarrow 0$ and $\ell^{\prime}_{P}(-c_{\alpha})=0$ . Therefore, $L^{\prime}_{\alpha,c}(z)$ is well defined for all $z\in{\mathbb{R}}$ .

•

$\alpha>1$ : $dom(\ell_{\alpha,c})={\mathbb{R}}_{>-c_{\alpha}}$ and $c_{\alpha}>0$ . Since $int(\hbox{dom}(\ell_{\alpha,c}))=\hbox{dom}(\ell_{\alpha,c})$ , we have

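$$L^{\prime}_{\alpha,c}(z)=\begin{cases}\ell^{\prime}_{\alpha,c}(z)&\hbox{if }z>-c_{\alpha},\\-1&\hbox{otherwise},\end{cases}$$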

where $\ell^{\prime}_{\alpha,c}(z)=-\left(\frac{\exp_{\alpha,c}(-z)}{c+\exp_{\alpha,c}(-z)}\right)^{\alpha}>-1$ for all $z\in\hbox{dom}(\ell_{\alpha,c})$ . On the other hand, since $\exp_{\alpha,c}(c_{\alpha})=\infty$ at $-c_{\alpha}\in bd(\hbox{dom}(\ell_{\alpha,c}))$ , we have $\ell^{\prime}_{\alpha,c}(-c_{\alpha})=-\left(\frac{\infty}{c+\infty}\right)^{\alpha}=-1$ , which matches the slope of the Perceptron loss $\ell^{\prime}_{P}(-c_{\alpha})=-1$ .

Thirdly, for the twice continuous differentiability of the Logitron loss, let us take the second derivative of $\ell_{\alpha,c}$ . Then, $\forall\alpha\in(0.5,2)$ and $z\in int(\hbox{dom}(\ell_{\alpha,c}))$

[TABLE]

Let us consider the case $\alpha\in(0.5,1)$ and $(1,2)$ .

•

$0.5<\alpha<1$ : From $bd(\hbox{dom}(\ell_{\alpha,c}))\cap\hbox{dom}(\ell_{\alpha,c})=\{-c_{\alpha}\}$ , we get

[TABLE]

•

$1<\alpha<2$ : From $bd(\hbox{dom}(\ell_{\alpha,c}))=\{-c_{\alpha}\}$ and $\exp_{\alpha,c}(z)=c(1-\frac{z}{c_{\alpha}})^{1/(1-\alpha)}$ , we have $\exp_{\alpha,c}(-z)^{\alpha-2}=c^{\alpha-2}\left(1-\frac{-z}{c_{\alpha}}\right)^{\frac{\alpha-2}{1-\alpha}}.$ Thus, $\ell^{\prime\prime}_{\alpha,c}(-c_{\alpha})=0$ .

Additionally, it is trivial that $\ell_{P}^{\prime\prime}(z)=0,\;\forall z\in{\mathbb{R}}\setminus\{0\}$ . Finally, $\forall\alpha\in(0.5,2)$ and $\forall z\in{\mathbb{R}}$ , we get the continuous second derivative of the Logitron loss function

[TABLE]

By Theorem III.5, the Logitron loss function can be used as a classification loss function for all $\alpha\in{\mathbb{R}}_{+}$ and $c\in{\mathbb{R}}_{++}$ . In fact, it is classification-calibrated [4].

Corollary III.6.

For all $\alpha\in{\mathbb{R}}_{+}$ and $c\in{\mathbb{R}}_{++}$ , the Logitron loss function $L_{\alpha,c}(z)$ is classification-calibrated [4].

Proof.

From Theorem III.5, $L_{\alpha,c}(z)$ is convex and differentiable at $z=0$ for all $\alpha\in{\mathbb{R}}_{+}$ .

[TABLE]

Therefore, we have $L^{\prime}_{\alpha,c}(0)=-\frac{1}{2^{\alpha}}<0$ for all $\alpha\geq 0$ . Hence $L_{\alpha,c}(z)$ is classification-calibrated, irrespective of the choice of $c\in{\mathbb{R}}_{++}$ .
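The value $L^{\prime}_{\alpha,c}(0)=-2^{-\alpha}$ can be reproduced with a few lines of code. This is only a sketch: it assumes $\exp_{\alpha,c}(z)=c\left(1-\frac{z}{c_{\alpha}}\right)^{1/(1-\alpha)}$ with $c_{\alpha}=\frac{c^{1-\alpha}}{\alpha-1}$ and uses the derivative formula from the proof of Theorem III.5; the function name `logitron_grad` is hypothetical.

```python
import math

def logitron_grad(z, alpha, c):
    """(Sub)gradient of the Logitron loss, following the proof of Theorem III.5."""
    if alpha == 1.0:
        return -1.0 / (1.0 + math.exp(z))           # logistic loss derivative
    c_alpha = c ** (1.0 - alpha) / (alpha - 1.0)    # assumed form of c_alpha
    base = 1.0 + z / c_alpha                         # equals 1 - (-z) / c_alpha
    if base <= 0.0:
        # On or beyond the boundary the slope is that of the Perceptron loss.
        return -1.0 if z < 0.0 else 0.0
    h = c * base ** (1.0 / (1.0 - alpha))            # exp_{alpha,c}(-z)
    return -((h / (c + h)) ** alpha)

# At z = 0 we have exp_{alpha,c}(0) = c, so the slope is -(c / (2c))**alpha = -2**(-alpha).
slopes_at_zero = {a: logitron_grad(0.0, a, 1.0) for a in (0.0, 0.5, 1.0, 1.5, 2.0)}
```

Since the slope at the origin is strictly negative for every $\alpha\geq 0$ and does not involve $c$ , the calibration argument is independent of the margin parameter.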

Additionally, the Logitron loss function (18) is sufficiently smooth. That is, the gradient of it is continuous on its entire domain ${\mathbb{R}}$ and bounded by one. Therefore, we could use any gradient-based optimization method such as L-BFGS [29].

Corollary III.7.

For all $\alpha\in{\mathbb{R}}_{+}$ and $c\in{\mathbb{R}}_{++}$ , the Logitron loss function $L_{\alpha,c}(z)$ is Lipschitz continuous with Lipschitz constant one for all $z\in{\mathbb{R}}$ . That is, we have

$$|L_{\alpha,c}(z_{1})-L_{\alpha,c}(z_{2})|\leq|z_{1}-z_{2}|$$

for all $z_{1},z_{2}\in{\mathbb{R}}$ .

Proof.

Let $\alpha\in{\mathbb{R}}_{++}$ . From (19) and (20), we have $|L^{\prime}_{\alpha,c}(z)|\leq 1$ . Moreover, when $\alpha=0$ , we get $|\partial L_{\alpha,c}(z)|\leq 1$ . Here $\partial L_{\alpha,c}(z)$ is a subgradient of $L_{\alpha,c}$ .
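The Lipschitz bound can also be probed empirically on a grid of points. The sketch below is an illustration only, under the same assumed closed forms for $\exp_{\alpha,c}$ , $\ln_{\alpha,c}$ and $c_{\alpha}$ as in the rest of this section; `logitron_loss` and `lipschitz_violations` are hypothetical names.

```python
import math

def logitron_loss(z, alpha, c):
    """Logitron loss: extended logistic loss inside its domain, max(0, -z) outside."""
    if alpha == 1.0:
        return math.log1p(math.exp(-z))
    c_alpha = c ** (1.0 - alpha) / (alpha - 1.0)   # assumed form of c_alpha
    base = 1.0 + z / c_alpha
    inside = base >= 0.0 if alpha < 1.0 else base > 0.0
    if not inside:
        return max(0.0, -z)
    x = c + c * base ** (1.0 / (1.0 - alpha))
    return (x ** (1.0 - alpha) - c ** (1.0 - alpha)) / (1.0 - alpha)

def lipschitz_violations(alpha, c, tol=1e-12):
    """Count grid pairs (z1, z2) with |L(z1) - L(z2)| > |z1 - z2| + tol."""
    grid = [i * 0.25 for i in range(-20, 21)]
    values = [logitron_loss(z, alpha, c) for z in grid]
    count = 0
    for z1, v1 in zip(grid, values):
        for z2, v2 in zip(grid, values):
            if abs(v1 - v2) > abs(z1 - z2) + tol:
                count += 1
    return count

total_violations = sum(
    lipschitz_violations(a, c)
    for a in (0.0, 0.5, 1.0, 1.5, 2.0)
    for c in (0.5, 1.0, 2.0)
)
```

A grid check of this kind is not a proof, but under the stated assumptions it finds no pair that violates the bound with constant one.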

Before we go further, it is worth mentioning the (un)hinged loss function. The extended logistic loss function with $\alpha=0$ reduces to an unconventional hinge loss function, known as the unhinged loss function [34]. In fact, the extended logistic loss under the domains in Definition II.1 and II.2 becomes $\ell_{0,c}(z)=[c+(-z+c)]-c=c-z$ where $-z\in{\mathbb{R}}$ . The main advantage of this unhinged loss function is that it is robust to symmetric label noise. In fact, as observed in [34], if a convex loss function is lower bounded, then it is not robust to symmetric label noise. The Logitron with $\alpha=0$ , however, is the hinge loss function, which can be reformulated with the first-order hinge loss function in (3):

[TABLE]

The Logitron with $0<\alpha<1$ can be regarded as a smoothed hinge loss when we set $c_{\alpha}=-1$ . Actually, the $k$ -th order hinge loss in (3) with an additional $k$ -th root stabilizer function is a special case of the Logitron with $0<\alpha<1$ and $c_{\alpha}=-1$ . Also, as observed in Figure 1 and 2, the Logitron loss function with $c=1$ behaves like the logistic loss function when $\alpha\approx 1$ . Therefore, it is natural to separate the Logitron into two categories: one is the hinge-like Logitron loss function and the other is the logistic-like Logitron loss function. In Section IV, we analyze the Logitron model from these two points of view.

IV The Low complexity Logitron with $\alpha\in({\mathbb{R}}_{+}\setminus\{1\})\cap\mathbb{Q}$

In the previous Section, we found that the Logitron has many useful properties such as smoothness and classification-calibration. However, to be more practical in terms of computation, we need to reduce the spaces of model parameter $\alpha$ and of margin parameter $c$ . In this Section, we introduce the low complexity Logitron loss function (18) with $\alpha\in{\mathbb{R}}_{+}\cap\mathbb{Q}$ based on higher-order hinge loss in (3). With additional restriction of $\alpha$ and $c$ , we have two different categories of the Logitron; one is the hinge loss-like Logitron (Hinge-Logitron) and the other is the logistic loss-like Logitron (Logistic-Logitron).

Let us start with the generalized $q$ -th order hinge loss function. As stated in [22], it corresponds to a basis function of the generalized $q$ -th order spline.

$$\ell_{H,q}(z)=\max(0,1-z)^{q},\qquad(23)$$

where $q\in\mathbb{Q}^{*}=\{\frac{a}{b}\;|\;\frac{a}{b}\not\in[0,1)\hbox{ and }a,b\in\mathbb{Z}\}$ . Interestingly, the low complexity Logitron with $\alpha\in({\mathbb{R}}_{+}\setminus\{1\})\cap\mathbb{Q}$ can be easily reformulated with the generalized $q$ -th order hinge loss in (23). In fact, let us modify the extended exponential function $\exp_{\alpha,c}(z)$ with $\max(0,\cdot)$ :

$$\hbox{Exp}_{\alpha,c}(z)=c\,\max\left(0,1-\frac{z}{c_{\alpha}}\right)^{\frac{1}{1-\alpha}},\qquad(24)$$

where $\alpha\in({\mathbb{R}}_{+}\setminus\{1\})\cap\mathbb{Q}$ . Then we have a connection between (23) and (24):

$$\hbox{Exp}_{\alpha,c}(z)=c\,\ell_{H,\frac{1}{1-\alpha}}\left(\frac{z}{c_{\alpha}}\right).\qquad(25)$$

If we set $|c_{\alpha}|=1$ and $q=\frac{1}{1-\alpha}$ then $\hbox{Exp}_{\alpha,c}$ becomes the generalized $q$ -th order hinge loss function with an additional margin control parameter $c$ . Now, let us reformulate the Logitron with the modified extended exponential function $\hbox{Exp}_{\alpha,c}$ (24).

Theorem IV.1.

Let $\alpha\in({\mathbb{R}}_{+}\setminus\{1\})\cap\mathbb{Q}$ and $c\in{\mathbb{R}}_{++}$ . Then the low complexity Logitron can be reformulated with $\hbox{Exp}_{\alpha,c}(\cdot)$ (24):

[TABLE]

where

[TABLE]

Proof.

When $\alpha=0$ , we get the first-order hinge loss in (3). Now, let us assume that $\alpha\in(0,1)$ , then we have $-z\in dom(\exp_{\alpha,c})={\mathbb{R}}_{\geq c_{\alpha}}$ . In this region, it is easy to see $\exp_{\alpha,c}(-z)=\hbox{Exp}_{\alpha,c}(-z)$ . Also, when $-z\not\in\hbox{dom}(\exp_{\alpha,c})$ , we get $\hbox{Exp}_{\alpha,c}(-z)=0$ and, from the definition of the Logitron loss in (18), $\ell_{P}(z)=0$ . Now let us assume that $\alpha>1$ . For all $-z\in\hbox{dom}(\exp_{\alpha,c})={\mathbb{R}}_{<c_{\alpha}}$ , it is easy to check $\hbox{Exp}_{\alpha,c}(-z)=\exp_{\alpha,c}(-z)$ . On the other hand, let $-z\not\in\hbox{dom}(\exp_{\alpha,c})={\mathbb{R}}_{<c_{\alpha}}$ , then $-z\geq c_{\alpha}$ and thus we have $\hbox{Exp}_{\alpha,c}(-z)=c\ell_{H,\frac{1}{1-\alpha}}\left(\frac{-z}{c_{\alpha}}\right)=c\left(\frac{1}{\max\left(0,1-\frac{(-z)}{c_{\alpha}}\right)}\right)^{\frac{1}{\alpha-1}}=c\left(\frac{1}{0}\right)^{\frac{1}{\alpha-1}}=\infty$ and $\ln_{\alpha,c}(c+\hbox{Exp}_{\alpha,c}(-z))=\ln_{\alpha,c}(c+\infty)=\frac{1}{1-\alpha}\left(\frac{1}{c+\infty}\right)^{\alpha-1}+c_{\alpha}=c_{\alpha}$ . Here, with an additional $\max(-z,\cdot)$ function, we have $\max(-z,{\cal L}_{\alpha,c}(z))=\max(-z,c_{\alpha})=-z.$ Note that if $z<0$ then $\ell_{P}(z)=\max(0,-z)=-z$ . Therefore, when $z\not\in\hbox{dom}(\ell_{\alpha,c})={\mathbb{R}}_{>-c_{\alpha}}$ (i.e., $z\leq-c_{\alpha}$ ), we have $\max(-z,{\cal L}_{\alpha,c}(z))=L_{\alpha,c}(z).$
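The argument of the proof can be illustrated numerically. The sketch below assumes that the reformulation takes the form $\max(-z,{\cal L}_{\alpha,c}(z))$ with ${\cal L}_{\alpha,c}(z)=\ln_{\alpha,c}(c+\hbox{Exp}_{\alpha,c}(-z))$ , which is how the proof uses it, together with the closed forms for $\ln_{\alpha,c}$ and $c_{\alpha}$ inferred from this section. It compares that single expression with the piecewise definition (18) on a grid; all function names are hypothetical.

```python
import math

def c_sub(alpha, c):
    return c ** (1.0 - alpha) / (alpha - 1.0)       # assumed form of c_alpha

def ext_log(x, alpha, c):
    # For alpha > 1, x = inf gives (0 - c**(1 - alpha)) / (1 - alpha) = c_alpha.
    return (x ** (1.0 - alpha) - c ** (1.0 - alpha)) / (1.0 - alpha)

def mod_exp(y, alpha, c):
    """Modified extended exponential Exp_{alpha,c}(y), clipped with max(0, .)."""
    base = max(0.0, 1.0 - y / c_sub(alpha, c))
    if base == 0.0:
        return 0.0 if alpha < 1.0 else math.inf
    return c * base ** (1.0 / (1.0 - alpha))

def logitron_max(z, alpha, c):
    """Single-expression form: max(-z, ln_{alpha,c}(c + Exp_{alpha,c}(-z)))."""
    return max(-z, ext_log(c + mod_exp(-z, alpha, c), alpha, c))

def logitron_piecewise(z, alpha, c):
    """Piecewise form (18): extended logistic loss in its domain, Perceptron outside."""
    base = 1.0 - (-z) / c_sub(alpha, c)
    inside = base >= 0.0 if alpha < 1.0 else base > 0.0
    if not inside:
        return max(0.0, -z)
    return ext_log(c + c * base ** (1.0 / (1.0 - alpha)), alpha, c)

max_gap = max(
    abs(logitron_max(i * 0.25, a, c) - logitron_piecewise(i * 0.25, a, c))
    for a in (0.5, 0.75, 1.5, 2.0)
    for c in (0.5, 1.0, 2.0)
    for i in range(-20, 21)
)
```

The practical benefit of the single-expression form is that no explicit domain test is needed at evaluation time.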

Note that, although we have restricted the range of $\alpha$ in Theorem IV.1 for practical reasons, it can be extended to ${\mathbb{R}}_{+}\setminus\{1\}$ . When we set $|c_{\alpha}|=1$ , the Logitron loss is similar to the generalized $q$ -th order hinge loss function (23). However, if we set $c=1$ then the extended exp and log approximate the conventional exp and log functions. In particular, when $\alpha\approx 1$ , the Logitron loss function is almost equal to the logistic loss function. See Figure 1 and 2. As a consequence, we have four different categories of the Logitron loss function based on the model parameter $\alpha$ and the margin parameter $c$ .

•

Hinge-Logitron ( $|c_{\alpha}|=1$ ): H-Logitron ( $0<\alpha<1$ and $c_{\alpha}=-1$ ) and H+Logitron ( $\alpha>1$ and $c_{\alpha}=1$ )

•

Logistic-Logitron ( $c=1$ ): L-Logitron ( $0<\alpha<1$ and $c=1$ ) and L+Logitron ( $\alpha>1$ and $c=1$ )

Since the parameter $q$ of the generalized $q$ -th order hinge-loss function (23) can be negative, the classic margin concept also needs to be generalized. We call the margin positive if the loss function touches the Perceptron loss on the positive axis. On the other hand, if the loss function touches the Perceptron loss on the negative axis, then we call the touching point the negative margin. The positive margin ( $\alpha<1$ ) and the negative margin ( $\alpha>1$ ) are both equal to $|c_{\alpha}|$ (i.e., they are given by $bd(\hbox{dom}\ell_{\alpha,c})$ ). Since the logistic loss does not touch the Perceptron loss, logistic regression does not have a margin. In the Hinge-Logitron, the H-Logitron loss function has a positive margin like the higher-order hinge loss (3). Figure 1 (c) compares the H-Logitron with the first-order hinge loss (SVM) and the second-order hinge loss (L2SVM). The H+Logitron loss function, however, has a negative margin through the Perceptron line (i.e., $\ell_{P}(z)$ ). See Figure 2 (c) for the shape of the H+Logitron with various choices of $\alpha=i/5$ and $i=6,7,8,9$ . As regards the Logistic-Logitron model, we have the L-Logitron loss function approximating the logistic loss with a positive margin and the L+Logitron loss function approximating the logistic loss with a negative margin. See Figure 1 (d) and Figure 2 (d), respectively.

It is useful to see the direct connection between the higher-order hinge loss in (3) and the Logitron loss function $L_{\alpha,c}(z)$ with $\frac{1}{1-\alpha}=k\in\mathbb{Z}\setminus\{0\}$ . Here $\alpha=1-k^{-1}\in[0,2]$ . Then (25) is simplified as $\hbox{Exp}_{1-k^{-1},c}(z)=c\ell_{H,k}\left(\frac{z}{c_{k}}\right)$ with $\alpha=1-k^{-1}$ and $c_{k}=-kc^{1/k}$ . Now, when $k\geq 1$ (i.e., $\alpha\in[0,1)$ ), we have $c_{k}<0$ and obtain the H-Logitron ( $c_{k}=-1$ )

[TABLE]

This means that the H-Logitron with $\alpha=1-k^{-1}$ and $k\in\mathbb{Z}\setminus\{0\}$ is a higher-order SVM with an additional $k$ -th root stabilizer function. As observed in Figure 1 (c), the second-order hinge loss (L2SVM) heavily penalizes misclassified data. On the other hand, the penalty of the H-Logitron on misclassified data is stabilized, irrespective of the choice of $k>0$ .
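The stabilizing effect of the $k$ -th root can be seen in a small numerical example. The sketch below relies on a closed form that we derive ourselves by substituting $\alpha=1-k^{-1}$ and $c_{k}=-kc^{1/k}=-1$ (so $c=k^{-k}$ ) into $\ln_{\alpha,c}(c+\hbox{Exp}_{\alpha,c}(-z))$ , namely $\left(1+\max(0,1-z)^{k}\right)^{1/k}-1$ . This is an assumption consistent with the definitions above, not a formula quoted from the paper, and the function names are hypothetical.

```python
def h_logitron(z, k):
    """Derived closed form: k-th order hinge loss with a k-th root stabilizer."""
    return (1.0 + max(0.0, 1.0 - z) ** k) ** (1.0 / k) - 1.0

def h_logitron_general(z, k):
    """Same loss via ln_{alpha,c}(c + Exp_{alpha,c}(-z)), alpha = 1 - 1/k, c = k**(-k)."""
    alpha = 1.0 - 1.0 / k
    c = float(k) ** (-k)                            # chosen so that c_k = -k * c**(1/k) = -1
    x = c + c * max(0.0, 1.0 - z) ** (1.0 / (1.0 - alpha))
    return (x ** (1.0 - alpha) - c ** (1.0 - alpha)) / (1.0 - alpha)

def hinge_order_k(z, k):
    """Plain k-th order hinge loss max(0, 1 - z)**k (k = 2 corresponds to L2SVM)."""
    return max(0.0, 1.0 - z) ** k

# A badly misclassified point: L2SVM penalizes it quadratically,
# while the stabilized loss grows only about linearly.
z_bad = -10.0
l2svm_penalty = hinge_order_k(z_bad, 2)
stabilized_penalty = h_logitron(z_bad, 2)
```

For $k=1$ the stabilized loss coincides with the first-order hinge loss, and for larger $k$ it keeps the positive margin at $z=1$ while growing like $-z$ for large negative $z$ .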

When $k\leq-1$ (i.e., $\alpha>1$ ), we get a totally new classification model, H+Logitron.

[TABLE]

where $c_{k}=-kc^{1/k}>0$ and

[TABLE]

In this instance, we do not have a positive margin. That is, $L_{\alpha,c}(z)\cap\ell_{P}(z)=-z$ for all $z\leq-c_{k}$ and $L_{\alpha,c}(z)>0$ for all $z>-c_{k}.$ By controlling $c_{k}$ (i.e., the negative margin), we control the closeness of the H+Logitron $L_{\alpha,c}$ to the Perceptron loss function $\ell_{P}$ . Although the H+Logitron does not have the classic (positive) margin, its simple structure makes it worth investigating in more detail. For instance, let $k=-1$ ( $\alpha=2$ ); then we have

[TABLE]

where $c_{-1}=c^{-1}$ . Interestingly, the singularity that existed on the boundary of the domain of the extended exponential function is removed. Moreover, as noted in Theorem III.5, the H+Logitron with $k=-1$ has a continuous derivative, $\frac{dL_{2,c}(z)}{dz}=\max\left(-1,-(2+cz)^{-2}\right)$ . The most important feature of (30) is that only division and multiplication are needed to evaluate the loss function and its gradient. This is the main advantage of (30). As observed in Section V, its performance is comparable to that of logistic regression and SVM. Note that, when $k=-2$ ( $\alpha=3/2$ ), the H+Logitron can be reformulated as

[TABLE]

This model is rather complicated. However, it is also smooth on the entire domain ${\mathbb{R}}$ and classification-calibrated. In fact, when $z\gg c_{-2}$ , $L_{3/2,c}(z)\approx 0$ and, when $z\leq-c_{-2}$ , $L_{3/2,c}(z)=-z$ . It behaves like the conventional margin-based loss function.
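For the $k=-1$ ( $\alpha=2$ ) case discussed above, the loss and its gradient can be written with elementary arithmetic only. The sketch below uses a piecewise form that we derive from $\ln_{2,c}(x)=c^{-1}-x^{-1}$ and $\exp_{2,c}(z)=\frac{c}{1-cz}$ (assumed closed forms), which gives $\frac{1}{c(2+cz)}$ for $z>-c^{-1}$ and $-z$ otherwise. The function names are hypothetical.

```python
def hplus2_loss(z, c=1.0):
    """H+Logitron loss for alpha = 2: Perceptron branch for z <= -1/c, rational branch otherwise."""
    if z <= -1.0 / c:
        return -z
    return 1.0 / (c * (2.0 + c * z))

def hplus2_grad(z, c=1.0):
    """Gradient of hplus2_loss: -1 on the Perceptron branch, -(2 + c z)**(-2) otherwise."""
    if z <= -1.0 / c:
        return -1.0
    t = 2.0 + c * z
    return -1.0 / (t * t)

# Value and slope at the decision boundary z = 0 (c = 1), and at the negative margin z = -1.
loss_at_zero, grad_at_zero = hplus2_loss(0.0), hplus2_grad(0.0)
loss_at_margin, grad_at_margin = hplus2_loss(-1.0), hplus2_grad(-1.0)
```

At the negative margin $z=-c^{-1}$ both branches give the value $c^{-1}$ and the slope $-1$ , so the two pieces join with a continuous derivative.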

V Experiments with various $\ell_{2}$ -regularized Logitron models

This section compares the performance of the proposed Logitron with logistic regression and SVM within the linear classification framework.

Let us define the Logitron minimization problem with the linear function space in (1). For simplicity, we use $\ell_{2}$ -regularizer, but it could be replaced with a sophisticated regularization model.

[TABLE]

where $Reg(w)=\|w\|_{2}^{2}=\langle w,\,w\rangle$ is the $\ell_{2}$ -regularizer, $H(w,b)=\sum_{i=1}^{N}L_{\alpha,c}(y_{i}[\langle w,\,x_{i}\rangle+b]),$ and $(x_{i},y_{i})\in{\mathbb{R}}^{n}\times\{-1,+1\}$ . Although the loss function $H$ is rather complicated, it has many useful properties for gradient-based optimization. Indeed, the loss function $H$ is convex and differentiable on ${\mathbb{R}}^{n+1}$ , irrespective of the choice of $\alpha\in{\mathbb{R}}_{++}$ . For simplicity, we use the L-BFGS algorithm in minFunc [29], which is implemented in MATLAB. Note that we use the well-known LIBLINEAR package [18] as the benchmark for the proposed Logitron model. Among the various linear classification models in LIBLINEAR, we select typical models: logistic regression (4) and higher-order SVM (3) (the first-order SVM and the second-order SVM (i.e., L2SVM)). For logistic regression, we use the primal formulation ( $s=0$ ). For SVM, we use the dual formulation ( $s=3$ ). For L2SVM, we use the primal formulation ( $s=2$ ). We also use the bias term in LIBLINEAR ( $B=1$ ). Note that all models have an $\ell_{2}$ -regularization term. As regards the regularization parameter $\lambda$ , we simply use the following parameter selection strategy for $\lambda$ , as recommended in LIBSVM [11].

[TABLE]

In the models of LIBLINEAR, the regularization parameter is located on the loss function and thus we use $\lambda^{-1}$ of (33) for the regularization parameter of them.
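To illustrate how an $\ell_{2}$ -regularized Logitron can be trained with a gradient-based method, the following sketch fits the H+Logitron with $\alpha=2$ and $c=1$ on a tiny synthetic dataset. Several things here are assumptions made for the illustration: the objective is taken to be $\lambda\,Reg(w)+H(w,b)$ (the exact weighting in (32) may differ), plain gradient descent replaces L-BFGS, and the data, step size, and value of $\lambda$ are made up.

```python
def hplus2_loss(z):
    # H+Logitron loss with alpha = 2, c = 1 (derived piecewise form, see assumptions above).
    return -z if z <= -1.0 else 1.0 / (2.0 + z)

def hplus2_grad(z):
    return -1.0 if z <= -1.0 else -1.0 / ((2.0 + z) ** 2)

def margins(w, b, X, y):
    """Signed margins y_i * (<w, x_i> + b)."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi, yi in zip(X, y)]

def objective(w, b, X, y, lam):
    """Assumed objective: lam * ||w||^2 + sum_i L(y_i (<w, x_i> + b))."""
    reg = lam * sum(wj * wj for wj in w)
    return reg + sum(hplus2_loss(m) for m in margins(w, b, X, y))

def fit(X, y, lam=0.1, step=0.01, iters=2000):
    """Plain gradient descent on the assumed objective (a stand-in for L-BFGS)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(iters):
        grad_w = [2.0 * lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi, m in zip(X, y, margins(w, b, X, y)):
            g = hplus2_grad(m) * yi
            grad_w = [gw + g * xj for gw, xj in zip(grad_w, xi)]
            grad_b += g
        w = [wj - step * gw for wj, gw in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

# Hypothetical, linearly separable toy data.
X = [(2.0, 1.0), (1.5, 2.0), (3.0, 2.5), (-2.0, -1.0), (-1.0, -2.5), (-3.0, -2.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = fit(X, y)
train_accuracy = sum(m > 0 for m in margins(w, b, X, y)) / len(y)
```

Because the loss is convex with a bounded, continuous gradient, any standard first-order or quasi-Newton solver could be substituted for the simple loop used here.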

In terms of the parameter space of the Logitron, we need to select not only the regularization parameter $\lambda$ but also the model parameter $\alpha$ and the margin parameter $c$ . From the analysis in Section IV, we know that the Logitron has four different submodels (H-Logitron, H+Logitron, L-Logitron, and L+Logitron). The H-Logitron is the higher-order SVM with an additional stabilization function (28). For simplicity, we only consider second- to fifth-order SVMs with the corresponding $k$ -th root function. In the category of H+Logitron, we have two sub-models: H+Logitron with $\alpha=2$ ( $k=-1$ ) (30) and H+Logitron with $\alpha=3/2$ ( $k=-2$ ) (31). Actually, the minimization problem of the H+Logitron with $\alpha=2$ can be solved by using elementary arithmetic such as division and multiplication. In total, we have nine sub-models: H-1 ( $\alpha=i/5$ with $i=1,2,3,4$ i.e., $k=2,3,4,5$ ), H-2 ( $\alpha=1/2$ , i.e., $k=2$ ), H-3 ( $\alpha=2/3$ , i.e., $k=3$ ), H-4 ( $\alpha=3/4$ , i.e., $k=4$ ), H+1 ( $\alpha=j/5$ with $j=6,7,8,9$ ), H+2 ( $\alpha=2$ ), H+3 ( $\alpha=3/2$ ), L- ( $\alpha=4/5,5/6,7/8,11/12$ ), and L+ ( $\alpha=4/3,5/4,8/7,13/12$ ). Based on the analysis in Section IV, except for H-1 and H+1, the model parameter $\alpha$ of every sub-Logitron model is in the category $\{\alpha\in{\mathbb{R}}_{++}\;|\;\frac{1}{1-\alpha}=k\in\mathbb{Z}\setminus\{0\}\}$ . We summarize the parameter space of each sub-Logitron model in Table II. Four-fold cross validation [15] is used to select the optimal parameters of the nine sub-Logitron models and the three models of LIBLINEAR. Since the cross-validation runs are independent of each other, they are easy to implement on parallel processing machines.
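The correspondence between the model parameter $\alpha$ and the order $k$ can be tabulated with exact rational arithmetic. The sketch below uses $\alpha=1-k^{-1}$ from Section IV and, for the Hinge-Logitron, solves $|c_{k}|=|k|\,c^{1/k}=1$ for $c$ , which gives $c=|k|^{-k}$ . This last expression is our own rearrangement and the function names are hypothetical.

```python
from fractions import Fraction

def alpha_from_order(k):
    """alpha = 1 - 1/k for a nonzero integer order k."""
    return 1 - Fraction(1, k)

def order_from_alpha(alpha):
    """k = 1 / (1 - alpha); an integer only for the low complexity sub-models."""
    return 1 / (1 - Fraction(alpha))

def hinge_margin_c(k):
    """Value of c for which |c_k| = |k| * c**(1/k) equals one (derived, see lead-in)."""
    return Fraction(abs(k)) ** (-k)

# Sub-models with an integer order, as listed in the text above.
h_minus = {k: alpha_from_order(k) for k in (2, 3, 4)}      # H-2, H-3, H-4
h_plus = {k: alpha_from_order(k) for k in (-1, -2)}        # H+2, H+3
l_minus_orders = [order_from_alpha(a) for a in
                  (Fraction(4, 5), Fraction(5, 6), Fraction(7, 8), Fraction(11, 12))]
```

With these helpers, H-2, H-3 and H-4 map to $\alpha=1/2,2/3,3/4$ and H+2, H+3 map to $\alpha=2,3/2$ , matching the list in the text.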

As the benchmark, we use the well-organized datasets in [15] to report the performance of the nine sub-Logitron models. They are pre-processed and normalized in each feature dimension to mean zero and variance one. The raw data are mostly from the UCI machine learning repository. Note that, as commented in [40], we reorganize the datasets in [15]. First, each dataset is separated into non-overlapping training and test sets. Each training set is randomly shuffled for $4$ -fold cross validation. Among the datasets in [15], we use $118$ datasets after removing those that are ambiguous in terms of the data splitting strategy. In the Appendix, we list the information of all datasets, such as the number of instances, number of training data, number of test data, feature dimension, and number of classes. See Table V for more details. Last but not least, for multi-class datasets, we use the one-vs-all strategy, the most commonly used strategy for multi-class classification based on a binary classifier. This strategy is also used in LIBLINEAR [18].

The whole experiment is run five times and the averaged test score for each dataset is reported in Table VI and Table VII in the Appendix. In each run, the best parameters are chosen through $4$ -fold cross-validation. With the chosen parameters, we minimize (32) on the whole training data in Table V to find the hyperplane, i.e., $(w,b)$ . Then we evaluate the performance of each classification model on the test data in Table V. For more details on CV-based minimization, see [11]. All numerical results are summarized in Table III. In terms of classification accuracy, the H-Logitron H-4 is the best classification model, and the L-Logitron L- obtains the best Friedman ranking [15]. The H-Logitron submodels (H-2, H-3, and H-4) are $k$ -th order SVMs ( $k=2,3,4$ ) with the corresponding $k$ -th root stabilization functions. In this category, the performance improves as we increase the order of the model. Interestingly, H-2 (the second-order SVM with the root stabilization function) outperforms the classic second-order SVM, i.e., L2SVM [18]. An H+Logitron submodel, namely the cheapest classification model H+2 with (30), also shows performance comparable to the classic logistic regression.

Table IV presents the datasets in the Best- $1\%$ set of each classifier in terms of the relative classification accuracy (racc $>-1$ ). Here, the relative classification accuracy (racc) is the accuracy of each classifier minus the accuracy of the virtual DWN in [15]. Note that the virtual DWN classifier is the best classifier among $179$ classifiers, including boosting, neural networks, and random forests, for each individual dataset with respect to classification accuracy. That is, it is not a specific classifier that exists in the real world but an idealized virtual classifier. Although the function space of the Logitron is linear, interestingly, the proposed Logitron model achieves better performance than the optimal DWN classifier on some datasets such as ’hill-valley’, ’acute-inflammation’, ’acute-nephritis’, ’heart-hungarian’, ’credit-approval’, etc.

In Figure 4, 5, and 6, we summarize statistical information on the parameters $\lambda$ , $\alpha$ , and $c$ (or $c_{\alpha}$ ) selected via $4$ -fold cross-validation on the training data in Table V. Since we ran the whole experiment five times, the histograms are generated from $590$ samples. They are normalized for a probabilistic interpretation of the parameter data. For each model, we plot histograms of two sets of datasets: Best- $1\%$ (Left) and the remainder (Right). Figure 4 shows the normalized histogram of $\lambda$ with respect to $\log_{2}(\lambda)$ . The regularization parameters $\lambda$ of all Logitron sub-models for the Best- $1\%$ set are mainly located near $2^{-14}$ or $>2^{0}$ . Note that the Logitron is not jointly convex with respect to $\lambda$ , $w$ , and $b$ . Thus, there are many local minima in the selection of the regularization parameter with cross-validation. Due to this inherent ambiguity, there are many candidates for the best regularization parameter. Therefore, when the training accuracies are tied, we simply select the regularization parameter with the smaller value. As a result of this selection process, we have a relatively high frequency at $2^{-15}$ . Figure 5 visualizes $\alpha$ for the various Logitron sub-models. The Logitron with $\alpha<1$ (i.e., H-1 and L-) in Best- $1\%$ prefers smaller values of $\alpha$ than in the remainder set. Figure 6 shows the preference of the margin parameter $c_{\alpha}$ in the Logitron submodels H-2, H-3, H-4 and H+2, H+3. Overall, H-3, H-4, and H+3 prefer $|c_{\alpha}|=1$ more often in Best- $1\%$ than in the remainder set.
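The tie-breaking rule for the regularization parameter described above can be sketched in a few lines. The candidate values and scores below are made up for illustration and are not taken from the paper; `select_lambda` is a hypothetical helper.

```python
def select_lambda(cv_scores):
    """Return the smallest lambda among those that reach the best cross-validation score.

    cv_scores maps each candidate lambda to its accuracy (hypothetical values).
    """
    best = max(cv_scores.values())
    return min(lam for lam, score in cv_scores.items() if score == best)

# Made-up scores with a tie between two candidates.
example_scores = {2.0 ** -15: 0.90, 2.0 ** -3: 0.90, 2.0 ** 0: 0.85}
chosen = select_lambda(example_scores)
```

A rule of this kind favors the smallest candidate whenever accuracies are tied, which is consistent with the concentration of selected values at the lower end of the range reported above.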

VI Conclusion

In this article, we have introduced a general convex classification framework, the Logitron, whose loss is an extended logistic loss function stitched with the classic Perceptron loss function. The proposed Logitron has several useful features. A typical one is that it is differentiable on the whole real line for all $\alpha\in{\mathbb{R}}_{++}$ . Therefore, it is easy to use conventional optimization algorithms. Depending on the choice of the parameters, we have two different categories of models: the Hinge-Logitron model ( $|c_{\alpha}|=1$ ) and the Logistic-Logitron model ( $c=1$ ). A Hinge-Logitron model H-4 (the fourth-order SVM with an additional fourth root function) outperforms the other sub-Logitron models and the models in LIBLINEAR [18] in terms of classification accuracy. Additionally, a simple classification model H+2 shows reasonable performance compared to the classic logistic regression. A Logistic-Logitron model L- shows the best performance in terms of Friedman ranking.

Acknowledgments

This article is supported by the Basic Science Program through the NRF of Korea funded by the Ministry of Education (NRF-2015R101A1A01061261). The Logitron is implemented on top of the machine learning MATLAB package minFunc, which is available at https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html.

Bibliography

[1] S. Amari and H. Nagaoka, Methods of Information Geometry, AMS, 2000.
[2] S. Amari, Information geometry and its applications, Springer, 2016.
[3] A. Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh, "Clustering with Bregman Divergences", J. of Mach. Learn. Res., 6 (2005), pp. 1705-1749.
[4] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, "Convexity, classification, and risk bounds", J. of the American Stat. Association: Theory and Meth., 101 (2006), pp. 138-156.
[5] Y. Bengio, Y. Lecun, and G. Hinton, "Deep Learning", Nature, 521 (2015), pp. 436-444.
[6] L. Bottou and O. Bousquet, "The tradeoffs of large scale learning", NIPS (2008).
[7] S. Boucheron, O. Bousquet, and G. Lugosi, "Theory of classification: A survey of some recent advances", ESAIM: Prob. and Stat., 9 (2005), pp. 323-375.
[8] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and regression trees", Wadsworth and Brooks/Cole Advanced Books and Software.
