A Modified Construction for a Support Vector Classifier to Accommodate Class Imbalances

Matt Parker; Colin Parker

arXiv:1702.02555·stat.ML·February 13, 2017

Open Access

TL;DR

This paper proposes a modified support vector classifier that adjusts margins based on class variance to better handle imbalanced data, improving classification accuracy.

Contribution

It introduces a novel SVM formulation with class-specific margins proportional to class standard deviations, enhancing performance on imbalanced datasets.

Findings

01. Improved classification accuracy on imbalanced datasets

02. The modified SVM reduces bias towards the majority class

03. The approach generalizes standard SVM when class variances are equal

Equations

$$\{x : f(x) = x^{T}\beta + \beta_{0} = 0\}$$

$$G(x) = \operatorname{sign}\left[x^{T}\beta + \beta_{0}\right]$$

$$\max_{\beta,\,\beta_{0},\,\|\beta\|=1} M \quad \text{subject to } y_{i}(x_{i}^{T}\beta + \beta_{0}) \geq M,\quad i = 1,\dots,N$$

$$\min_{\beta,\,\beta_{0}} \|\beta\| \quad \text{subject to } y_{i}(x_{i}^{T}\beta + \beta_{0}) \geq 1,\quad i = 1,\dots,N$$

$$\zeta_{i} = \max\left(0,\ 1 - y_{i}(x_{i}^{T}\beta + \beta_{0})\right)$$

$$\sum_{i=1}^{N} \zeta_{i} \leq \text{constant}$$

$$\min_{\beta,\,\beta_{0}} \frac{1}{2}\|\beta\|^{2} + C\sum_{i=1}^{N}\zeta_{i} \quad \text{subject to } \zeta_{i} \geq 0,\ y_{i}(x_{i}^{T}\beta + \beta_{0}) \geq 1 - \zeta_{i}\ \forall i$$

$$L_{P} = \frac{1}{2}\|\beta\|^{2} + C\sum_{i=1}^{N}\zeta_{i} - \sum_{i=1}^{N}\alpha_{i}\left[y_{i}(x_{i}^{T}\beta + \beta_{0}) - (1 - \zeta_{i})\right] - \sum_{i=1}^{N}\mu_{i}\zeta_{i}$$

$$\beta = \sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}, \qquad 0 = \sum_{i=1}^{N}\alpha_{i}y_{i}, \qquad \alpha_{i} = C - \mu_{i}\ \forall i$$

$$L_{D} = \sum_{i=1}^{N}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{N}\sum_{i'=1}^{N}\alpha_{i}\alpha_{i'}y_{i}y_{i'}\langle x_{i}, x_{i'}\rangle$$

$$\alpha_{i}\left[y_{i}(x_{i}^{T}\beta + \beta_{0}) - (1 - \zeta_{i})\right] = 0, \qquad \mu_{i}\zeta_{i} = 0, \qquad y_{i}(x_{i}^{T}\beta + \beta_{0}) - (1 - \zeta_{i}) \geq 0$$

$$\sigma_{K,\beta} = \sigma_{y_{j},\beta} = \left(\sum_{j:\,y_{j}=y_{i}}\left[(x_{j} - \overline{x})\cdot\left(\frac{\beta}{\|\beta\|}\right)\right]^{2}\right)^{\frac{1}{2}} \tag{21}$$

$$M_{K} = \min_{x_{i}\in K} y_{i}\left(\frac{x_{i}^{T}\beta + \beta_{0}}{\sigma_{y_{i},\beta}}\right)$$

$$\frac{M_{A}}{\sigma_{A,\beta}} = \frac{M_{B}}{\sigma_{B,\beta}}$$

$$\min_{\beta,\,\beta_{0}} \|\beta\| \quad \text{subject to } y_{i}\left(\frac{x_{i}^{T}\beta + \beta_{0}}{\sigma_{y_{i},\beta}}\right) \geq 1,\quad i = 1,\dots,N$$

$$\zeta_{i} = \max\left(0,\ 1 - y_{i}\left(\frac{x_{i}^{T}\beta + \beta_{0}}{\sigma_{y_{i},\beta}}\right)\right)$$

$$\min_{\beta,\,\beta_{0}} \frac{1}{2}\|\beta\|^{2} + C\sum_{i=1}^{N}\zeta_{i} \quad \text{subject to } \zeta_{i} \geq 0,\ y_{i}\left(\frac{x_{i}^{T}\beta + \beta_{0}}{\sigma_{y_{i},\beta}}\right) \geq 1 - \zeta_{i}\ \forall i$$

$$L_{P} = \frac{1}{2}\|\beta\|^{2} + C\sum_{i=1}^{N}\zeta_{i} - \sum_{i=1}^{N}\alpha_{i}\left[y_{i}\sigma_{y_{i},\beta}^{-1}(x_{i}^{T}\beta + \beta_{0}) - (1 - \zeta_{i})\right] - \sum_{i=1}^{N}\mu_{i}\zeta_{i}$$

$$0 = \sum_{i=1}^{N}\alpha_{i}y_{i}\sigma_{y_{i},\beta}^{-1}, \qquad \alpha_{i} = C - \mu_{i}\ \forall i$$

$$\begin{aligned}
0 &= \nabla_{\beta}L_{P} \\
&= \nabla_{\beta}\left(\frac{1}{2}\|\beta\|^{2} - \sum_{i=1}^{N}(\alpha_{i}y_{i})(\sigma_{y_{i},\beta}^{-1})(x_{i}^{T}\beta + \beta_{0})\right) \\
&= \beta - \sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}\sigma_{y_{i},\beta}^{-1} + \sum_{i=1}^{N}(\alpha_{i}y_{i})(\sigma_{y_{i},\beta}^{-2})(x_{i}^{T}\beta + \beta_{0})(\nabla_{\beta}\sigma_{y_{i},\beta})
\end{aligned}$$

$$\nabla_{\beta}\left((x_{j} - \overline{x})\cdot\left(\frac{\beta}{\|\beta\|}\right)\right) = (x_{j} - \overline{x})\cdot\left(\frac{\|\beta\|^{2} - \beta\circ\beta}{\|\beta\|^{3}}\right)$$

$$0 = \beta - \sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}\sigma_{y_{i},\beta}^{-1} + \sum_{i=1}^{N}\alpha_{i}y_{i}\sigma_{y_{i},\beta}^{-3}(x_{i}^{T}\beta + \beta_{0})\sum_{j:\,y_{j}=y_{i}}\left[(x_{j} - \overline{x})\cdot\left(\frac{\beta}{\|\beta\|}\right)\right]\left[(x_{j} - \overline{x})\left(\frac{\overrightarrow{\mathbf{1}}\|\beta\|^{2} - \beta\circ\beta}{\|\beta\|^{3}}\right)\right]$$

Taxonomy

Topics: Imbalanced Data Classification Techniques · Face and Expression Recognition · Advanced Statistical Methods and Models

Methods: Support Vector Machine

Full text

A Modified Construction for a Support Vector Machine to Accommodate Class Imbalances

Matt Parker, Colin Parker

Abstract

Given a training set with binary classification, the Support Vector Machine identifies the hyperplane maximizing the margin between the two classes of training data. This general formulation is useful in that it can be applied without regard to variance differences between the classes. Ignoring these differences is not optimal, however, as the general SVM will give the class with lower variance an unjustifiably wide berth. This increases the chance of misclassification of the other class and results in an overall loss of predictive performance. An alternate construction is proposed in which the margins of the separating hyperplane are different for each class, each proportional to the standard deviation of its class along the direction perpendicular to the hyperplane. The construction agrees with the SVM in the case of equal class variances. This paper will then examine the impact to the dual representation of the modified constraint equations.

1 A Recap: The Classical SVM Construction

For Section 1, we follow the construction given by Hastie, Tibshirani, and Friedman in The Elements of Statistical Learning [1]. We will parallel this approach in Section 2 when constructing the alternate method.

Suppose we have training data consisting of pairs of observations and labels, $(x_{i},y_{i})$ , for $i=1,...,N,$ with $x_{i}\in\mathbb{R}^{p}$ and $y_{i}\in\{-1,1\}$ . We may define a hyperplane by:

$$\{x : f(x) = x^{T}\beta + \beta_{0} = 0\}$$

where $\beta$ is a vector perpendicular to the hyperplane. An associated classification rule is induced by:

$$G(x) = \operatorname{sign}\left[x^{T}\beta + \beta_{0}\right]$$
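The classification rule can be sketched in a few lines of Python. This is only an illustration: the function names `decision_value` and `classify`, the example hyperplane, and the choice to send points lying exactly on the hyperplane to class $+1$ are assumptions, not part of the paper.

```python
def decision_value(x, beta, beta0):
    """f(x) = x^T beta + beta_0 for one observation x (a list of floats)."""
    return sum(xk * bk for xk, bk in zip(x, beta)) + beta0


def classify(x, beta, beta0):
    """G(x) = sign(f(x)); ties on the hyperplane go to +1 (an assumption)."""
    return 1 if decision_value(x, beta, beta0) >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 1 = 0 in R^2.
beta, beta0 = [1.0, 1.0], -1.0
labels = [classify(x, beta, beta0) for x in ([2.0, 2.0], [0.0, 0.0])]
print(labels)  # [1, -1]
```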

For a linearly separable dataset, the goal of finding a separating hyperplane that maximizes the margin $M$, the minimum perpendicular distance from the hyperplane to a datapoint of either class, can be formalized as:

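$$\max_{\beta,\,\beta_{0},\,\|\beta\|=1} M \quad \text{subject to } y_{i}(x_{i}^{T}\beta + \beta_{0}) \geq M,\quad i = 1,\dots,N$$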

This can be more conveniently rephrased by removing the requirement that $\beta$ be a unit vector, and setting $M=\frac{1}{\|\beta\|}$ :

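$$\min_{\beta,\,\beta_{0}} \|\beta\| \quad \text{subject to } y_{i}(x_{i}^{T}\beta + \beta_{0}) \geq 1,\quad i = 1,\dots,N$$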
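The rescaling step can be spelled out as follows. This is the standard argument written as a short worked derivation, not text taken from the paper. For any $\beta \neq 0$, the signed distance from $x_{i}$ to the hyperplane is $y_{i}(x_{i}^{T}\beta+\beta_{0})/\|\beta\|$, so without the unit-norm requirement the margin constraint reads:

$$\frac{y_{i}(x_{i}^{T}\beta + \beta_{0})}{\|\beta\|} \geq M \quad\Longleftrightarrow\quad y_{i}(x_{i}^{T}\beta + \beta_{0}) \geq M\|\beta\|$$

Because $(\beta,\beta_{0})$ and $(c\beta,c\beta_{0})$ describe the same hyperplane for every $c>0$, the scale is free and may be fixed by $\|\beta\| = 1/M$:

$$y_{i}(x_{i}^{T}\beta + \beta_{0}) \geq 1, \qquad \max M = \max \frac{1}{\|\beta\|} \quad\Longleftrightarrow\quad \min \|\beta\|$$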

Now define slack variables $\zeta_{i},i=1,...,N$ by

$$\zeta_{i} = \max\left(0,\ 1 - y_{i}(x_{i}^{T}\beta + \beta_{0})\right)$$

This gives us a framework to relax the assumption of linear separability. Each slack variable measures, as a proportion of the margin, how far its point falls on the wrong side of its margin; misclassifications occur when $\zeta_{i}>1$ . We may control the amount of slack by imposing the additional condition:

$$\sum_{i=1}^{N} \zeta_{i} \leq \text{constant}$$

for some constant. This is computationally equivalent to the following expression:

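$$\min_{\beta,\,\beta_{0}} \frac{1}{2}\|\beta\|^{2} + C\sum_{i=1}^{N}\zeta_{i} \quad \text{subject to } \zeta_{i} \geq 0,\ y_{i}(x_{i}^{T}\beta + \beta_{0}) \geq 1 - \zeta_{i}\ \forall i$$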
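As a rough illustration of the slack variables and the penalized objective, the sketch below evaluates both for a fixed candidate hyperplane. The helper names `slack` and `soft_margin_objective` and the three one-dimensional data points are hypothetical; the code evaluates the objective only and does not perform the minimization.

```python
def slack(x, y, beta, beta0):
    """zeta_i = max(0, 1 - y_i (x_i^T beta + beta_0))."""
    f = sum(xk * bk for xk, bk in zip(x, beta)) + beta0
    return max(0.0, 1.0 - y * f)


def soft_margin_objective(X, Y, beta, beta0, C):
    """(1/2) ||beta||^2 + C * sum_i zeta_i for a fixed (beta, beta_0)."""
    norm_sq = sum(b * b for b in beta)
    total_slack = sum(slack(x, y, beta, beta0) for x, y in zip(X, Y))
    return 0.5 * norm_sq + C * total_slack


# Hypothetical 1-D data: the middle point sits inside the margin.
X = [[2.0], [0.5], [-1.0]]
Y = [1, 1, -1]
beta, beta0, C = [1.0], 0.0, 10.0

slacks = [slack(x, y, beta, beta0) for x, y in zip(X, Y)]
objective = soft_margin_objective(X, Y, beta, beta0, C)
print(slacks, objective)  # [0.0, 0.5, 0.0] 5.5
```

The point at $x=0.5$ is correctly classified but lies halfway inside the margin, so its slack is $0.5$ rather than a value above $1$.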

where the parameter $C$ replaces the constant in the previous expression. The corresponding Lagrange primal function is given by:

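$$L_{P} = \frac{1}{2}\|\beta\|^{2} + C\sum_{i=1}^{N}\zeta_{i} - \sum_{i=1}^{N}\alpha_{i}\left[y_{i}(x_{i}^{T}\beta + \beta_{0}) - (1 - \zeta_{i})\right] - \sum_{i=1}^{N}\mu_{i}\zeta_{i}$$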

which is to be minimized with respect to $\beta,\beta_{0}$ , and $\zeta_{i}$ . Setting the respective derivatives equal to zero, we get the equations:

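$$\beta = \sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}, \qquad 0 = \sum_{i=1}^{N}\alpha_{i}y_{i}, \qquad \alpha_{i} = C - \mu_{i}\ \forall i$$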

and positivity constraints $\alpha_{i},\mu_{i},\zeta_{i}\geq 0\ \forall i$ . Substituting the above three equations into the Lagrange primal function yields the Wolfe dual, given by:

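$$L_{D} = \sum_{i=1}^{N}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{N}\sum_{i'=1}^{N}\alpha_{i}\alpha_{i'}y_{i}y_{i'}\langle x_{i}, x_{i'}\rangle$$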

In addition, the Karush-Kuhn-Tucker conditions yield:

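$$\alpha_{i}\left[y_{i}(x_{i}^{T}\beta + \beta_{0}) - (1 - \zeta_{i})\right] = 0, \qquad \mu_{i}\zeta_{i} = 0, \qquad y_{i}(x_{i}^{T}\beta + \beta_{0}) - (1 - \zeta_{i}) \geq 0$$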

for $i=1,...,N$ . Together, these equations uniquely characterize the solution to the dual problem.
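To make the dual concrete, the following sketch recovers $\beta$ from a set of multipliers and evaluates the Wolfe dual objective on a two-point toy problem. The helper names and the data are hypothetical, and the multipliers $\alpha = (0.5, 0.5)$ were worked out by hand for this toy case (where $\beta = 1$, $\beta_{0} = 0$ makes both constraints tight); no solver is involved.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def beta_from_alpha(alpha, X, Y):
    """Stationarity in beta: beta = sum_i alpha_i y_i x_i."""
    p = len(X[0])
    return [sum(a * y * x[k] for a, y, x in zip(alpha, Y, X)) for k in range(p)]


def wolfe_dual(alpha, X, Y):
    """L_D = sum_i alpha_i - (1/2) sum_{i,i'} alpha_i alpha_i' y_i y_i' <x_i, x_i'>."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * Y[i] * Y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


# Hypothetical separable toy problem with one point per class.
X = [[1.0], [-1.0]]
Y = [1, -1]
alpha = [0.5, 0.5]  # hand-derived for this example only

beta = beta_from_alpha(alpha, X, Y)
dual_value = wolfe_dual(alpha, X, Y)
balance = sum(a * y for a, y in zip(alpha, Y))  # should vanish: sum_i alpha_i y_i = 0
primal_value = 0.5 * dot(beta, beta)
print(beta, dual_value, primal_value)  # [1.0] 0.5 0.5
```

In this example the dual value equals the primal value $\frac{1}{2}\|\beta\|^{2}$, as expected at the optimum of a separable problem with no slack.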

2 A Modified Approach: Accommodating Difference in Class Variance

The original construction of the SVM for linearly separable data has the goal of maximizing the margin $M=\frac{1}{\|\beta\|}$ . When the class variances differ noticeably in the direction of $\beta$ (perpendicular to our separating hyperplane), the SVM positions the decision boundary closer to the class with larger variance (say, class A) than would be optimal. The new construction accommodates these class imbalances by increasing the margin of the class of greater variance.

It will be useful at this point to define a few terms. For class $K$ , element $x_{j}\in K$ , and separating hyperplane $\{x:x^{T}\beta+\beta_{0}=0\}$ , define $\sigma_{K,\beta}=\sigma_{y_{j},\beta}$ to be the standard deviation of elements of class $K$ in the direction of $\beta$ :

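$$\sigma_{K,\beta} = \sigma_{y_{j},\beta} = \left(\sum_{j:\,y_{j}=y_{i}}\left[(x_{j} - \overline{x})\cdot\left(\frac{\beta}{\|\beta\|}\right)\right]^{2}\right)^{\frac{1}{2}} \tag{21}$$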

and, for class $K$ and arbitrary hyperplane $\{x:x^{T}\beta+\beta_{0}=0\}$ , define the margin of class $K$ to be:

$$M_{K} = \min_{x_{i}\in K} y_{i}\left(\frac{x_{i}^{T}\beta + \beta_{0}}{\sigma_{y_{i},\beta}}\right)$$
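The two definitions above can be sketched directly. The function names `class_sigma` and `class_margin` and the sample points are hypothetical. Two assumptions are made where the text is silent: $\overline{x}$ is taken to be the mean of class $K$, and, as in the displayed formula, the sum of squares is not divided by the class size.

```python
import math


def class_sigma(X_K, beta):
    """sigma_{K,beta}: spread of class K along the unit normal beta/||beta||."""
    norm = math.sqrt(sum(b * b for b in beta))
    unit = [b / norm for b in beta]
    n, p = len(X_K), len(beta)
    mean = [sum(x[k] for x in X_K) / n for k in range(p)]  # assumed: class mean
    proj = [sum((x[k] - mean[k]) * unit[k] for k in range(p)) for x in X_K]
    return math.sqrt(sum(t * t for t in proj))  # no 1/n factor, as in the text


def class_margin(X_K, y_K, beta, beta0):
    """M_K = min over x_i in K of y_i (x_i^T beta + beta_0) / sigma_{K,beta}."""
    s = class_sigma(X_K, beta)
    return min(
        y_K * (sum(xk * bk for xk, bk in zip(x, beta)) + beta0) / s for x in X_K
    )


# Hypothetical class with label +1; only the first coordinate matters for beta.
X_A = [[1.0, 0.0], [1.0, 4.0], [3.0, -2.0], [3.0, 7.0]]
beta, beta0 = [2.0, 0.0], 0.0

sigma_A = class_sigma(X_A, beta)
M_A = class_margin(X_A, 1, beta, beta0)
print(sigma_A, M_A)
```

Here the projections onto the unit normal are $1, 1, 3, 3$, so $\sigma_{A,\beta} = 2$, and the smallest value of $x_{i}^{T}\beta+\beta_{0}$ is $2$, giving $M_{A} = 1$. Because $\sigma_{K,\beta}$ depends on the direction of $\beta$, it has to be recomputed whenever $\beta$ changes.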

We now seek the separating hyperplane which maximizes $\min_{K}M_{K}$ , the minimum margin over all classes. As an aside, the classical SVM construction yields, as a byproduct, the equality $M_{A}=M_{B}$ when separating classes $A$ and $B$ , since the maximum margin is obtained when the separating hyperplane is midway between both classes. Our new construction will yield as a byproduct the equality:

$$\frac{M_{A}}{\sigma_{A,\beta}} = \frac{M_{B}}{\sigma_{B,\beta}}$$

This shows that in the event our classes have equal variance in the direction of $\beta$ , the modified construction coincides with the classical SVM.

3 Examining Implications to Dual Representation

Maximizing $\min_{K}M_{K}$ modifies the optimization problem to the pair of equations:

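$$\min_{\beta,\,\beta_{0}} \|\beta\| \quad \text{subject to } y_{i}\left(\frac{x_{i}^{T}\beta + \beta_{0}}{\sigma_{y_{i},\beta}}\right) \geq 1,\quad i = 1,\dots,N$$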
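A one-dimensional case shows how these constraints move the boundary. The sketch below is an illustration under stated assumptions rather than a general solver: `a` and `b` are the closest points of classes $-1$ and $+1$ with $a<b$, the spreads `sigma_neg` and `sigma_pos` are treated as known constants (in one dimension they do not depend on $\beta$), and both constraints are assumed to be active at the optimum. All names are hypothetical.

```python
def solve_1d(a, b, sigma_neg, sigma_pos):
    """Solve beta*b + beta0 = sigma_pos and beta*a + beta0 = -sigma_neg."""
    beta = (sigma_neg + sigma_pos) / (b - a)
    beta0 = sigma_pos - beta * b
    threshold = -beta0 / beta  # point where beta*x + beta0 = 0
    return beta, beta0, threshold


# Equal spreads: the boundary is the midpoint, as in the classical SVM.
_, _, t_equal = solve_1d(a=0.0, b=4.0, sigma_neg=1.0, sigma_pos=1.0)

# Class +1 is three times as spread out: the boundary moves toward class -1.
beta, beta0, t_unequal = solve_1d(a=0.0, b=4.0, sigma_neg=1.0, sigma_pos=3.0)

print(t_equal, t_unequal)  # 2.0 1.0
```

In the second case the boundary sits at distance $1$ from the low-variance class and distance $3$ from the high-variance class, so each class's distance to the boundary is proportional to its spread.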

Slightly redefining slack variables according to the fraction of the respective margins they span yields:

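$$\zeta_{i} = \max\left(0,\ 1 - y_{i}\left(\frac{x_{i}^{T}\beta + \beta_{0}}{\sigma_{y_{i},\beta}}\right)\right)$$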

and the corresponding modified SVM equations are given by:

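$$\min_{\beta,\,\beta_{0}} \frac{1}{2}\|\beta\|^{2} + C\sum_{i=1}^{N}\zeta_{i} \quad \text{subject to } \zeta_{i} \geq 0,\ y_{i}\left(\frac{x_{i}^{T}\beta + \beta_{0}}{\sigma_{y_{i},\beta}}\right) \geq 1 - \zeta_{i}\ \forall i$$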
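The modified objective can be evaluated for a fixed candidate hyperplane as sketched below. The function names and data are hypothetical, $\overline{x}$ is assumed to be the class mean, and the code only evaluates the objective; it does not attempt the minimization, whose solvability the paper leaves to later work.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def class_sigmas(X, Y, beta):
    """Map each label to sigma_{label,beta}; xbar is ASSUMED to be the class mean."""
    norm = math.sqrt(dot(beta, beta))
    sigmas = {}
    for label in set(Y):
        proj = [dot(x, beta) / norm for x, y in zip(X, Y) if y == label]
        mean = sum(proj) / len(proj)
        sigmas[label] = math.sqrt(sum((t - mean) ** 2 for t in proj))
    return sigmas


def modified_objective(X, Y, beta, beta0, C):
    """(1/2)||beta||^2 + C * sum_i zeta_i with variance-scaled slacks."""
    sigmas = class_sigmas(X, Y, beta)
    slacks = [
        max(0.0, 1.0 - y * (dot(x, beta) + beta0) / sigmas[y])
        for x, y in zip(X, Y)
    ]
    return 0.5 * dot(beta, beta) + C * sum(slacks), slacks


# Hypothetical 1-D data: class -1 has spread 2, class +1 has spread 6.
X = [[-4.0], [-4.0], [-2.0], [-2.0], [2.0], [2.0], [8.0], [8.0]]
Y = [-1, -1, -1, -1, 1, 1, 1, 1]
value, slacks = modified_objective(X, Y, beta=[2.0], beta0=0.0, C=3.0)
print(value, slacks)
```

With the boundary at the midpoint, the two points of the wider class at $x=2$ each incur slack $1/3$ while no point of the narrow class does, which is the asymmetry the modified constraints are designed to penalize.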

We can now formulate the corresponding Lagrangian (primal) function as:

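$$L_{P} = \frac{1}{2}\|\beta\|^{2} + C\sum_{i=1}^{N}\zeta_{i} - \sum_{i=1}^{N}\alpha_{i}\left[y_{i}\sigma_{y_{i},\beta}^{-1}(x_{i}^{T}\beta + \beta_{0}) - (1 - \zeta_{i})\right] - \sum_{i=1}^{N}\mu_{i}\zeta_{i}$$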

which we again minimize with respect to $\beta,\beta_{0}$ , and $\zeta_{i}$ . Setting derivatives with respect to $\beta_{0}$ and $\zeta_{i}$ equal to zero, we get similar results:

$$0 = \sum_{i=1}^{N}\alpha_{i}y_{i}\sigma_{y_{i},\beta}^{-1}, \qquad \alpha_{i} = C - \mu_{i}\ \forall i$$

and a slightly more complex equation when doing the same with respect to $\beta$ :

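$$\begin{aligned}
0 &= \nabla_{\beta}L_{P} \\
&= \nabla_{\beta}\left(\frac{1}{2}\|\beta\|^{2} - \sum_{i=1}^{N}(\alpha_{i}y_{i})(\sigma_{y_{i},\beta}^{-1})(x_{i}^{T}\beta + \beta_{0})\right) \\
&= \beta - \sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}\sigma_{y_{i},\beta}^{-1} + \sum_{i=1}^{N}(\alpha_{i}y_{i})(\sigma_{y_{i},\beta}^{-2})(x_{i}^{T}\beta + \beta_{0})(\nabla_{\beta}\sigma_{y_{i},\beta})
\end{aligned}$$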

Expanding $\sigma_{y_{i},\beta}$ using its representation in (21), we may use the fact

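$$\nabla_{\beta}\left((x_{j} - \overline{x})\cdot\left(\frac{\beta}{\|\beta\|}\right)\right) = (x_{j} - \overline{x})\cdot\left(\frac{\|\beta\|^{2} - \beta\circ\beta}{\|\beta\|^{3}}\right)$$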

where $\circ$ is the Hadamard product, to obtain:

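$$0 = \beta - \sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}\sigma_{y_{i},\beta}^{-1} + \sum_{i=1}^{N}\alpha_{i}y_{i}\sigma_{y_{i},\beta}^{-3}(x_{i}^{T}\beta + \beta_{0})\sum_{j:\,y_{j}=y_{i}}\left[(x_{j} - \overline{x})\cdot\left(\frac{\beta}{\|\beta\|}\right)\right]\left[(x_{j} - \overline{x})\left(\frac{\overrightarrow{\mathbf{1}}\|\beta\|^{2} - \beta\circ\beta}{\|\beta\|^{3}}\right)\right]$$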

where $\overrightarrow{\mathbf{1}}$ is the vector of ones [1, … , 1].
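Since the stationarity condition hinges on $\nabla_{\beta}\sigma_{y_{i},\beta}$ , a numerical check of that gradient may be useful when experimenting. The sketch below is an assumption-laden illustration, not the paper's method: it takes already-centered rows $v_{j} = x_{j}-\overline{x}$ , and it differentiates with the full chain rule, $\nabla_{\beta}\left(v\cdot\beta/\|\beta\|\right) = \left(\|\beta\|^{2}v - (v\cdot\beta)\beta\right)/\|\beta\|^{3}$ , which carries cross-coordinate terms in addition to the elementwise term written with $\circ$ above. The finite-difference comparison is what the sketch relies on; all names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigma(V, beta):
    """sqrt(sum_j (v_j . beta/||beta||)^2) for centered rows v_j = x_j - xbar."""
    norm = math.sqrt(dot(beta, beta))
    return math.sqrt(sum((dot(v, beta) / norm) ** 2 for v in V))


def grad_sigma(V, beta):
    """Analytic gradient of sigma with respect to beta (full chain rule)."""
    norm = math.sqrt(dot(beta, beta))
    u = [b / norm for b in beta]
    s = sigma(V, beta)
    grad = [0.0] * len(beta)
    for v in V:
        t = dot(v, u)  # projection of v onto the unit normal
        for k in range(len(beta)):
            # d/d beta_k of (v . u) equals (v_k - t * u_k) / ||beta||
            grad[k] += t * (v[k] - t * u[k]) / (norm * s)
    return grad


def numeric_grad(V, beta, h=1e-6):
    """Central finite differences, used only as a cross-check."""
    grad = []
    for k in range(len(beta)):
        up, down = list(beta), list(beta)
        up[k] += h
        down[k] -= h
        grad.append((sigma(V, up) - sigma(V, down)) / (2 * h))
    return grad


# Hypothetical centered deviations (rows sum to zero) and normal vector.
V = [[1.0, 2.0], [-2.0, 0.5], [1.0, -2.5]]
beta = [1.0, 2.0]

analytic = grad_sigma(V, beta)
numeric = numeric_grad(V, beta)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, max_gap)
```

One consequence worth noting is that $\sigma_{K,\beta}$ is unchanged when $\beta$ is rescaled, so its gradient is orthogonal to $\beta$ ; `dot(analytic, beta)` should therefore be zero up to rounding.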

This gives us a working representation of the equivalent dual optimization equations under the new construction. A forthcoming paper will examine the general solvability of the above in light of the other constraint equations, as well as the consequences for the kernelizability of the method. We will also examine in depth the circumstances in which our alternate construction outperforms a traditional Support Vector Classifier, and attempt to quantify them.

Bibliography

[1] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, New York, 2009.
[2] Andrew Ng. CS 229 Lecture Notes. [ http://cs229.stanford.edu/notes/cs229-notes3.pdf ]
[3] Robert Gunn. Support Vector Machines for Classification and Regression. Technical Report for University of Southampton, Southampton, England, 1998.