Robust Support Vector Data Description with Truncated Loss Function for Outliers Depression

Huakun Chen; Yongxi Lyu; Jingping Shi; Weiguo Zhang

PMC · DOI:10.3390/e26080628·July 25, 2024

Robust Support Vector Data Description with Truncated Loss Function for Outliers Depression

Huakun Chen, Yongxi Lyu, Jingping Shi, Weiguo Zhang

PDF

Open Access

TL;DR

This paper improves the SVDD model for anomaly detection by introducing a truncated loss function framework, making it more robust to outliers and mislabeled data.

Contribution

A novel truncated loss function framework is introduced for SVDD, along with three new truncated loss functions and a fast ADMM algorithm for solving them.

Findings

01

The proposed truncated loss functions show superior robustness against outliers in synthetic and real-world datasets.

02

The fast ADMM algorithm efficiently solves various truncated loss functions and ensures convergence.

03

The new SVDD models outperform existing models in terms of generalization and noise handling.

Abstract

Support vector data description (SVDD) is widely regarded as an effective technique for addressing anomaly detection problems. However, its performance can significantly deteriorate when the training data are affected by outliers or mislabeled observations. This study introduces a universal truncated loss function framework into the SVDD model to enhance its robustness and employs the fast alternating direction method of multipliers (ADMM) algorithm to solve various truncated loss functions. Moreover, the convergence of the fast ADMM algorithm is analyzed theoretically. Within this framework, we developed the truncated generalized ramp, truncated binary cross entropy, and truncated linear exponential loss functions for SVDD. We conducted extensive experiments on synthetic and real-world datasets to validate the effectiveness of these three SVDD models in handling data with different…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases1

Depression

Figures4

Click any figure to enlarge with its caption.

Funding4

—National Natural Science Foundation of China
—Natural Science Foundation of Shaanxi Province
—Aeronautical Science Foundation of China
—Shaanxi Province Key Laboratory of Flight Control and Simulation Technology

Keywords

SVDDtruncated loss functionfast ADMMproximal operatorsanomaly detectiontruncated binary cross entropy loss functiontruncated linear exponential loss function

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAnomaly Detection Techniques and Applications · Fault Detection and Control Systems · Advanced Statistical Methods and Models

Full text

1. Introduction

Anomaly detection refers to the identification of data points in a dataset that deviate from normal behavior. These deviations are known as anomalies or outliers in various application domains. This mode of detection is extensively used in real-world settings, including credit card detection, insurance detection, cybersecurity intrusion detection, error detection in security systems, and military activity monitoring [1,2]. However, the acquisition of anomaly data in practical applications, such as medical diagnostics, machine malfunction detection, and circuit quality inspection, is expensive [3,4]. Consequently, there is significant interest in one-class classification (OCC) problems, where training samples only include normal data (also known as target data) or normal data with a small number of anomalies (also referred to as non-target data) [5,6,7]. In this context, it is important to define the following terms:

Normal data: Normal data refers to data points that conform to the characteristics and behavior patterns of the majority of data points within a dataset. They represent the normal operating state of a system or process.Anomalies: Anomalies are data points that significantly deviate from normal patterns and usually reflect actual problems or critical events in the system.Outliers: Outliers are data points that are significantly different from other data points in the dataset, which may be due to natural fluctuations, special circumstances, or noise.Noise: Noise refers to irregular, random errors or fluctuations, usually caused by measurement errors or data entry mistakes, and does not reflect the actual state of the system.

Support vector data description (SVDD) is a method extensively used for one-class classifications (OCCs) [8]. The core idea of SVDD is to construct a hyper-sphere of minimal volume that encompasses all (or most) of the target class samples. Data points inside the hyper-sphere are considered normal, while those outside are considered anomalies. SVDD can be easily integrated with popular kernel methods or some deep neural network models [7], making it highly scalable and flexible. Due to these attractive features, SVDD has garnered significant attention and has been extensively developed. SVDD is regarded as an effective and excellent technique for anomaly detection problems; however, it remains sensitive to outliers and noise present in training datasets. In real-world scenarios, various issues, such as instrument failure, formatting errors, and unrepresentative sampling, result in datasets with anomalies, which degrade the performance of the SVDD [9,10].

The existing methods for mitigating the impact of noise are typically categorized as follows:

In order to reduce the impact of outliers on the OCC method, researchers have attempted to remove outliers through data preprocessing methods. Stanley Fong used methods such as cluster analysis to remove anomalies from the training set to achieve a robust classifier [11]. Breunig et al. tried to assign an outlier score to each sample in the dataset by estimating its local density, known as LOF [12]. Zheng et al. used LOF to filter raw samples and remove outliers [13]. Khan et al. [14] and Andreou and Karathanassi [15] calculated the interquartile range (IQR) of training samples, which provides a method for indicating a boundary beyond which samples are marked as outliers and removed. Clustering (k-means, DBSCAN) or LOF methods can significantly reduce noise points in data preprocessing, thereby improving the quality and effectiveness of subsequent model training. For datasets with obvious noise and significant distribution characteristics, preprocessing methods can be very effective in enhancing model performance. However, preprocessing methods have several issues: they require additional computational resources and time, especially with large datasets, potentially making the preprocessing step time-consuming. Moreover, there is a risk of overfitting and inadvertently deleting useful normal data points, impacting the model’s ability to accurately detect anomalies. Additionally, these preprocessing methods are sensitive to parameter settings, necessitating careful selection to achieve satisfactory results.

Zhao et al. proposed the dynamic radius SVDD method that accounts for hyper-sphere radius information and the existing data distribution [16,17]. This approach achieves a more flexible decision boundary by assigning different radii to different samples. Moreover, the framework for the dynamic radius SVDD method is based on the traditional SVDD framework. However, if the traditional SVDD has not undergone adequate training, the good performance of these dynamic approaches will be difficult to guarantee.

Density-weighted SVDD, position-weighted SVDD, Stahel–Donoho outlier-weighted SVDD, global plus local joint-weighted SVDD, and confidence-weighted SVDD are examples of weighted methods [18,19,20,21,22,23,24]. These methods assign smaller weights to sparse data, which are commonly outliers, thus excluding them from the sphere. These methods balance the target class data and outliers in the training phase, thus enhancing the classification performance, especially when the data are contaminated by outliers. However, when the number of outliers in the dataset increases and they form sparse clusters, the number of outliers might surpass that of normal samples. In such cases, weighted methods assign higher weights to the outliers and lower weights to the normal samples, leading to decreased algorithm performance.

The convex property of the hinge loss function of the SVDD algorithm makes it sensitive to outliers. To address this issue, Xing et al. proposed a new robust least squares one-class support vector machine (OCSVM) that employs a bounded, non-convex entropic loss function, instead of the unbounded convex quadratic loss function used in traditional least squares OCSVM [25]. The non-convex nature of the ramp loss function makes this model more robust than the traditional OCSVM [26]. Tian et al. introduced the ramp loss function to the traditional OCSVM to create the Ramp-OCSVM model [27], and the non-convex nature of this model makes it more robust than the traditional OCSVM. Xing et al. enhanced the robustness of the OCSVM by introducing a re-scaled hinge loss function [28]. Additionally, Zhong et al. proposed a new robust SVDD method, called pinball loss SVDD [29], to perform OCC tasks when the data are contaminated by outliers. The pinball loss function ensures minimal dispersion at the center of the sphere, thus creating a tighter decision boundary. Recently, Zheng introduced a mixed exponential loss function to the design of the SVDD model, enhancing its robustness and making its implementation easier [30].

Extensive research has shown that, as a result of unbounded convex loss functions being sensitive to anomalies, loss functions with boundedness or bounded influence functions are more robust to the influence of outliers. To address this issue, researchers introduced an upper limit to unbounded loss functions, effectively preventing them from increasing beyond a certain point. The truncated loss function thus makes the SVDD model more robust. The advantages of truncated loss functions include:

Robustness to noise: Truncated loss functions can enhance the model’s robustness and stability by limiting the impact of outliers without removing data points.Reduction of error propagation: In anomaly detection tasks, outliers may significantly contribute to the loss function, leading to error propagation and model instability. Truncated loss functions can effectively reduce error propagation caused by outliers, thereby improving overall model performance.Generalization ability: Using truncated loss functions can prevent the model from overfitting to outliers, enhancing the model’s generalization ability. Truncated loss functions are well-suited for various types of datasets and noise conditions, particularly when noise is not obvious or easily detectable.

However, robust SVDD algorithms still face considerable challenges in the research. They are designed to address specific types of losses and lack an appropriate framework for constructing robust loss functions. Thus, researchers are required to learn how to use different algorithms and modify loss functions before use. Since truncated loss functions are often non-differentiable, methods such as the difference of convex algorithm (DCA) [31] and concave–convex procedures (CCCPs) [32,33] are commonly employed to provide solutions. For some truncated loss functions, the DCA cannot ensure straightforward decompositions or the direct use of comprehensive convex toolboxes, potentially increasing development and maintenance costs [34]. At present, no unified framework exists in the literature for the design of robust loss functions or a unified optimization algorithm. Therefore, even though this is challenging, providing a new bounded strategy for the SVDD model is crucial, with the potential for developing more efficient and universally applicable solutions.

In response to the several issues previously outlined, this study proposes a universal framework for the truncated loss functions of the SVDD model. To address and solve the non-differentiable, non-convex optimization problem introduced by the truncated loss function, we employ the fast ADMM algorithm. Our contributions to this field of study are as follows:

We define a universal truncated loss framework that smoothly and adaptively binds loss functions, while preserving their symmetry and sparsity.To solve different truncated loss functions, we propose the use of a unified proximal operator algorithm.We introduce a fast ADMM algorithm to handle any truncated loss function within a unified scheme.We implement the proposed robust SVDD model for various datasets with different noise intensities. The experimental results for real datasets show that the proposed model exhibits superior resistance to outliers and noise compared to more traditional methods.

The remainder of this paper is organized as follows:

Section 2: We review related support vector data description (SVDD) models, providing a foundational understanding of the existing methodologies and their limitations.

Section 3: We propose a general framework for truncated loss functions. Within this framework, we examine the representative loss functions’ proximal operators and present a universal algorithm for solving these proximal operators.

Section 4: This section introduces the SVDD model that utilizes the truncated loss function, detailing its structure and theoretical framework.

Section 5: A new algorithm for solving the SVDD model with truncated loss functions is presented. This section also includes an analysis of the algorithm’s convergence properties, ensuring that the method is both robust and reliable.

Section 6: Numerical experiments and parameter analysis are conducted to validate the effectiveness of the proposed model. This section provides empirical evidence of the model’s performance across various datasets and noise scenarios.

Section 7: The conclusion summarizes the findings and contributions of the study, and discusses potential future research directions.

2. Related Works

SVDD has been widely applied in anomaly detection, and numerous learning algorithms based on SVDD have been proposed. In this section, we provide a brief overview of these algorithms.

2.1. SVDD

The goal of SVDD is to discover a hyper-sphere that encompasses target samples, while excluding non-target samples located outside it [8]. The objective function of SVDD is represented by the following equation:

[eqn]

where $[eqn]$ and $[eqn]$ represent the radius and center of the hyper-sphere, $[eqn]$ is a regularization parameter, and $[eqn]$ is a slack variable. By using Lagrange multipliers and incorporating the constraints into the objective function, the dual problem of Equation (1) can be expressed as:

[eqn]

where $[eqn]$ is the Lagrange multiplier, the optimization problem in Equation (2) is a standard quadratic programming problem, and $[eqn]$ can be obtained using quadratic programming algorithms. $[eqn]$ and $[eqn]$ can be calculated using the following equation:

[eqn]

[eqn]

where $[eqn]$ represents support vectors. The decision function for the test sample $[eqn]$ is presented as follows:

[eqn]

If $[eqn]$ , then sample $[eqn]$ belongs to the target class; otherwise, $[eqn]$ is considered a non-target class sample.

2.2. Robust SVDD Variants

Due to its sensitivity to outliers, the classification performance of SVDD significantly deteriorates when the data are contaminated. Thus, to enhance the robustness of SVDD, various improved SVDD methods have been proposed over the past decades.

2.2.1. Weighted SVDDs

One common approach is the use of weighted SVDD methods, where different slack variables are assigned different weights [18,19,20,21,22,23,24]. Although the specific methods for weight distribution vary, they can generally be represented in a unified form:

[eqn]

where $[eqn]$ represents pre-calculated weights. The density weighting method permits these weights to be calculated as follows [20]:

[eqn]

where $[eqn]$ denotes the k-nearest neighbor value of $[eqn]$ , and $[eqn]$ denotes the Euclidean distance between $[eqn]$ and $[eqn]$ . Due to outliers typically being located in relatively low-density areas, the distance to their neighboring samples is greater when compared to normal samples, which results in their smaller weights. The R-SVDD algorithm constructs weights by introducing a local density to each data point based on truncated distances [19].

[eqn]

where $[eqn]$ denotes the distance between $[eqn]$ and $[eqn]$ , and $[eqn]$ is the truncation distance. The calculation of the weight function is as follows:

[eqn]

Other methods for calculating can refer to [21,22,23,24]. The dual problem of Equation (6) is represented by the following equation:

[eqn]

From this equation, it can be observed that each Lagrange multiplier has an upper limit, denoted as $[eqn]$ .

2.2.2. Pinball Loss SVDD

The pinball loss SVDD (Pin-SVDD) modifies the hinge loss SVDD by replacing its loss function with the pinball loss function to create an optimized problem formulation [29]. This modification enables a more robust handling of outliers by adjusting the sensitivity toward deviations, depending on their direction relative to the decision boundary.

[eqn]

where $[eqn]$ is a constant. With the use of the Lagrange multipliers method, the dual optimization problem in Equation (11) can be expressed as follows:

[eqn]

where $[eqn]$ , $[eqn]$ , and $[eqn]$ . Zhong demonstrated that the use of a pinball to minimize scattering at the center of the sphere enhances the robustness of the developed model.

2.2.3. SVDD with Mixed Exponential Loss Function

Zheng used a mixed exponential loss function to design a classification model, and the optimization problem of the method is presented by the following equation [30]:

[eqn]

where $[eqn]$ is a mixed exponential function of $[eqn]$ , expressed as follows:

[eqn]

where $[eqn]$ and $[eqn]$ are two scale parameters, and $[eqn]$ is a mixture parameter used to balance the contributions of the two exponential functions. The mixed exponential loss function highlights the importance of samples involved in the target class while reducing the influence of samples that are outliers. This approach significantly enhances the robustness of the SVDD model. This loss function achieves more accurate and stable anomaly detection results in various settings by preserving the integrity of the target class and diminishing the effect of potential anomalies.

3. Truncated Loss Function

SVDD models with unbounded loss functions can achieve satisfactory results when addressing scenarios lacking noise. However, the continual growth of these loss functions results in the collapse of the model when it is subjected to noise. Therefore, truncating the SVDD model’s loss function makes it more robust. The general definition of a truncated loss function is as follows:

[eqn]

where $[eqn]$ is a constant, and $[eqn]$ is an unbounded loss function, such that, when $[eqn]$ , $[eqn]$ . Since $[eqn]$ is an abstract function, a general form of the truncated loss function includes several loss functions. The three specific truncated loss functions we created in our study are present as follows:

Truncated generalized ramp loss function: $[eqn]$ , where $[eqn]$ , $[eqn]$ .Truncated binary cross entropy loss function: $[eqn]$ , where $[eqn]$ , $[eqn]$ .Truncated linear exponential loss function: $[eqn]$ , where $[eqn]$ , $[eqn]$ , and $[eqn]$ .

Assuming the truncation point $[eqn]$ , the mathematical properties of the three truncated loss functions presented above can be summarized as follows:

For samples with $[eqn]$ , the loss value is 0; for samples with $[eqn]$ , the loss value is $[eqn]$ . Thus, the general truncated loss function exhibits sparsity and robustness to outliers. $[eqn]$ and $[eqn]$ are truncated concave loss functions, which are non-differentiable at $[eqn]$ . $[eqn]$ is a truncated convex loss function, which is non-differentiable at $[eqn]$ and differentiable at $[eqn]$ . $[eqn]$ and $[eqn]$ exhibit explicit expressions for the proximal operators, while $[eqn]$ does not.

In the next section, we provide explicit expressions for the proximal operators of $[eqn]$ and $[eqn]$ .

3.1. Proximal Operators of Truncated Loss Functions

Definition 1 (Proximal Operator [35]). Assume $[eqn]$ *: $[eqn]$ *

is a proper lower-semi-continuous loss function. The expression for the proximal operator of $[eqn]$ *
at $[eqn]$ *
is defined as follows:*

[eqn]

when $[eqn]$ is a convex loss function, it presents a single-value proximal operator; when $[eqn]$ is a non-convex loss function, it exhibits a multi-value proximal operator.

Lemma 1 ([36]). *When $[eqn]$ * *, let $[eqn]$ * , $[eqn]$ . The expressions for the proximal operators are as follows:

[eqn]

$[eqn]$ .

Lemma 2. The explicit expression of the $[eqn]$

proximal operator is as follows:*

1. *When * $[eqn]$ *, the explicit expression of the * $[eqn]$

proximal operator is presented as follows:*

[eqn]

2. *When * $[eqn]$ *, the explicit expression of the * $[eqn]$

proximal operator is as follows: *

[eqn]

Proof of Lemma 1. Equation (16) exhibits that $[eqn]$ is a local minimum of the following piecewise function:

[eqn]

The minima of the piecewise functions $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ are located at $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ , with minimum values of $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ , respectively.Since $[eqn]$ , it follows that $[eqn]$ . If $[eqn]$ , then $[eqn]$ . When $[eqn]$ and $[eqn]$ is in the interval $[eqn]$ , $[eqn]$ .When $[eqn]$ , the following conclusion can be reached by comparing the values of $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ .

(1.1)Since $[eqn]$ , we achieve $[eqn]$ , which means $[eqn]$ .
(1.2)Since $[eqn]$ , we obtain $[eqn]$ , which means $[eqn]$ or $[eqn]$ .
(1.3)Since $[eqn]$ , we achieve $[eqn]$ , which means $[eqn]$ .
(1.4)Since $[eqn]$ , we obtain $[eqn]$ , which means $[eqn]$ .
(1.5)Since $[eqn]$ , we achieve $[eqn]$ , which means $[eqn]$ .

According to (1.1)–(1.5), Equation (18) can be derived.When $[eqn]$ , the following conclusion can be reached by comparing the values of $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ .

(2.1)As $[eqn]$ , we obtain $[eqn]$ , which means $[eqn]$ .
(2.2)As $[eqn]$ , we obtain $[eqn]$ , which either means $[eqn]$ or $[eqn]$ .
(2.3)As $[eqn]$ , we obtain $[eqn]$ , which means $[eqn]$ .
(2.4)As $[eqn]$ , we obtain $[eqn]$ , which means $[eqn]$ .

According to (2.1)–(2.4), Equation (19) can be derived. □

Lemma 3. The expression representing the proximal operator of the truncation function is as follows:

[eqn]

*where * $[eqn]$ *, * $[eqn]$ *, * $[eqn]$ *, and * $[eqn]$

represent the minimizers of the piecewise function, and * $[eqn]$ *, * $[eqn]$ *, * $[eqn]$ *, and * $[eqn]$ , represent the minimal values of the piecewise function.

Proof of Lemma 3. It can be deduced from Equations (15) and (16) that $[eqn]$ represents the local minimum of the following piecewise function:

[eqn]

Let $[eqn]$ ; the minimizers of the piecewise functions are $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ , and their minimal values are $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ , respectively. From $[eqn]$ , it follows that $[eqn]$ .When $[eqn]$ , the stage function’s minimal values are $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ ; when $[eqn]$ , the stage function’s minimal values are $[eqn]$ , $[eqn]$ , and $[eqn]$ . Thus, we can observe that $[eqn]$ . If $[eqn]$ , then $[eqn]$ ; similarly, if $[eqn]$ , then $[eqn]$ . We can determine the following conclusions by comparing the values of $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ :

(1.1)When the conditions of $[eqn]$ , $[eqn]$ , and $[eqn]$ are met, and it follows that $[eqn]$ ;
(1.2)When the condition of $[eqn]$ is met, if $[eqn]$ is true, then $[eqn]$ or $[eqn]$ can be derived;
(1.3)When the condition of $[eqn]$ is met, if $[eqn]$ is true, then $[eqn]$ or $[eqn]$ can be derived;
(1.4)When the conditions of $[eqn]$ , $[eqn]$ , and $[eqn]$ are met, it follows that $[eqn]$ ;
(1.5)When the conditions of $[eqn]$ , $[eqn]$ , and $[eqn]$ are met, it follows that $[eqn]$ ;
(1.6)When the condition of $[eqn]$ is met, it follows that $[eqn]$ .

Thus, it is possible to successfully derive Equation (20). □

3.2. The Use of the Proximal Operator Algorithm to Solve Truncated Loss Functions

When $[eqn]$ in the truncated loss function is a monotonic and non-piecewise function, and $[eqn]$ can be expressed explicitly, the proximal operator of the truncated loss function can be calculated using Formula (20). In practical applications, however, it is sometimes impossible to obtain the explicit expression for $[eqn]$ ; for example, $[eqn]$ does not provide an explicit expression. The calculation of the proximal operator in such scenarios is discussed below.

For $[eqn]$ , if it is smooth and has a second derivative, the problem is a smooth unconstrained optimization problem. Newton’s method is used to solve for the minimum of $[eqn]$ in unconstrained optimization problems due to its high convergence rate. The gradient and Hessian matrix for problem (16) can be expressed as follows:

[eqn]

[eqn]

The minimizer $[eqn]$ and the minimal value $[eqn]$ can be obtained with Newton’s method for $[eqn]$ .

If an explicit expression for $[eqn]$ cannot be achieved, the calculation of the proximal operator follows the same process as Lemma 3. Once the minimizers $[eqn]$ and $[eqn]$ are obtained, the proximal operator can be calculated. When the conditions of $[eqn]$ , $[eqn]$ , and $[eqn]$ are met, we can derive $[eqn]$ . Therefore, Formula (20) can be modified to express the following:

[eqn]

Based on the analysis presented above, the algorithm for solving the proximal operator of the truncated loss function is as following Algorithm 1: Algorithm 1: Algorithm for solving the proximal operator of the truncated loss function $[eqn]$ $[eqn]$ $[eqn]$ $[eqn]$ . $[eqn]$ do $[eqn]$ according to Formula (21). $[eqn]$ $[eqn]$ is obtained. $[eqn]$ according to Formula (22). $[eqn]$ . $[eqn]$ . $[eqn]$ .9: End $[eqn]$ . $[eqn]$ according to Formula (23).

4. Robust SVDD Model

Formula (1), for the SVDD formula, can be rewritten as follows:

[eqn]

where $[eqn]$ represents the hinge loss function. Since the hinge loss function is sensitive to outliers, it can be replaced with the truncated loss function from Formula (15). Thus, the objective function for obtaining the robust SVDD model is as follows:

[eqn]

As the truncated loss function is non-differentiable, solving the objective function of the robust SVDD model is a non-convex optimization problem, and it cannot be solved using standard SVDD model methods.

Theorem 1 (Nonparametric Representation Theorem [37]). *Suppose we are designated a non-empty set $[eqn]$ * *; a positive definite real-valued kernel *

$[eqn]$ **: * $[eqn]$ *; a training sample * $[eqn]$ *; a strictly monotonically increasing real-valued function * $[eqn]$ * on * $[eqn]$ *; an arbitrary cost function * $[eqn]$ *: * $[eqn]$ *; and a class of functions: *

[eqn]

In this scenario, $[eqn]$ represents the norm in RKHS, $[eqn]$ associated with $[eqn]$ , i.e., for any $[eqn]$ .

[eqn]

Then, any $[eqn]$ minimizing the regularized risk function

[eqn]

admits a representation of $[eqn]$ , where $[eqn]$ represents coefficients of $[eqn]$ in RKHS $[eqn]$ .

A set of vectors $[eqn]$ exists in the nonparametric representation theorem, where the center $[eqn]$ is the optimal solution for problem (25). Therefore, Formula (25) can be transformed into the following:

[eqn]

Formula (28) represents the single-class SVDD model. To obtain data that include negative samples, these samples must be integrated into the SVDD model; then, the center is $[eqn]$ , and the objective function of the robust SVDD model is as follows:

[eqn]

when $[eqn]$ , it follows that $[eqn]$ . Problem (29) is rewritten as the following matrix form:

[eqn]

where $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ . This study discusses the use of the SVDD model as a solution for addressing data with negative samples, and the Lagrangian function for problem (30) is as follows:

[eqn]

The KKT conditions for problem (30) are provided below:

[eqn]

where $[eqn]$ represents any KKT point.

The generalized non-smooth optimization problem can be represented by the following Formula [38]:

[eqn]

where $[eqn]$ and $[eqn]$ are continuously differentiable functions on $[eqn]$ , and $[eqn]$ represents a non-smooth function on $[eqn]$ . Problem (31) is considered as a form of the aforementioned generalized non-smooth optimization problem, where $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ , representing the optimization model’s constraints, are nonlinear equality constraints. In this study, the fast ADMM algorithm was employed to solve problem (33), and the algorithm will be introduced in a subsequent chapter.

5. Fast ADMM Algorithm

The previous section presented the optimization mathematical model of the robust SVDD model. Since the optimization mathematical model includes nonlinear equality constraints, the fast ADMM method was used to solve Formula (33). The augmented Lagrangian function of the robust SVDD model is as follows:

[eqn]

where $[eqn]$ represents the vector of Lagrange multipliers, and $[eqn]$ represents the penalty factor, $[eqn]$ .

5.1. Fast ADMM Algorithm Framework

Based on the augmented Lagrangian function previously presented, we obtained the following framework for the fast ADMM algorithm:

[eqn]

[eqn]

[eqn]

[eqn]

1. Computing $[eqn]$

The solution of $[eqn]$ in Formula (35) is equivalent to the following problem:

[eqn]

where $[eqn]$ . Therefore, the calculation of $[eqn]$ can be transformed into the computation of the proximal operator, which can be determined using Formula (20) or Algorithm 1.

2. **Computing ** $[eqn]$

When solving for variable $[eqn]$ , if a closed-form solution for this subproblem is not obtained, an optimization algorithm must be used for the iterative solution, which results in a slow computational speed. Thus, to solve this problem in a more efficient manner, we used a linearization technique.

Given the convex differentiable function $[eqn]$ , the Bregman distance between $[eqn]$ and $[eqn]$ , $[eqn]$ is defined as follows:

[eqn]

When $[eqn]$ , the Bregman distance $[eqn]$ is as follows:

[eqn]

Formula (36) is equivalent to the following:

[eqn]

With the use of Formula (42), we obtain

[eqn]

where $[eqn]$ .

3. Computing $[eqn]$

If $[eqn]$ , based on Formula (40), we obtain

[eqn]

Formula (37) is equivalent to

[eqn]

Formula (45) represents a convex quadratic programming problem. By solving the following Equation (46), we also solve Formula (45):

[eqn]

The value of $[eqn]$ is directly updated using the following formula:

[eqn]

4. Computing $[eqn]$

$[eqn]$ can be calculated using Formula (38). When $[eqn]$ , the Lagrange multiplier is removed.

Base on above analysis, the framework of our method can be summarized in Algorithm 2. Algorithm 2: Fast ADMM Algorithm $[eqn]$ $[eqn]$ $[eqn]$ $[eqn]$ $[eqn]$ . $[eqn]$ perform the following $[eqn]$ according to Formula (43). $[eqn]$ according to Formula (47). $[eqn]$ has an explicit expression, use Formula (20) for the $[eqn]$ $[eqn]$ does not have an explicit expression, use $[eqn]$ $[eqn]$ according to Formula (38). $[eqn]$ .9: End10: $[eqn]$ .

5.2. Global Convergence Analysis of the Fast ADMM Algorithm

In this section, we provide a convergence analysis of the fast ADMM algorithm. Specifically, the convergence of the algorithm is discussed using Lemma 7, and Lemma 7 is proven. According to Formulas (35)–(38), it can be observed that the optimality conditions for each update iteration of the fast ADMM algorithm can be written as follows:

[eqn]

where $[eqn]$ .

Lemma 4. *Assume * $[eqn]$

is a sequence generated using the fast ADMM algorithm; then, for any * $[eqn]$ *, we obtain the following: *

[eqn]

*where * $[eqn]$

represents the strictly positive minimum eigenvalue of * $[eqn]$ .

Proof of Lemma 4. According to the optimality conditions of the iteration, the following equation is relevant:

[eqn]

According to the Cauchy–Schwarz inequality, it follows that

[eqn]

Thus,

[eqn]

□

Lemma 5. *Assume * $[eqn]$

is a sequence generated using the fast ADMM algorithm; then, for any * $[eqn]$ *, the following is true: *

[eqn]

Proof of Lemma 5. From the definition of the augmented Lagrangian function presented in Formula (34), it can be argued that

[eqn]

Formula (48) shows the following:

[eqn]

Therefore,

[eqn]

By substituting Formula (53) into (51), the following is achieved:

[eqn]

Thus, the proof is completed. □

Lemma 6. *Assume * $[eqn]$

is a sequence generated using the fast ADMM algorithm; then, for * $[eqn]$ *, we obtain the following: *

[eqn]

Proof of Lemma 6. The definition of the augmented Lagrangian function in Formula (34) shows the following:

[eqn]

According to the first-order condition for convex functions, and since $[eqn]$ is a convex function, we obtain the following:

[eqn]

The substitution of Formula (56) yields the following:

[eqn]

Thus, the proof is complete. □

Lemma 7. *Assume * $[eqn]$

is a sequence generated using the fast ADMM algorithm; if * $[eqn]$
and * $[eqn]$ *, then for any * $[eqn]$ *, we obtain the following: *

[eqn]

Proof of Lemma 7. The definition of the augmented Lagrangian function in Formula (36) shows the following:

[eqn]

Since $[eqn]$ is a global minimum with respect to the variable $[eqn]$ , then

[eqn]

The combination of Lemmas 5 and 6 produces the following:

[eqn]

When $[eqn]$ and $[eqn]$ , it follows that

[eqn]

When the conditions of $[eqn]$ and $[eqn]$ are met, according to Lemma 7, $[eqn]$ monotonically decreases, thus causing the fast ADMM algorithm to converge. □

5.3. Fast ADMM Algorithm Termination Conditions

Based on Formula (35), it can be observed that the $[eqn]$ that minimizes $[eqn]$ can be derived as follows:

[eqn]

where

[eqn]

Thus, we obtain the following equation:

[eqn]

Equation (60) represents the residual value of the dual feasibility condition. Hence, $[eqn]$ represents the dual residual of iteration.

[eqn]

[eqn]

[eqn]

Equations (61)–(63) represent the primal residuals of iteration. The KKT conditions of problem (32) consist of four components, corresponding to the primal and dual residuals. These two types of residuals gradually converge to zero with the use of the fast ADMM algorithm.

Both the primal and dual residuals must be sufficiently small, meeting the conditions presented below, for the termination of the use of the fast ADMM algorithm:

[eqn]

where $[eqn]$ . Typically, $[eqn]$ is selected so that $[eqn]$ ; in this study, $[eqn]$ . This ensures that the algorithm terminates when the solution’s accuracy is within an acceptable range.

6. Experiment

In this section, we describe extensive experiments conducted on various datasets to validate the effectiveness and robustness of the three truncated loss function SVDD algorithms proposed in this study, which are presented below. The experimental datasets consisted of synthetic and several UCI datasets. We also compared hinge loss SVDD [8], DW-SVDD [20], R-SVDD [19], and GL-SVDD [16] algorithms to validate their performance.

$[eqn]$ -SVDD: SVDD algorithm with a truncated generalized ramp loss function; $[eqn]$ -SVDD: SVDD algorithm with a truncated binary cross entropy loss function; $[eqn]$ -SVDD: SVDD algorithm with a truncated linear exponential loss function.

6.1. Experimental Setup

This subsection introduces the evaluation metrics, kernels, and parameter settings used in the experiments.

6.1.1. Evaluation Metrics

We used G-mean and $[eqn]$ -score as evaluation metrics to assess the performance of the different methods described in this study [30,39,40,41]. They are based on TP, FN, FP, and TN, where TP denotes the number of target data predicted as target data, FN denotes the number of target data predicted as outlier data, FP denotes the number of outlier data predicted as target data, and TN denotes the number of outlier data predicted as outlier data.

[eqn]

[eqn]

where $[eqn]$ , $[eqn]$ , and $[eqn]$ . It can be seen from (65) and (66) that G-mean can provide a good balance between recall and specificity, and $[eqn]$ -score provides a good balance between precision and recall. Hence, both of them can measure the performance of the proposed method, comprehensively.

6.1.2. Kernels

The Gaussian kernel function has the best mapping ability and is the most practical method, commonly used to handle the classification problems of nonlinear data, and we used it when conducting our experiments. The definition of the Gaussian kernel function is as follows:

[eqn]

6.1.3. Parameter Configuration

To ensure a fair comparison, the parameters for each method were selected using the grid search method. For all methods, the regularization parameter $[eqn]$ was selected from $[eqn]$ . Arin Chaudhuri [42] has demonstrated that there exists a $[eqn]$ in the interval $[eqn]$ that maximizes the SVDD optimization objective function. Therefore, the range for the kernel parameter $[eqn]$ can be set as: $[eqn]$ $[eqn]$ .

The DW-SVDD and GL-SVDD algorithms locate the k-nearest neighbors in the feature space, with the parameter $[eqn]$ selected from $[eqn]$ . All of the truncated loss function SVDD algorithms select the parameter $[eqn]$ from $[eqn]$ , $[eqn]$ -SVDD selects parameter $[eqn]$ from $[eqn]$ , and $[eqn]$ -SVDD selects parameter $[eqn]$ from $[eqn]$ . The parameter $[eqn]$ for $[eqn]$ -SVDD is selected from $[eqn]$ .

6.2. Synthetic Datasets with Noise

In this study, we constructed two synthetic datasets: circular and banana-shaped, to test the robustness of the three proposed truncated loss function support vector data description (SVDD) algorithms. To verify the robustness of the algorithms in handling different types of noise, we introduced two types of noise:

Neighboring noise: Noise points are randomly distributed near the normal samples but do not completely overlap with the normal samples, forming a more discrete distribution characteristic. This noise simulates the common boundary ambiguity in practical applications.Regional noise: Noise points are randomly distributed within a specified area, forming sparse clusters. This noise simulates possible local anomalies in practical applications.

The description of the synthetic dataset is given below:

Circular Dataset

Normal samples: This consists of 300 two-dimensional sample points distributed within a concentric circle, forming a normal distribution pattern.
Noise samples: This consists of 40 noise points, divided into two types: 20 noise points randomly distributed near the normal sample points, showing a discrete distribution characteristic, and another 20 noise points randomly distributed within a specified area, forming sparse clusters.
Illustration: Figure 1a shows the circular dataset, where blue points represent normal samples, and red points represent noise.

Banana-shaped Dataset

Normal samples: This consists of 300 two-dimensional sample points distributed along curved lines, resembling a banana shape.
Noise samples: This consists of 40 noise points, divided into two types: 10 noise points randomly distributed near the banana-shaped normal sample points, and another 30 noise points randomly distributed within a specified area, forming sparse clusters.
Illustration: Figure 1b shows the banana-shaped dataset, where blue points represent normal samples, and red points represent noise samples.

By choosing these noise distributions and quantities, we aimed to simulate different real-world noise scenarios and test the robustness of the proposed algorithms under varying noise conditions.

For circular outliers, the classification results of SVDD, DW-SVDD, DBSCAN, $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD algorithms for this dataset are presented in Figure 2. We can observe that the SVDD algorithm accurately identifies seven noise points as outliers, while misclassifying thirty-three noise points as normal values. Therefore, the performance of SVDD is severely affected by noise. Unlike SVDD, DW-SVDD accurately identified 22 noise points as outliers, while misclassifying 18 clustered noise points as normal values. DW-SVDD is based on a weighted idea where outliers are commonly sparse data; hence, a smaller weight is designated to these sparse data, which are then excluded from the sphere. The weighted method struggles to exclude clustered noise since 20 clustered noise points are present in the circular outliers. DBSCAN accurately identified the 20 sparsely distributed noise points but incorrectly classified the 20 clustered noise points as normal values. Therefore, DBSCAN cannot accurately identify clustered noise. The $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD algorithms successfully identified all of the noise points, and their classification boundaries encompassed all of the target samples.

The classification results of the SVDD, DW-SVDD, DBSCAN, $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD algorithms for the banana-shaped dataset are presented in Figure 3. It can be observed that SVDD only identifies two noise points, while misclassifying the remaining noise points as normal values. DW-SVDD accurately differentiated nine noise points distributed around the banana data, while misclassifying the remaining noise points as normal values. $[eqn]$ -SVDD accurately identified 35 noise points as outliers, while misclassifying the remaining noise points as normal values, and erroneously classifying 16 target values as outliers. $[eqn]$ -SVDD accurately identified 36 noise points as outliers, while misclassifying the remaining noise points as normal values, and erroneously classifying 14 target values as outliers. $[eqn]$ -SVDD accurately identified 38 noise points as outliers, while misclassifying the remaining noise points as normal values, and erroneously classifying 13 target values as outliers.

From the experiments on the two synthetic datasets, it can be seen that the DBSCAN algorithm is very effective in handling obvious noise, but it cannot accurately identify noise when the noise is less obvious or close to normal data. DW-SVDD and DBSCAN methods cannot accurately identify sparse clustered noise, whereas $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD can effectively identify clustered noise in the dataset. Figure 1 and Figure 2 show that the classification boundaries of $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD are tighter and smoother than SVDD and DW-SVDD methods. Therefore, it can be determined that the three truncated loss function SVDD algorithms proposed in this study are more robust to noise in both synthetic datasets.

6.3. UCI Datasets with the Presence of Noise

To further validate the effectiveness of the proposed method, we used eight standard datasets obtained from the UCI machine learning repository to conduct our experiments [43]. For each standard dataset, one class of samples was used as normal data, and the remaining classes were used as outlier data. Moreover, to eliminate the impact of data scale, each feature of every dataset was normalized to $[eqn]$ . Grid parameter optimization was used to determine the optimal parameters in all of the experiments, and the average value of ten repetitions of each experiment was the final result. For each dataset, parts of the target and non-target data were randomly selected for the training set, and the remaining target and non-target data were selected for the test set.

6.3.1. Training Dataset without Non-Target Data

For each dataset, 70% of the normal data were randomly selected for the training set, and noise-contaminated data were added to the training set. Noise data were created by changing the labels of non-target data from −1 to 1, with added noise data proportions of 10%, 20%, 30%, 40%, and 50% non-target data, and the test set consisted of the remaining target and non-target data.

Table 1 presents the G-mean averages for different methods across 10 trials, with the best results presented in bold. In 40 experimental datasets, $[eqn]$ -SVDD and $[eqn]$ -SVDD achieved higher G-means on 35 datasets compared to benchmark methods, while $[eqn]$ -SVDD achieved higher G-means on 34 datasets. As the proportion of noise-contaminated data increases, i.e., more data are contaminated by noise, the classification accuracy of all models generally declines. It can be observed that, as the proportion of noise-contaminated data increases, the $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD algorithms show less of a decline in accuracy compared to the other algorithms. Taking the Iris dataset as an example, when the proportion of noise-contaminated data $[eqn]$ increases from 0.1 to 0.5, the G-means of $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD decline by 13.02%, 13.02%, and 11.5%, while those of the DW-SVDD, R-SVDD, and GLE-SVDD decline by 19.55%, 16.6%, and 17.7%, respectively.

Table 2 shows the $[eqn]$ -score averages of different methods over 10 trials. In the 40 experimental datasets, $[eqn]$ -SVDD and $[eqn]$ -SVDD achieved higher $[eqn]$ -scores than the compared methods in 32 datasets, while $[eqn]$ -SVDD had higher $[eqn]$ -scores in 31 datasets. $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD demonstrate higher classification accuracy in most test datasets compared to current noise-resistant SVDD models (i.e., DW-SVDD, R-SVDD, and GL-SVDD), indicating that the use of a truncated loss function makes SVDD less sensitive to noise.

6.3.2. Training Datasets with Non-Target Data

For each dataset, the training set was randomly selected to consist of 70% target and 20% non-target data, with added proportions of 10%, 20%, 30%, 40%, and 50% non-target data as noise data, and the test set included the remaining target and non-target data.

Table 3 presents the G-means from 10 trials for different methods, with the best results presented in bold. In 40 experimental datasets, $[eqn]$ -SVDD and $[eqn]$ -SVDD achieved higher G-means on 35 datasets compared to the benchmark methods, while $[eqn]$ -SVDD achieved a higher G-mean on 33 datasets. As the data contaminated by noise increase, the $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD algorithms exhibit less of a decline in accuracy than the other algorithms. $[eqn]$ -SVDD, $[eqn]$ -SVDD, and $[eqn]$ -SVDD present almost no decline in G-means for datasets such as Blood, Iris, and Haberman.

Table 4 shows the $[eqn]$ -score averages of different methods over 10 trials. In the 40 experimental datasets, $[eqn]$ -SVDD achieved higher $[eqn]$ -scores than the compared methods in 35 datasets, $[eqn]$ -SVDD performed better in 32 datasets, and $[eqn]$ -SVDD in 31 datasets. From Table 1, Table 2, Table 3 and Table 4, it is evident that -SVDD, -SVDD, and -SVDD achieve better experimental performance than the other four methods, demonstrating superior noise resistance due to the use of the truncated loss function.

7. Conclusions

This study aimed to enhance the robustness and effectiveness of the SVDD algorithm. We propose a general framework for the truncated loss function of this algorithm, which uses bounded loss functions to mitigate the impact of outliers. Due to the non-differentiability of the truncated loss function, we employed the fast ADMM algorithm to solve the SVDD model with the truncated loss function, which handles truncated loss functions within a unified framework. In this context, the truncated generalized ramp, truncated binary cross entropy, and truncated linear exponential loss functions for the SVDD algorithm were constructed, and extensive experiments show that these three SVDD models exhibit more robustness than other SVDD models in most cases. However, this method still has the following shortcomings. Firstly, introducing the truncated loss function increases the complexity of model training, as some truncated loss functions cannot directly provide explicit expressions for neighboring point operators, requiring additional computational overhead. These factors may limit the application of the method to large-scale datasets. To overcome these limitations, future work can consider using a distributed computing framework to accelerate the training process of the ADMM algorithm. Secondly, the truncated loss function SVDD introduces new free parameters, which increases the time required for grid search parameter selection. When the data scale is large, the computation time for the grid search method may become unacceptable. To address this drawback, algorithms such as Bayesian optimization can be considered in the future to find the optimal parameters, further improving model performance and optimization efficiency. Finally, for extremely noisy data, the truncated loss function may not completely eliminate its impact, and the effect is limited. In this case, methods combining clustering algorithms such as DBSCAN can be adopted. First, clustering algorithms like DBSCAN can be used to preprocess the data and remove noise, and then the proposed method can be used to detect anomalies.

Bibliography43

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Chandola V. Banerjee A. Kumar V. Anomaly detection: A survey ACM Comput. Surv.2009411510.1145/1541880.1541882 · doi ↗
2Pimentel M.A. Clifton D.A. Clifton L. Tarassenko L. A review of novelty detection Signal Process.20149921524910.1016/j.sigpro.2013.12.026PMC 396345724683434 · doi ↗ · pubmed ↗
3Lei Y. Yang B. Jiang X. Jia F. Li N. Nandi A.K. Applications of machine learning to machine fault diagnosis: A review and roadmap Mech. Syst. Signal Process.202013810658710.1016/j.ymssp.2019.106587 · doi ↗
4Hasani R. Wang G. Grosu R. A machine learning suite for machine components’ health-monitoring Proceedings of the 33rd AAAI Conference on Artificial Intelligence Honolulu, HI, USA 27 January–1 February 201994729477
5Khan S.S. Madden M.G. One-class classification: Taxonomy of study and review of techniques Knowl. Eng. Rev.20142934537410.1017/S 026988891300043 X · doi ↗
6Khan S.S. Madden M.G. A survey of recent trends in one class classification Proceedings of the 20th Annual Irish Conference on Artificial Intelligence and Cognitive Science Dublin, Ireland 19–21 August 200918819710.1007/978-3-642-17080-5_21 · doi ↗
7Alam S. Sonbhadra S.K. Agarwal S. Nagabhushan P. One-class support vector classifiers: A survey Knowl.-Based Syst.202019610575410.1016/j.knosys.2020.105754 · doi ↗
8Tax D.M.J. Duin R.P.W. Support vector data description Mach. Learn.200454456610.1023/B:MACH.0000008084.60811.49 · doi ↗