One-class classification with application to forensic analysis

Laura Anderlucci; Francesca Fortunato; Angela Montanari

arXiv:1905.02406·stat.AP·May 8, 2019

One-class classification with application to forensic analysis

Laura Anderlucci, Francesca Fortunato, Angela Montanari

PDF

Open Access

TL;DR

This paper introduces a novel one-class classification method using Gini's transvariation probability, applied to forensic glass analysis to improve the identification of criminal evidence.

Contribution

It proposes the Transvariation-based One-Class Classifier (TOCC), a new approach leveraging transvariation probability for forensic classification tasks.

Findings

01

Effective in distinguishing target glass fragments from non-targets

02

Improves accuracy in forensic glass comparison

03

Offers a new measure of typicality for one-class classification

Abstract

The analysis of broken glass is forensically important to reconstruct the events of a criminal act. In particular, the comparison between the glass fragments found on a suspect (recovered cases) and those collected on the crime scene (control cases) may help the police to correctly identify the offender(s). The forensic issue can be framed as a one-class classification problem. One-class classification is a recently emerging and special classification task, where only one class is fully known (the so-called target class), while information on the others is completely missing. We propose to consider classic Gini's transvariation probability as a measure of typicality, i.e. a measure of resemblance between an observation and a set of well-known objects (the control cases). The aim of the proposed Transvariation-based One-Class Classifier (TOCC) is to identify the best boundary around the…

Equations26

L R = \frac{f ( R I , N a ^{'} , M g ^{'} , A l ^{'} , S i ^{'} , K ^{'} , C a ^{'} , B a ^{'} , F e ^{'} ∣ H _{0} )}{f ( R I , N a ^{'} , M g ^{'} , A l ^{'} , S i ^{'} , K ^{'} , C a ^{'} , B a ^{'} , F e ^{'} ∣ H _{1} )} .

L R = \frac{f ( R I , N a ^{'} , M g ^{'} , A l ^{'} , S i ^{'} , K ^{'} , C a ^{'} , B a ^{'} , F e ^{'} ∣ H _{0} )}{f ( R I , N a ^{'} , M g ^{'} , A l ^{'} , S i ^{'} , K ^{'} , C a ^{'} , B a ^{'} , F e ^{'} ∣ H _{1} )} .

τ = \frac{s _{X} + \frac{s _{X}^{'}}{2}}{n},

τ = \frac{s _{X} + \frac{s _{X}^{'}}{2}}{n},

tp (c) = \frac{τ}{( 1/2 )} = 2 \frac{s _{X} + \frac{s _{X}^{'}}{2}}{n} .

tp (c) = \frac{τ}{( 1/2 )} = 2 \frac{s _{X} + \frac{s _{X}^{'}}{2}}{n} .

τ_{f} = min [F (c), 1 - F (c)],

τ_{f} = min [F (c), 1 - F (c)],

t p_{f} (c) = \frac{τ _{f}}{( 1/2 )} = 2 \cdot {F (c) 1 - F (c) m_{X} \geq c m_{X} < c .

t p_{f} (c) = \frac{τ _{f}}{( 1/2 )} = 2 \cdot {F (c) 1 - F (c) m_{X} \geq c m_{X} < c .

τ = \frac{s _{X} + \frac{s _{X}^{'}}{2}}{n},

τ = \frac{s _{X} + \frac{s _{X}^{'}}{2}}{n},

m_{X} = (m_{1}, \dots, m_{p})

m_{X} = (m_{1}, \dots, m_{p})

tp (c) = \frac{s _{X} + \frac{s _{X}^{'}}{2}}{s _{Y} + \frac{s _{Y}^{'}}{2}} .

tp (c) = \frac{s _{X} + \frac{s _{X}^{'}}{2}}{s _{Y} + \frac{s _{Y}^{'}}{2}} .

t p_{f} (c) = \frac{\int _{a_{x_{1}} M}^{b_{x_{1}} M} \dots \int _{a_{x_{p}} M}^{b_{x_{p}} M} f ( x ) d x}{\int _{a_{M x_{1}}}^{b_{M x_{1}}} \dots \int _{a_{M x_{p}}}^{b_{M x_{p}}} f ( x ) d x}

t p_{f} (c) = \frac{\int _{a_{x_{1}} M}^{b_{x_{1}} M} \dots \int _{a_{x_{p}} M}^{b_{x_{p}} M} f ( x ) d x}{\int _{a_{M x_{1}}}^{b_{M x_{1}}} \dots \int _{a_{M x_{p}}}^{b_{M x_{p}}} f ( x ) d x}

tp (c) = u \prod tp (c_{u}) u = 1, \dots, p,

tp (c) = u \prod tp (c_{u}) u = 1, \dots, p,

tp (z) \geq t [t p_{f} (z) \geq t] .

tp (z) \geq t [t p_{f} (z) \geq t] .

C I_{u i} = q = 1 \sum d \frac{∣ a _{u q i} ∣ s _{u}}{\sum _{z = 1}^{p} ( a _{u z i} s _{u} ) ^{2}}

C I_{u i} = q = 1 \sum d \frac{∣ a _{u q i} ∣ s _{u}}{\sum _{z = 1}^{p} ( a _{u z i} s _{u} ) ^{2}}

\mbox V I P_{u} =_{i = 1, \dots, B_{1}} median C I_{u i} .

\mbox V I P_{u} =_{i = 1, \dots, B_{1}} median C I_{u i} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAnomaly Detection Techniques and Applications · Imbalanced Data Classification Techniques · Image and Object Detection Techniques

Full text

One-class classification with application to forensic analysis

Laura Anderlucci, Francesca Fortunato, Angela Montanari

Department of Statistical Sciences, University of Bologna, Italy

Abstract

The analysis of broken glass is forensically important to reconstruct the events of a criminal act. In particular, the comparison between the glass fragments found on a suspect (recovered cases) and those collected on the crime scene (control cases) may help the police to correctly identify the offender(s). The forensic issue can be framed as a one-class classification problem. One-class classification is a recently emerging and special classification task, where only one class is fully known (the so-called target class), while information on the others is completely missing. We propose to consider classic Gini’s transvariation probability as a measure of typicality, i.e. a measure of resemblance between an observation and a set of well-known objects (the control cases). The aim of the proposed Transvariation-based One-Class Classifier (TOCC) is to identify the best boundary around the target class, that is, to recognise as many target objects as possible while rejecting all those deviating from this class.

Keywords: one-class classification; transvariation probability.

1 Introduction

Burglaries and crime offences are frequently characterized by the breakage or the damage of some glass. Windows smashed vigorously to force the entry and get access to private places, lamps and bottles used to hit someone or something, glass furnitures and headlamps hurt by accident, car glasses fractured by fired bullets or collisions are just a few examples of how it may happen. As a consequence of these acts, fragments of glass scatter randomly all over the crime scene and on the offenders. In so doing, such fragments become unavoidable trace evidences and, thus, they can help the police to know more about how the crime was committed.

Usually, glass chunks arising from a breakage have a linear dimension smaller than 0.5mm; for this reason, the comparison between different fragments is often made on the basis of some analytical results: the Glass Refractive Index ( $RI$ ), measured by instrumental methods such as m-XRF, LA-ICP-MS, SEM-EDX, and the chemical composition ( $Na$ , $Mg$ , $Al$ , $Si$ , $K$ , $Ca$ , $Ba$ , $Fe$ ), measured by a scanning electron microscope.

The traditional purpose of glass analysis for forensics is to evaluate whether fragments found on the suspect (recovered cases) can be considered from the same source as those from the location at which the offence took place (control cases)(evett1987rule, ).

In the forensic science literature, this issue has been already addressed within a hypothesis testing framework by using a likelihood ratio (LR) test (see aitken2007two, ):

[TABLE]

This requires the estimation of a full model $f(\cdot|\cdot)$ for the two competing hypotheses: $H_{0}$ , the prosecution/null hypothesis that both recovered and control glasses come from the same source, and $H_{1}$ , the defence/alternative proposition that they have different origin. In equation 1 each $\cdot^{\prime}$ refers to the ratio of the elemental concentration to the oxigen, $O$ , one.

The problem of assessing whether the evidence is compatible with the control samples can also be framed as a one-class classification task. In fact, one-class classification methods aim to decide whether an object whose origin is completely unknown belongs to a particular class (the so-called “target” class, which, according to the terminology used before, includes the control cases only). As no information is available on the non-target objects, one-class classification is a difficult classification problem because it has to build a precise descriptive instead of discriminant model of the target class with enough generalisation ability (liu2016modular, ).

In tax2001one a detailed description of the methods for one-class classification tasks are discussed and presented. Several algorithms and methodologies have been proposed in the statistics literature so far. Major approaches can be casted into three groups: density methods, boundary methods and reconstruction methods.

Procedures in the first set estimate the probability density function of the target class $\chi$ , $f(x)$ , with $x\in\chi$ , and set a threshold, $t$ , on the resulting densities; in this way a target and an outlier region can be obtained. The density can be estimated via the most common density estimators: Parzen density estimators (bishop1994novelty, ; tarassenko1995novelty, ), Gaussian models (parra1996statistical, ), mixtures of Gaussians (mclachlan2000finite, ; fraley2002model, ), Kernel Density Estimation (KDE) and histograms (see scott2015multivariate, , for an exhaustive description), $K$ -nearest-neighbors (Knn) estimation (ripley2007pattern, ), just to name a few. These techniques usually work very well, especially when the sample size is sufficiently large and the model assumed to describe the target distribution is appropriate. However, their actual implementation could be limited as the choice of the best model is not trivial and it requires a large number of training objects to overcome the curse of dimensionality.

Boundary methods aim to define the best boundary around the target data, avoiding a demanding estimation of the complete density. Here, the classification issue is performed by evaluating the distance of a given object from the target class and, then, by comparing it with a threshold $t$ ; the latter is directly derived on the distance measures and adjusted to ensure a predefined sensitivity, $s$ , i.e. the proportion of target observations that are correctly identified. Boundary algorithms heavily rely on the distances between observations and, thus, they are very sensitive to the scaling of the features. In this case, although the required sample size is smaller than for density methods, the crucial task lies on the definition of appropriate distance measures. The $K$ -centers algorithm (ypma1998support, ), the $\nu$ Support Vector Classification ( $\nu$ -SVC) of scholkopf2000support and the Support Vector Data Description (SVDD) of tax2004support represent a few examples of such class of methods. In addition to these, procedures based on the concept of data depth can be added to the set (see, among others, dang2010nonparametric, ; chen2009outlier, ; ruts1996computing, ). In fact, statistical depth functions can be exploited to measure the “extremeness” or “outlyingness” of a data point with respect to a given data set as they provide center-outward ordering of multi-dimensional data. In one-class classification issues all the observations that significantly deviate from the data cloud are indeed expected to be more likely characterized by small depth values than large ones. Boundaty algorithms are completely data-driven and avoid strong distributional assumption; in addition, for a low dimensional input space, they provide intuitive visualization of the data set by finding peeling and depth contours (e.g. bagplot, convex hull, …).

Reconstruction methods are based on some assumptions about the data generating process or about the data clustering characteristics and then, describe the objects by using their reconstruction error, $\varepsilon_{reconstr}$ , that is the difference between the fitted and the observed values. Since the underlying model or structure is supposed to well represent the target class, $\varepsilon_{reconstr}$ can be considered as measure of distance from $x$ to this set. Methods in this class have not been primarily derived for one-class classification purposes, but rather to simply model the data; points that do not belong to the target class are expected to be represented worse than true target objects and, therefore, their reconstruction error is supposed to be high. Among the most common reconstruction algorithms, we can find $K$ -means (lloyd1982least, ), the Learning Vector Quantization (LVQ) by carpenter1991art , the Self-Organizing Maps (SOM) by kohonen1998self , Principal Component Analysis (PCA) and mixture of PCAs (tipping1999mixtures, ) and the autoencoders by japkowicz1995novelty .

Recent approaches include deep learning methods, such as deep neural networks, to extract common factors of variations from the data (ruff2018deep, ) and deep support vector machines (erfani2016high, ).

In this paper a novel one-class classification algorithm based on Gini’s transvariation probability as a measure of resemblance is introduced; the proposal can be framed within the context of boundary methods.

The article is organized as follows. Section 2 provides a detailed description of the glass data. In Section 3 a new procedure for one-class classification is introduced and tested in a simulation study. In Section 4, the proposed methodology is applied to the motivating example dataset. A final discussion on the obtained results concludes the paper.

2 Glass data

The glass dataset used in this paper comes from UCI repository and contains $n=138$ glass fragments, whereof 51 containers/tableware/headlamps (non-window) and 87 window (car and building) samples. Since all these observations derive from a crime scene and no fragments from potential offenders are recorded, we decide to use the window set as the target class. In other words, we derive the one-class classification rule on window objects only and we consider the non-window ones to evaluate the rule performances. These fragments are characterised by $p=9$ features: the Refractive Index and the chemical composition of 8 crucial elements, sodium ( $Na$ ), magnesium ( $Mg$ ), aluminium ( $Al$ ), silicon ( $Si$ ), potassium ( $K$ ), calcium ( $Ca$ ), barium ( $Ba$ ) and iron ( $Fe$ ). Each element is normalised to oxygen ( $O$ ) so as to remove any stochastic fluctuation in instrumental measurements. Such features exhibit a moderately high correlation, as shown in Table 1.

In order to evaluate how different the non-window are from the window samples, in Figure 1 we plot the data according to the directions with the lowest variability, i.e. according to the last two principal components computed on the target set; this representation shows that the target class (the triangles) is quite compact, while samples from the outlier one (the circles) are scattered all around.

Figure 2 shows the distributions of the features according to sample type; the variable-wise boxplots do not largely overlap, except for the $RI$ and the presence of silicon. Outlying samples exhibit overall a larger variability compared to the inlying ones.

3 The proposal

As discussed in the previous section, the goal of any one-class classifier is to define a classification rule that accepts as many target objects as possible and rejects all those significantly deviating from this class. The crucial aspect that should be stressed is that one-class algorithms learn the classification rule by using a training set composed of a single class of well-known observations that does not include any anomaly. Therefore, this issue is substantially different from a traditional two-class classification problem, where the aim is to assign data objects to one of two preliminarily defined categories. It also differs from an outlier detection task, where the training set is naturally polluted by deviant observations.

In this work, a new statistical approach for one-class classification based on Gini’s definition of transvariation probability between a group and a constant is proposed. In particular, we refer to the concept of transvariation and to some of its related measures, firstly introduced in a univariate context by gini1916concetto and, subsequently, extended to the multivariate case and to a model-based formulation by gini1943transvariazione and dagum1959transvariazione , respectively.

3.1 Transvariation probability as a measure of resemblance

The transvariation concept has proved to be very useful in the standard classification context as a measure of group separability, especially when the assumptions that justify the optimality of Fisher’s linear discriminant function are not met (montanari2004linear, ). Its applicability can be even extended to the one-class domain, as the definition of transvariation probability seems to perfectly fit the idea of resemblance between an object and a group. Moreover, this concept can be also viewed as a data depth measure, i.e. a measure of how deeply a generic observation lies in the data cloud (tukey1975, ).

According to Gini (gini1916concetto, ),

Definition 1

A group ${g}$ of $n$ units and a constant $c$ are said to transvariate on a variable $X$ , with respect to a generic mean value ${m_{X}}$ if the sign of some of the ${n}$ differences ${x_{i}-c}$ , $i=1,\cdots,n$ , is opposite to that of ${m_{X}-c}$ .

In this definition, the constant $c$ can be seen as the observed value of a degenerate group, that is, a group made of a single unit. Rephrasing such definition in the one-class domain, $c$ becomes the single unit whose resemblance with respect to the target class (namely, with $m_{X}$ ) shall be evaluated.

In order to fully understand what transvariation means, consider as an example, the three different scenarios depicted in Figure 3. In the first two, no transvariation occurs between constant $c$ (the triangle) and the mean value $m_{X}$ (the square) as all the differences $x_{i}-c$ (where $x_{i}$ is any group observation) have the same sign pattern. In the third case, on the contrary, there is evidence of transvariation: there are three points on the right-hand side whose differences with $c$ have opposite sign with respect to that of $m_{X}-c$ .

The probability that an event fulfills Definition 1 is known as transvariability, $\tau$ . $\tau$ is simply the number of transvariations over the number of possible differences,

[TABLE]

where:

$s_{X}$ is the number of units for which $(x_{i}-c)(m_{X}-c)<0$ ;

-

$s^{\prime}_{X}$ is the number of units for which $(x_{i}-c)(m_{X}-c)=0$ ;

-

$n$ is the number of differences $(x_{i}-c)$ .

If we assume $m_{X}$ to be the median (as Gini did), the maximum of $\tau$ , $\tau_{M}$ , is $\frac{1}{2}$ . Therefore, the definition of transvariation probability of a constant $c$ , $tp(c)$ , with respect to a group represented by its median is:

[TABLE]

Values close to 1 reflect a high resemblance of $c$ to the target class.

When the probability density function of the target class is known or can be estimated, an analogous version of transvariability ( $\tau_{f}$ ) that exploits such information can be derived:

[TABLE]

where $F(c)$ is the cumulative distribution function of the target class evaluated in $c$ . Assuming $m_{X}$ to be the median, its maximum is still $\frac{1}{2}$ . The resulting computation of transvariation probability is:

[TABLE]

3.1.1 Extension to the multivariate case

Transvariation probability allows for extensions to more than one variable. Specifically, following gini1943transvariazione , in the multivariate case, the definition of transvariability $\tau$ corresponds to the joint probability that an event fulfills Definition 1:

[TABLE]

where

$s_{\mathbf{X}}$ is the number of units for which $({x}_{iu}-{c_{u}})({m}_{u}-{c_{u}})<0$ for all the variables $u=1,\ldots,p$ ;

-

$s^{\prime}_{\mathbf{X}}$ is the number of units for which $({x}_{iu}-{c_{u}})({m}_{u}-{c_{u}})=0$ for all the variables $u=1,\ldots,p$ ;

-

$n$ is the number of differences $({x}_{iu}-{c_{u}})$ .

If we assume

[TABLE]

to be the multivariate spatial median or mediancentre (mediancentre, ), i.e. ${\mathbf{m}}_{\mathbf{X}}$ is the vector that minimizes $\sum_{n}d({\mathbf{x}},{\mathbf{m}}_{\mathbf{X}})$ , where $d({\mathbf{x}},{\mathbf{m}}_{\mathbf{X}})$ is the distance between ${\mathbf{x}}$ and ${\mathbf{m}}_{\mathbf{X}}$ , the maximum $\tau_{M}$ may no longer be $\frac{1}{2}$ and it needs to be estimated. In particular, $\tau_{M}$ can be computed as $\tau$ in equation 6 on the shifted data ${\mathbf{Y}}={\mathbf{X}}-({\mathbf{m}}_{\mathbf{X}}-{\mathbf{c}})$ . Therefore, the multivariate definition of transvariation probability is:

[TABLE]

Equation 4 can be extended to the multidimensional case as well. Given that $\tau_{M}$ may no longer be $\frac{1}{2}$ , the expression of (5) becomes:

[TABLE]

where $f({\mathbf{x}})$ is the probability density function (pdf) of the target class and

$a_{{\mathbf{x}}_{u}\phantom{M}}=\begin{cases}c_{u}&\mbox{if }c_{u}\geq m_{u}\\ -\infty&\mbox{if }c_{u}<m_{u}\end{cases}$ ,

-

$b_{{\mathbf{x}}_{u}\phantom{M}}=\begin{cases}+\infty&\mbox{if }c_{u}\geq m_{u}\\ c_{u}&\mbox{if }c_{u}<m_{u}\end{cases}$ ,

-

$a_{M{\mathbf{x}}_{u}}=\begin{cases}m_{u}&\mbox{if }c_{u}\geq m_{u}\\ -\infty&\mbox{if }c_{u}<m_{u}\end{cases}$ ,

-

$b_{M{\mathbf{x}}_{u}}=\begin{cases}+\infty&\mbox{if }c_{u}\geq m_{u}\\ m_{u}&\mbox{if }c_{u}<m_{u}\end{cases}$ ,

for $u=1,\dots,p$ . Obviously, when the variables involved in the computation can be assumed to be independent, the multivariate transvariation probability reduces to the product of the simple univariate ones:

[TABLE]

where $tp(c_{u})$ is the univariate marginal transvariation probability corresponding to the $u$ -th variable, computed either by (3) or (5).

3.2 Transvariation-based One-Class Classifier (TOCC)

In this paper, a new one-class classification method based on transvariation probability, called Transvariation-based One-Class Classifier (TOCC), is introduced. In particular, we shall refer to TOCCdf if the transvariation probability is computed according to (7) and thus it is density-free; coherently, we would refer to TOCCdb when considering equation (8), as it is density-based.

The classification rule of the TOCCdf [TOCCdb] is obtained through the following steps:

Set a value, $s$ , as the desired minimum sensitivity of the one-class classifier; 2. 2.

For each unit ${\mathbf{c}}$ compute its transvariation probability $tp(\mathbf{c})$ [ $tp_{f}(\mathbf{c})$ ] with respect to the target group median, ${\mathbf{m}}_{\mathbf{X}}$ ; 3. 3.

Use the $s-th$ percentile of the distribution of transvariation probabilities as a threshold, ${t}$ , for the one-class classifier.

For a new test sample ${\mathbf{z}}$ , its transvariation probability, $tp(\mathbf{z})$ [ $tp_{f}(\mathbf{z})$ ], with respect to $\mathbf{m}_{\mathbf{X}}$ is computed. Then, ${\mathbf{z}}$ is assigned to the target set if

[TABLE]

In order to visualize how the TOCCs work in practice, consider Figure 4. In the plot, target glass samples are colored in different shades of gray, according to the level of their transvariation probabilities, $tp(\mathbf{c})$ , with respect to the target group median, ${\mathbf{m}}_{\mathbf{X}}$ (the cross). As expected, moving away from ${\mathbf{m}}_{\mathbf{X}}$ , the magnitude of transvariation probability decreases. In particular, by setting $s=0.90$ , all the objects with a value of $tp(\mathbf{c})$ smaller than the threshold ${\mathbf{t}}$ , are classified as (false) negative (i.e. the stars).

Consider again Figure 1. As it can be easily noticed, the triangle cloud (i.e. the target class) is polluted by several non-target objects. As the TOCC can be seen as a data depth measure, it tends to ‘peel’ the target set and, therefore, it may fail to detect those deviating observations that do not lie on the external border. In order to improve this procedure, and inspired by those algorithms that use a set of prototypes to represent the input data (e.g. $K$ -means, SOM, …), a modified version of the TOCCdf is introduced.

The idea is to combine the TOCCdf with the clustering information on the target class provided by Partitioning Around Medoids, PAM (kaufman1990, ). Each cluster is analysed separately; as a result, the PAM-TOCCdf returns a set of thresholds, rather than a single one. In so doing, the algorithm is capable to detect those deviating observations that are scattered within the target set.

Figure 5 shows the two different solutions yielded by the the TOCCdf and the PAM-TOCCdf. As discussed, the TOCCdf (left panel) is able to identify only those deviating points placed on the target class perimeter. For this reason, such procedure is suggested when there is no evidence of strong overlap between the two sets. In all the other situations, the PAM-TOCCdf (right panel) should be preferred: as clearly displayed, this algorithm is able to detect non-target objects that deviate along different directions.

The following steps outline the PAM-TOCCdf two-phases process:

Phase I:
(a)

run the PAM algorithm on the target class, with a number of clusters $K$ chosen beforehand; store the resulting information on both the group membership and the prototype vectors. 2. Phase II: for each cluster $k$ ,

(a)

set a value, $s_{k}$ , as the desired minimum sensitivity of the one-class classifier (generally, $s_{k}$ is set equal $\forall k$ ); 2. (b)

for each unit ${\mathbf{c}}$ in the $k$ -th cluster compute its transvariation probability $tp(\mathbf{c})$ with respect to the group prototype, ${}_{k}{\mathbf{m}}_{{\mathbf{X}}}$ . As ${\mathbf{m}}_{\mathbf{X}}$ is no longer the median, but the cluster centroid, there is no guarantee that $\tau_{M}$ is equal to $\frac{1}{2}$ . For this reason, the transvariation probability should be computed according to equation 7, in both the univariate and the multivariate contexts; 3. (c)

use the $s_{k}-th$ percentile of the (increasing) ordered distribution of transvariation probabilities as a threshold, ${}_{k}{\mathbf{t}}$ , for the one-class classifier.

A new sample ${\mathbf{z}}$ is firstly assigned to the closest group $g$ . Then, its transvariation probability, $tp(\mathbf{z})$ with respect to ${}_{g}{\mathbf{m}}_{\mathbf{X}}$ , is computed. The final decision on ${\mathbf{z}}$ is carried out according to the rule described in (9), where ${t}={{}_{k=g}{t}}$ .

3.3 Practical considerations

The computational cost of the TOCCs increases with the number of features $p$ involved in the problem at hand.

For the TOCCdf this relationship is (at most) linear: the algorithm examines one variable at a time and, thus, it requires the calculation of (at most) $n\times p$ differences $({x}_{iu}-{c_{u}})({m}_{u}-{c_{u}}),\ i=1,\ldots,n$ , $u=1,\ldots,p$ , in order to decide whether the object $\mathbf{c}$ transvariates.

In the case of the TOCCdb, the area under the curve is split into $2^{p}$ regions, identified at the intersection of the $p$ axes that originate from the spatial median, ${\mathbf{m}}_{\mathbf{X}}=(m_{1},\dots,m_{p})$ , as shown in Figure 6.

Differently from the TOCCdf, the TOCCdb is not a step-wise procedure, as it considers all the variables together (see equation 8). However, the cost of the algorithm increases exponentially with $p$ , since $2^{p}$ regions must be defined; unfortunately, this step is not scalable.

For these reasons, preliminary dimension reduction or variable selection procedures may be convenient in order to handle the classification task efficiently. In the following, several strategies are outlined.

3.3.1 Dimension reduction and variable selection

For dimension reduction, the classical Principal Component Analysis (PCA) or its sparse version (sPCA) introduced by zou2006sparse proved to produce good results in the one-class framework, especially when only the low-variance projections are retained (tax2003feature, ). In fact, such directions turned out to be the most informative ones for the one-class classification problem, since they provide the tightest description of the target set.

Besides PCA, the Random Projection (RP) method represents a valid alternative for reducing the data dimensionality. In the context of supervised classification, samworth proposed an ensemble method that identifies the best $B_{1}$ RPs according to the smallest misclassification error rate. Within the one-class classification framework, a similar approach can be implemented. In this context the information on non-target objects is unavailable or vague, therefore a possible solution is to select those RPs that minimise the Median Absolute Deviation (MAD) of the projected data. Coherently with the definition of transvariation probability in (1), such strategy provides indeed the most compact projection of the target set with respect to its median. The resulting classification vectors are then aggregated through a majority vote scheme.

To deal with the variable selection task, many approaches have been developed in the model-based clustering and classification framework, e.g. scrucca2014clustvarsel , murphy2010variable and mclachlan2005analyzing . Among them, varSel algorithm introduced by sartori uses Gaussian Mixtures to identify the most suitable variables for classification (and clustering) purposes.

Random projections can also be exploited to perform variable selection. The input features could be ranked according to a modified version of the Importance Coefficient (CI) introduced by montanari2001projection in the context of projection pursuit. For the generic $d$ -dimensional RP, the CI of the u-th variable is computed as:

[TABLE]

where $a_{uqi}$ indicates the attribute u coefficient in the q-th vector of the d-dimensional random projection solution $i$ and $s_{u}$ the variability (i.e. the standard deviation) of each attribute. Since $B_{1}$ random projections are available, the overall importance measure for each variable can be derived as the median CI across projections and it is called Variable Importance in Projection (VIP):

[TABLE]

The median is used here so as to mitigate the effects of potential not-so-good projections on the VIP. The number of variables to be kept is decided by the user.

The presence of highly associated input features pollutes the capability of the VIP to detect those actually relevant since, by its nature, it tends to assume approximately the same value for very correlated variables. Thus, a specific correction procedure for this measure is advisable in order to mitigate the correlation effect.

A possible strategy is to retain the variables with the highest VIP value whilst discarding those that strongly correlate, on average, with the variables already considered; i.e. those that exhibit an average absolute correlation $\bar{\rho}$ larger than a given threshold, $\kappa$ . From our empirical experience, a reasonable interval for $\kappa$ would be $0.4-0.7$ , depending on the average degree of the association in the original data: the strongest the association, the lower is the threshold. We shall refer to the adjusted-for-correlation VIP as the $\kappa-$ VIP.

3.4 Simulated examples

The performances of the TOCCs have been evaluated in an extensive simulation study. In each of the simulation settings described below, target ( $\chi$ ) objects are generated according to different bivariate distributions, so as to visualise how the proposals work in practise. Non-target data ( $\Upsilon$ ) are considered to evaluate the performances of the classification rules learned on $\chi$ only.

For the first four scenarios, the mean vector of non-target data is obtained by shifting the mean vector of target objects. The magnitude of the shift is described by a non-centrality parameter, called $\lambda$ ; different magnitudes (i.e. $\lambda=1$ , small shift; $\lambda=2$ , large shift) are considered.

In the first scenario, we simulate target objects from a bivariate Gaussian distribution, whose components are standard normal random variables with a correlation equal to 0.35. 2. 2.

Second scenario considers a skew target class, i.e. the squared bivariate Gaussian distribution of scenario (a) is used as generative model. 3. 3.

Differently, in the third scenario, target data are generated by taking the the square root of the absolute bivariate Gaussian distribution in scenario (a). 4. 4.

In scenario four, data are drawn from the logarithm of the bivariate Gaussian distribution in scenario (a).

Further settings have been explored, i.e. scenarios (e)-(h), so as to evaluate the behaviour of the TOCCs in the presence of non-target objects uniformly scattered within a box over the target class. The size of the box is determined by the target data itself; basically, the center of the box is the median of the features, and the sides are 3 times the interquartile range of each dimension. The same distributions of scenarios (a)-(d) are considered as target class.

An additional scenario (i) with non-standard data shape is also evaluated. Specifically, in this case, both target and non-target objects are generated according to a bivariate banana-shaped distribution with different angle widths.

For each scenario, different sizes of the target class, $n_{T}$ , are considered (i.e. 100, 200, 500); non-target class size, $n_{NT}$ , is always taken to be $0.5n_{T}$ . For each setting, 100 repetitions are run and results are compared with several state-of-the-art one-class classifiers.

In particular, these methods include the Gaussian model (Gauss, implemented using the mahalanobis function), the Mixture of Gaussians approach (Mix-Gauss, implemented using the mclust package (see mclust, ), where the optimal number of components, ranging from 1 to 9, was chosen so as to maximize the BIC), the kernel density estimate (KDE, implemented using the ks package with the normal kernel and the unconstrained plug-in bandwidth selector), the $K$ -means algorithm (KM, implemented using the kmeans function with $K=5$ clusters), the 2-dimensional self organizing map (SOM, implemented using the kohonen package with a $5\times 5$ grid and a learning rate $\alpha=(0.5,0.3)$ ) and the support vector data description (SVDD, implemented using the svdd package, with a cost parameter for the positive examples $C=0.1$ ).

Mixtures of Gaussians are fitted to the data for the TOCCdb in each scenario. The PAM-TOCCdf has run with a number of clusters $K=5$ , coherently with the settings of the competing methods.

Figures 7 and 8 contain the aggregated results for each scenario. The boxplots show the behaviour of the specificity rates for $s\geq 0.9$ sensitivity level; the horizontal line helps the comparison among the approaches, by highlighting the median specificity for the TOCCdf.

Results coming from this study clearly show the general effectiveness of the transvariation-based one-class classifier we introduced. In particular, for all the simulated models, the algorithms attain specificity rates that are always better than or, at least, comparable with those from the state-of-the-art methods. These promising outcomes allow to efficiently use the proposed procedures in a wide variety of problems.

A separate evaluation should be carried out for the PAM-TOCCdf; the performances of this classifier strongly depend on the behavior of the non-target observations. In fact, as clearly depicted in the boxplots of Figure 7, it tends to outperform the other methods especially when the detection problem is particularly difficult, that is, when non-target observations pollute the core of target set and do not limitedly lie on its external perimeter.

Boxplots in Figure 8 exhibit a generally improved performance for almost all the methods in the presence of non-target samples uniformly scattered over the target set: overall, the median specificity for a sensitivity level $s\geq 0.9$ is above 75%. Also in these scenarios, the PAM-TOCCdf is able to globally detect the largest number of deviating observations.

Among the considered state-of-the art methods, the KDE appears to perform poorly almost everywhere. This is probably due to a wrong specification of the bandwidth matrix $H$ for the non-target class: $H$ is estimated only on the target set and, therefore, the kernel $\varphi_{H}(.)$ is likely to produce incorrect estimates for the observations that differ too much from this class.

A special mention should be made for the results of the last scenario, depicted in Figure 9. In general, the non-convexity of the banana-shaped data appears very hard to be detected, particularly by the less flexible methods. In such situations, the most adaptive procedures (i.e. PAM-TOCCdf, Mix-Gauss and SOM) handle the “non-typicality” of the target class distribution more appropriately.

4 Glass data analysis

The analysis of the glass fragments is carried out by the TOCCs proposed and described in the previous sections. Preliminarily, dimension reduction and variable selection procedures are applied, as suggested in Section 3.3.

PCA is computed on the window fragments and the last two components are retained. For the RP method, the best $B_{1}=101$ bi-dimensional projections are considered, each carefully chosen within $B_{2}=50$ possible solutions.

About the variable selection procedures, the first two most important features according to both the VarSel and the VIP algorithms are considered; in particular, given the moderately high degree of association (see Table 1), the adjusted-for-correlation VIP is applied, with a threshold $\kappa=0.5$ .

The bi-dimensional target data representation of Figure 1 shows an approximately elliptical shape that suggests to consider a mixture of Gaussians as the reference model for this class. As the chemical composition of the two sets of fragments is similar, we can expect them to be (at least) partially overlapping; for this reason, the PAM-TOCCdf is run with a number of clusters moderately large compared to the number of units, i.e. $K=4$ .

Figure 10 depicts the ROC curves for the three TOCCs, distinguished by the different strategies implemented to reduce the data dimensionality; Table 2 contains the corresponding area under the ROC curve (AUC). Overall results are very good, as almost all the non-window fragments have been recognised. However, a few considerations can still be made. In particular, for this set of data variable selection procedures slightly outperform the dimension reduction ones; plots in the second row exhibit a quasi-perfect performance. As shown in Figure 11, the two sets of fragments look well separated when plotted according to the most relevant features, even if these are different for the two methods (varSel chose potassium and magnesium, whilst $\kappa$ -VIP selected silicon and magnesium). The goodness of such selections allows all the TOCCs to perform excellently.

When the characteristics of the target and non-target objects are not so easily distinguishable (see, Figure 1), the PAM-TOCCdf should be preferred; this method is, by construction, more capable to identify the non-window glasses scattered within the window samples; in addition, it requires the lowest computational time, as shown in Table 3.

5 Discussion and conclusions

In this work, new directions for forensic analysis of glass fragments have been considered. In particular, the problem of identifying glass samples that come from different sources in a crime scene has been addressed for the first time (to the best of our knowledge) within a one-class classification framework.

We proposed to consider transvariation probability as a measure of resemblance between an observation and a set of well-known objects. Basing on tp, three different algorithms have been introduced, according to the available information on the target set. Namely, TOCCdf is a distribution-free method that only relies on the computation of the transvariation probability. When information on the distributional shape of the target units is available, a distribution-based TOCC, TOCCdb, can be successfully implemented. These methods perform very well, especially when non-target objects lie on the external perimeter of the target class.

However, information on the deviating samples is, in principle, not available and the situation just described may not be realistic as non-target units can actually pollute the target set intrinsically. For this reason, a more flexible method that allows to peel the target objects within the data cloud has been developed. The PAM-TOCCdf identifies homogeneous groups of target samples and exploits such information to spot the units that deviate from each cluster.

The performances of the proposed method have been evaluated in terms of specificity, i.e. the proportion of actual negatives that are correctly predicted, on multiple synthetic datasets. Simulation results demonstrate that the use of $tp$ as a tool for one-class classification outperforms several state-of-the-art methods.

The chemical composition of the two sets of glass fragments that motivate our work is very similar and the samples cannot be easily distinguished. For this reason, the PAM-TOCCdf appears to be the most appropriate transvariation-based one-class classifier, being able to detect all the non-window objects. The methodology we propose is very flexible and can be employed to solve different one-class classification tasks, such as food authentication, fraud detection, central statistical monitoring issues, to name a few. In (amsdottorato8412, ) excellent performances achieved by the TOCCs on other datasets are shown. In particular, the proposed classifier has been applied to two sets of near infrared spectroscopic food data, in order to evaluate food samples’ authenticity (namely, one related to honey samples and the other concerning olive oil). In addition, the Water Treatment Plant dataset from the UCI repository was successfully explored in a fault detection perspective. This dataset is well-known in the literature as a difficult classification task, since no method turned out to be able to correctly identify the days in which the plant wrongly operated.

Bibliography41

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Colin GG Aitken, Grzegorz Zadora, and David Lucy. A two-level model for evidence evaluation. Journal of forensic sciences , 52(2):412–419, 2007.
2[2] Christopher M Bishop. Novelty detection and neural network validation. IEEE Proceedings-Vision, Image and Signal processing , 141(4):217–222, 1994.
3[3] Timothy I Cannings and Richard J Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 79(4):959–1035, 2017.
4[4] Gail A Carpenter, Stephen Grossberg, and David B Rosen. Art 2-a: An adaptive resonance algorithm for rapid category learning and recognition. Neural networks , 4(4):493–504, 1991.
5[5] Yixin Chen, Xin Dang, Hanxiang Peng, and Henry L Bart. Outlier detection with the kernelized spatial depth function. IEEE Transactions on Pattern Analysis and Machine Intelligence , 31(2):288–305, 2009.
6[6] Camillo Dagum. Transvariazione fra più di due distribuzioni. Gini, C.(ed.) Memorie di metodologia statistica , 2, 1959.
7[7] Xin Dang and Robert Serfling. Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties. Journal of Statistical Planning and Inference , 140(1):198–213, 2010.
8[8] Sarah M Erfani, Sutharshan Rajasegarar, Shanika Karunasekera, and Christopher Leckie. High-dimensional and large-scale anomaly detection using a linear one-class svm with deep learning. Pattern Recognition , 58:121–134, 2016.