Pointed subspace approach to incomplete data

Łukasz Struski; Marek Śmieja; Jacek Tabor

arXiv:1705.00840·cs.LG·May 3, 2017


TL;DR

This paper represents incomplete data as pointed affine subspaces, which makes it possible to apply affine transformations to such data and to embed them into a vector space for classification.

Contribution

It generalizes the traditional missing data representation by using pointed affine subspaces and provides a method to embed these into vector space while preserving scalar products.

Findings

1. Allows affine transformations of incomplete data
2. Enables embedding into vector space for classification
3. Preserves scalar products in the embedding

Abstract

Incomplete data are often represented as vectors with filled missing attributes joined with flag vectors indicating missing components. In this paper we generalize this approach and represent incomplete data as pointed affine subspaces. This allows one to perform various affine transformations of data, such as whitening or dimensionality reduction. We embed such generalized missing data into a vector space by mapping a pointed affine subspace (a generalized missing data point) to a vector containing imputed values joined with a corresponding projection matrix. Such an operation preserves the scalar product of the embedding defined for flag vectors and allows transformed incomplete data to be fed to typical classification methods.

Tables

Table 1: Mean accuracies for a classification of UCI data sets with randomly missing attributes.

data	embedding	zero	mean	median	most probable
BC	no information	$0.71 \pm 0.02$	$0.71 \pm 0.04$	$0.73 \pm 0.04$	$0.76 \pm 0.02$
BC	subspace	$0.73 \pm 0.03$	$0.73 \pm 0.05$	$0.74 \pm 0.04$	$0.77 \pm 0.02$
IS	no information	$0.63 \pm 0.02$	$0.67 \pm 0.02$	$0.67 \pm 0.02$	$0.67 \pm 0.03$
IS	subspace	$0.65 \pm 0.02$	$0.67 \pm 0.02$	$0.67 \pm 0.03$	$0.67 \pm 0.02$
Y	no information	$0.49 \pm 0.02$	$0.52 \pm 0.01$	$0.52 \pm 0.01$	$0.52 \pm 0.01$
Y	subspace	$0.5 \pm 0.02$	$0.52 \pm 0.01$	$0.52 \pm 0.02$	$0.53 \pm 0.01$

Table 2: Mean accuracies for a classification of UCI data sets with structural attribute absence.

data	embedding	zero	mean	median	most probable
BC	no information	$0.74 \pm 0.03$	$0.73 \pm 0.03$	$0.73 \pm 0.02$	$0.76 \pm 0.02$
BC	subspace	$0.76 \pm 0.03$	$0.76 \pm 0.02$	$0.76 \pm 0.03$	$0.78 \pm 0.02$
IS	no information	$0.66 \pm 0.02$	$0.67 \pm 0.03$	$0.69 \pm 0.03$	$0.69 \pm 0.03$
IS	subspace	$0.71 \pm 0.03$	$0.70 \pm 0.04$	$0.71 \pm 0.02$	$0.72 \pm 0.02$
Y	no information	$0.61 \pm 0.03$	$0.52 \pm 0.01$	$0.52 \pm 0.01$	$0.52 \pm 0.01$
Y	subspace	$0.62 \pm 0.04$	$0.56 \pm 0.02$	$0.59 \pm 0.02$	$0.56 \pm 0.02$

Table 3: Mean accuracies for a classification of medical data.

embedding	zero	mean	median	most probable
no information	$0.82 \pm 0.03$	$0.81 \pm 0.02$	$0.81 \pm 0.03$	$0.81 \pm 0.02$
subspace	$0.82 \pm 0.01$	$0.83 \pm 0.02$	$0.83 \pm 0.02$	$0.83 \pm 0.01$

Equations

$$\langle(x,J_{x}),(y,K_{y})\rangle=\langle x,y\rangle+\langle\mathds{1}_{J_{x}},\mathds{1}_{K_{y}}\rangle.$$

$$x+\mathrm{span}(e_{j})_{j\in J_{x}},$$

$$F(x+V)=F(x)+AV.$$

$$\langle(x,\mathds{1}_{J_{x}}),(y,\mathds{1}_{K_{y}})\rangle=\langle(x,p_{\mathrm{span}(e_{j}:j\in J_{x})}),(y,p_{\mathrm{span}(e_{k}:k\in K_{y})})\rangle.$$

$$f(x+V)=\{Aw+b:w\in x+V\}.$$

$$f(x+V)-f(x)=AV.$$

$$f(x+V)=(Ax+b)+AV,$$

$$p_{V}(y)=\sum_{j\in J}\langle y,v_{j}\rangle v_{j}=\sum_{j\in J}v_{j}v_{j}^{T}y=\Big(\sum_{j\in J}v_{j}v_{j}^{T}\Big)y,$$

$$p_{V}=\sum_{j\in J}v_{j}v_{j}^{T}.$$

$$\Phi(S_{x})\in S_{x},$$

$$x_{V^{\perp}}=x-p_{V}(x)=x-\sum_{j\in J}\langle x,v_{j}\rangle v_{j},$$

$$x_{V}^{(m,\Sigma)}=x+p_{V}^{\Sigma}(m-x),$$

$$S_{x}\to(x,p_{V})\in\mathbb{R}^{N}\times\mathbb{R}^{N\times N},$$

$$\mathrm{Whitening}(x)=\Sigma^{-1/2}(x-m),$$

$$\mathrm{Whitening}(x+V)=\Sigma^{-1/2}(x-m)+\Sigma^{-1/2}V.$$

$$\mathrm{PCA}(x)=W^{T}(x-m),$$

$$\mathrm{PCA}(x+V)=W^{T}(x-m)+W^{T}V.$$

$$\langle x+V,y+W\rangle=\langle x,y\rangle+\langle p_{V},p_{W}\rangle.$$

$$\langle x+V,y+W\rangle_{D}=\langle x,y\rangle+D\langle p_{V},p_{W}\rangle.$$

$$V=\mathrm{span}(v_{j}:j\in J),\quad W=\mathrm{span}(w_{k}:k\in K).$$

$$\langle p_{V},p_{W}\rangle=\sum_{j\in J,k\in K}\langle v_{j},w_{k}\rangle^{2}.$$

$$\langle p_{V},p_{W}\rangle=\sum_{j\in J,k\in K}\mathrm{tr}\big((v_{j}v_{j}^{T})^{T}(w_{k}w_{k}^{T})\big).$$

$$\mathrm{tr}\big((v_{j}v_{j}^{T})^{T}(w_{k}w_{k}^{T})\big)=\mathrm{tr}(v_{j}v_{j}^{T}w_{k}w_{k}^{T})=\mathrm{tr}(v_{j}^{T}w_{k}w_{k}^{T}v_{j})=(v_{j}^{T}w_{k})\cdot(w_{k}^{T}v_{j})=\langle v_{j},w_{k}\rangle^{2}.$$

$$\langle x+V,y+W\rangle_{D}=\langle x,y\rangle+D\sum_{i,j}(p_{V})_{ij}(p_{W})_{ij}=\langle x,y\rangle+D\sum_{j\in J,k\in K}\langle v_{j},w_{k}\rangle^{2},$$

$$(x,J)\to(x,\mathds{1}_{J})\in\mathbb{R}^{N}\times\mathbb{R}^{N}.$$

$$\langle(x,\mathds{1}_{J}),(y,\mathds{1}_{K})\rangle=\langle x,y\rangle+\langle\mathds{1}_{J},\mathds{1}_{K}\rangle=\langle x,y\rangle+\mathrm{card}(J\cap K).$$

$$\langle p_{V},p_{W}\rangle=\sum_{j\in J,k\in K}\langle e_{j},e_{k}\rangle^{2}=\sum_{l\in J\cap K}\langle e_{l},e_{l}\rangle^{2}=\sum_{l\in J\cap K}1=\mathrm{card}(J\cap K),$$

Taxonomy

Topics: Rough Sets and Fuzzy Logic · Data Mining Algorithms and Applications · Data Management and Algorithms

Full text

Łukasz Struski, Marek Śmieja, Jacek Tabor

Jagiellonian University, Kraków, Poland


Keywords: incomplete data, SVM, linear transformations

1 Introduction

Incomplete data analysis is an important part of data engineering and machine learning, since it appears in many practical problems. In medical diagnosis, a doctor may be unable to complete the patient examination due to the deterioration of health status or lack of patient’s compliance (Burke et al., 1997); in object detection, the system has to recognize the shape from low resolution or corrupted images (Berg et al., 2005); in chemistry, the complete analysis of compounds requires high financial costs (Stahura and Bajorath, 2004). In consequence, the understanding and the appropriate representation of such data is of great practical importance.

A missing data point is typically viewed as a pair $(x,J_{x})$, where $x\in\mathbb{R}^{N}$ is a vector with missing components $J_{x}\subset\{1,\ldots,N\}$. In the most straightforward approach, one can fill missing attributes with some statistic, e.g. the mean, taken from existing data. Although such a strategy can be partially justified when the features are missing at random, we lose the knowledge about which attributes were unknown. (Footnote 1: In medical data, a component is typically missing when the state of the patient is so bad that a given numerical procedure cannot be performed. Consequently, the knowledge that a given component is missing can say a lot about the state of the patient.) To preserve this information we usually add a flag indicating which components were missing. More precisely, we supply $x$ with a binary vector $\mathds{1}_{J_{x}}$, in which 1 denotes an absent feature and 0 a present one.

Summarizing, we perform the embedding $(x,J_{x})\to(x,\mathds{1}_{J_{x}})$ of missing points into a vector space of extended complete data. This allows us to apply typical classification tools, like SVM, with the scalar product defined by

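$$\langle(x,J_{x}),(y,K_{y})\rangle=\langle x,y\rangle+\langle\mathds{1}_{J_{x}},\mathds{1}_{K_{y}}\rangle.\label{eq:standard}$$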

In practical classification problems we usually perform various affine transformations of data, such as whitening or dimensionality reduction, before training a classifier. Moreover, we may know that the data satisfy some affine constraint. It is nontrivial to modify the flag vectors so that they stay consistent with such affine transformations. Thus, the main problem addressed in this paper can be stated as follows: how should the flag vectors indicating the missing components be transformed when we apply a linear (or affine) mapping to the data?

In this contribution, we show that the answer can be given by viewing the incomplete data as pointed affine subspaces, i.e. the subspace with a distinguished point called basepoint. We first observe that a pair $(x,J_{x})$ can be formally associated with a pointed affine subspace of $\mathbb{R}^{N}$ :

$$x+\mathrm{span}(e_{j})_{j\in J_{x}},$$

where $(e_{j})_{j=1}^{N}$ denotes the canonical base of $\mathbb{R}^{N}$ and $x$ is a selected basepoint. In other words, this is the set of all points which coincide with the representative $x$ on the coordinates outside $J_{x}$. In consequence, by a generalized missing data point in $\mathbb{R}^{N}$ we understand a pointed affine subspace $S_{x}=x+V$ of $\mathbb{R}^{N}$, where $x\in\mathbb{R}^{N}$ is a basepoint and $V=S_{x}-x$ is a linear subspace. Although the basepoint can be selected using various imputation techniques, we propose to choose the most probable point of $S_{x}$, i.e. to project the dataset mean onto $S_{x}$ with respect to the Mahalanobis scalar product given by the covariance of the data.

Such a definition allows us to efficiently extend linear and affine operations from standard points to missing ones, by taking the image of both the subspace and the basepoint. For example, an affine mapping $F:w\to Aw+b$ can be extended to a pointed subspace $x+V$ by

$$F(x+V)=F(x)+AV.$$

Given an affine constraint $W$, we restrict $x+V$ by the formula $(x+V)\cap W=x+(V\cap(W-x))$. (Footnote 2: Observe that if such a constraint $W$ is given, the augmentation of the missing components must be performed in such a way as to choose the representative in $W$, and consequently we may assume that $x\in W$.)

Another question arises: how should we work with such data, and in particular how can the generalized missing data be embedded into a vector space in a way that respects the scalar product \eqref{eq:standard} given by the flag embedding? Our main observation shows that this can be achieved by identifying a linear subspace $V$ with the orthogonal projection $p_{V}:\mathbb{R}^{N}\to V$, i.e. by considering the embedding $(x,V)\to(x,p_{V})\in\mathbb{R}^{N}\times\mathbb{R}^{N\times N}$. We show that the scalar product of embeddings coincides with \eqref{eq:standard}, i.e.

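$$\langle(x,\mathds{1}_{J_{x}}),(y,\mathds{1}_{K_{y}})\rangle=\langle(x,p_{\mathrm{span}(e_{j}:j\in J_{x})}),(y,p_{\mathrm{span}(e_{k}:k\in K_{y})})\rangle.$$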

The paper is organized as follows. The next section covers related approaches to incomplete data analysis. In the third section, we define the generalized missing data, present a strategy for embedding such data into a vector space and propose a new imputation method. We also define a scalar product for such embeddings and show its connections with the existing flag approach. In the fourth section, we illustrate our method with sample classification results.

2 Related works

The most common approach to learning from incomplete data is known as deterministic imputation (McKnight et al., 2007). In this two-step procedure, the missing features are filled first, and only then a standard classifier is applied to the complete data (Little and Rubin, 2014). Although the imputation-based techniques are easy to use for practitioners, they lead to the loss of information which features were missing and do not take into account the reasons of missingness. To preserve the information of missing attributes, one can use an additional vector of binary flags, which was discussed in the introduction.

The second popular group of methods aims at building a probabilistic model of incomplete data which maximizes the likelihood by applying the EM algorithm (Ghahramani and Jordan, 1994; Schafer, 1997). This makes it possible to generate the most probable values for missing attributes from the obtained probability distribution (random imputation) (McKnight et al., 2007) or to learn a decision function directly based on the distributional model. The second option was already investigated in the case of linear regression (Williams et al., 2005), kernel methods (Smola et al., 2005; Williams and Carin, 2005) or by using second order cone programming (Shivaswamy et al., 2006). One can also estimate the parameters of the probability model and the classifier jointly, which was considered in (Dick et al., 2008; Liao et al., 2007). These techniques work very well when the missing data is conditionally independent of the unobserved features given the observations, but there is no guarantee of a reasonable estimate in the more general missing not at random case.

There is also a group of methods, which does not make any assumptions about the missing data model and makes a prediction from incomplete data directly. In (Chechik et al., 2008) a modified SVM classifier is trained by scaling the margin according to observed features only. The alternative approaches to learning a linear classifier, which avoid features deletion or imputation, are presented in (Dekel et al., 2010; Globerson and Roweis, 2006). Finally, in (Grangier and Melvin, 2010) the embedding mapping of feature-value pairs is constructed together with a classification objective function.

In our contribution, we generalize the imputation-based techniques in a way that preserves the information about missing features. To select a basepoint we propose to choose the most probable point from the subspace identifying a missing data point; however, other imputation methods can be used as well. The constructed representation allows one to apply various affine data transformations, while preserving the classical scalar product, before applying typical classification methods.

3 Generalized incomplete data

In this section, we introduce the subspace approach to incomplete data. First, we define a generalized missing data point, which allows one to perform affine transformations of incomplete data. Then, we show how to embed generalized missing data into a vector space and select a basepoint. Finally, we define a scalar product on the embedding space.

3.1 Incomplete data as pointed affine subspaces

Incomplete data $X$ can be understood as a sequence of pairs $(x_{i},J_{i})$ , where $x_{i}\in\mathbb{R}^{N}$ and $J_{i}\subset\{1,\ldots,N\}$ indicates missing coordinates of $x_{i}$ . Therefore, we can associate a missing data point $(x,J)$ with an affine subspace $x+\mathrm{span}(e_{j})_{j\in J}$ , where $(e_{j})_{j}$ is the canonical base of $\mathbb{R}^{N}$ . Let us observe that $x+\mathrm{span}(e_{j})_{j\in J}$ is a set of all $N$ -dimensional vectors which coincide with $x$ on the coordinates different from $J$ .
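The association between a pair $(x,J)$ and the affine subspace $x+\mathrm{span}(e_{j})_{j\in J}$ can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the function name `to_pointed_subspace`, the use of NaN to mark missing entries and the placeholder `fill` value are assumptions, and indices are 0-based in the code while the text uses 1-based indices.

```python
import numpy as np

def to_pointed_subspace(row, fill=0.0):
    """Turn a row with NaNs into (basepoint x, missing index set J, basis V).

    `fill` is a placeholder imputation value (an assumption of this sketch);
    the columns of V are the canonical vectors e_j for j in J.
    """
    row = np.asarray(row, dtype=float)
    missing = np.isnan(row)
    J = np.flatnonzero(missing)          # indices of missing coordinates
    x = np.where(missing, fill, row)     # basepoint with placeholders filled in
    V = np.eye(row.size)[:, J]           # columns e_j, j in J
    return x, J, V

# The second coordinate (index 1) is missing.
x, J, V = to_pointed_subspace([1.0, np.nan, 3.0])
```

Every point of the form `x + V @ c` agrees with `x` on the observed coordinates, which is exactly the set described above.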

In this paper, we focus on transforming incomplete data by affine mappings. For this purpose, we generalize the above representation to arbitrary affine subspaces, or more precisely pointed affine subspaces, which do not have to be generated by canonical bases.

Definition 3.1.

A generalized missing data point is defined as a pointed affine subspace $S_{x}=x+V$ , where $x\in\mathbb{R}^{N}$ is a basepoint and $V=S_{x}-x$ is a linear subspace of $\mathbb{R}^{N}$ .

A basepoint can be selected by filling missing attributes with a use of any imputation method, which will be discussed in the next subsection.

Remark 3.2.

Observe that the notion of pointed affine subspace differs from classical affine subspace. In particular, pointed subspace depends on the selection of basepoint. In consequence, we can create two different generalized missing data points $S_{y},S_{z}$ from the same missing data point $(x,J)$ by using different imputation methods.

First, we show that the above definition is useful for defining affine mappings on incomplete data. Let $S_{x}=x+V$ be a generalized missing data point and let $f:\mathbb{R}^{N}\ni w\to Aw+b$ be an affine map. We can transform a generalized missing data point $x+V$ into another missing data point by the formula:

$$f(x+V)=\{Aw+b:w\in x+V\}.$$

The basepoint $x$ is mapped into $Ax+b$ , while the linear part of $f(x+V)$ is given by

$$f(x+V)-f(x)=AV.$$

Consequently, we arrive at the definition:

Definition 3.3.

For a generalized missing data point $S_{x}=x+V$ and an affine mapping $f:w\to Aw+b$ we put:

$$f(x+V)=(Ax+b)+AV,$$

where $Ax+b$ is a basepoint and $AV$ is a linear subspace.

One can easily compute and represent $AV$ if an orthonormal base $v_{1},\ldots,v_{n}$ of $V$ is given: we simply orthonormalize the sequence $Av_{1},\ldots,Av_{n}$, discarding linearly dependent vectors if $A$ is not injective on $V$.
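A minimal sketch of Definition 3.3, assuming the subspace is stored as a matrix whose columns span $V$. The function name `transform_pointed_subspace` and the tolerance `tol` are hypothetical, and the SVD is used here as one convenient way to orthonormalize $Av_{1},\ldots,Av_{n}$ while dropping directions that $A$ collapses; the paper does not prescribe a particular orthonormalization routine.

```python
import numpy as np

def transform_pointed_subspace(x, V, A, b, tol=1e-10):
    """Map the pointed subspace x + span(V) through f(w) = A w + b.

    x : (N,) basepoint, V : (N, n) matrix whose columns span the linear part.
    Returns the new basepoint A x + b and an orthonormal basis of A V.
    """
    new_x = A @ x + b
    AV = A @ V
    # Left singular vectors with nonzero singular value form an orthonormal
    # basis of the column space of A V.
    U, s, _ = np.linalg.svd(AV, full_matrices=False)
    rank = int(np.sum(s > tol))
    return new_x, U[:, :rank]

x = np.array([1.0, 2.0, 0.0])
V = np.array([[0.0], [0.0], [1.0]])            # third coordinate missing
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.zeros(2)
new_x, basis = transform_pointed_subspace(x, V, A, b)
```

Here the image of the canonical direction $e_{3}$ is the diagonal direction in $\mathbb{R}^{2}$, which no flag vector could encode.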

3.2 Embedding of generalized missing data

The above representation is useful for understanding and performing affine transformations of incomplete data, such as whitening, dimensionality reduction or incorporating affine constraints to data. Nevertheless, typical machine learning methods require vectors or a kind of kernel (or similarity) matrix as the input. We show how to embed generalized missing data into a vector space.

A generalized missing data point $S_{x}=x+V$ consists of a basepoint $x\in\mathbb{R}^{N}$ which is an element of vector space and a linear subspace $V$ . To represent a subspace $V$ , we propose to use a matrix of orthogonal projection $p_{V}$ onto $V$ . To get an exact form of $p_{V}$ , let us assume that $(v_{j})_{j\in J}$ is an orthonormal base of $V$ . Then, the projection of $y\in\mathbb{R}^{N}$ can be calculated by

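$$p_{V}(y)=\sum_{j\in J}\langle y,v_{j}\rangle v_{j}=\sum_{j\in J}v_{j}v_{j}^{T}y=\Big(\sum_{j\in J}v_{j}v_{j}^{T}\Big)y,$$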

which implies that

$$p_{V}=\sum_{j\in J}v_{j}v_{j}^{T}.$$
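In matrix form, the sum of outer products $v_{j}v_{j}^{T}$ over an orthonormal base equals $QQ^{T}$, where the columns of $Q$ are the base vectors. The sketch below assumes the input columns are linearly independent and, as an added convenience not required by the text, orthonormalizes them with a QR decomposition first; the name `projection_matrix` is hypothetical.

```python
import numpy as np

def projection_matrix(V):
    """Orthogonal projector onto span(V), assuming V has full column rank."""
    Q, _ = np.linalg.qr(V)     # columns of Q: orthonormal basis of span(V)
    return Q @ Q.T             # equals sum_j v_j v_j^T for that basis

# Subspace spanned by e_1 and e_3 in R^3 (first and third attribute missing).
V = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])
P = projection_matrix(V)
```

For a canonical subspace the projector is diagonal with ones at the missing coordinates, so its diagonal is the flag vector.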

The selection of basepoint relies on filling missing attributes with some concrete values, which is commonly known as imputation. In our setting, by the imputation we denote a function $\Phi:X\to\mathbb{R}^{N}$ such that

$$\Phi(S_{x})\in S_{x},$$

for a generalized missing data point $S_{x}$.

In the case of classical incomplete data, missing attributes are often filled with a mean or a median calculated from existing values for a given attribute. However, these imputations cannot be easily defined in a general case, because the linear part of generalized missing data point might be an arbitrary linear subspace (not necessarily a subspace generated by a subset of canonical base). Let us observe that another popular imputation method, which fills the missing coordinates with zeros can be defined for generalized incomplete data. This is performed by selecting a basepoint of an incomplete data point $S_{x}=x+V$ as the orthogonal projection of missing data $x$ onto the subspace orthogonal to $V$ , i.e.:

$$x_{V^{\perp}}=x-p_{V}(x)=x-\sum_{j\in J}\langle x,v_{j}\rangle v_{j},$$

where $(v_{j})_{j\in J}$ is an arbitrary orthonormal base of $V$ . If $V$ is represented by canonical base then this is equivalent to filling missing attributes with zeros.
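The generalized zero imputation can be sketched as removing from $x$ its component along $V$. The helper name `zero_basepoint` is hypothetical, and the QR step (used to obtain an orthonormal base from arbitrary independent spanning columns) is an assumption of this sketch.

```python
import numpy as np

def zero_basepoint(x, V):
    """Basepoint x - p_V(x): the point of x + span(V) orthogonal to V."""
    Q, _ = np.linalg.qr(V)            # orthonormal basis of span(V)
    return x - Q @ (Q.T @ x)

# Canonical case: the second attribute is missing, so it is set to zero.
canonical = zero_basepoint(np.array([5.0, 7.0, 9.0]),
                           np.array([[0.0], [1.0], [0.0]]))

# Non-canonical case: V is the diagonal direction in R^2.
tilted = zero_basepoint(np.array([2.0, 0.0]),
                        np.array([[1.0], [1.0]]))
```

In the second case no coordinate is literally set to zero; the basepoint is the element of $x+V$ closest to the origin.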

We propose another technique for setting missing values, which extends zero imputation method. Let us assume that $(m,\Sigma)$ are the mean and covariance matrix estimated for incomplete dataset $X$ . In this method, a basepoint of $x+V$ is selected as the orthogonal projection of $m$ onto $x+V$ with respect to the Mahalanobis scalar product parametrized by $\Sigma$ , i.e.

$$x_{V}^{(m,\Sigma)}=x+p_{V}^{\Sigma}(m-x),$$

where $p_{V}^{\Sigma}$ denotes the projection matrix onto $V$ with respect to the Mahalanobis scalar product given by $\Sigma$. To obtain the values of $m$ and $\Sigma$ in practice, one can use the existing attributes of the incomplete data to calculate a sample mean and a covariance matrix. Alternatively, if the data satisfy the missing at random assumption, then the EM algorithm can be applied to estimate the probability model describing the data (Schafer, 1997). We call this technique the most probable point imputation.
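The text does not spell out a closed form for $p_{V}^{\Sigma}$. The sketch below assumes the Mahalanobis scalar product $\langle a,b\rangle_{\Sigma}=a^{T}\Sigma^{-1}b$, for which the standard weighted projection onto the column space of $V$ is $V(V^{T}\Sigma^{-1}V)^{-1}V^{T}\Sigma^{-1}$; the function name `most_probable_basepoint` is hypothetical.

```python
import numpy as np

def most_probable_basepoint(x, V, m, Sigma):
    """Point of x + span(V) closest to m in the Mahalanobis norm of Sigma."""
    Sinv_V = np.linalg.solve(Sigma, V)         # Sigma^{-1} V
    Sinv_r = np.linalg.solve(Sigma, m - x)     # Sigma^{-1} (m - x)
    # Coefficients c solving (V^T Sigma^{-1} V) c = V^T Sigma^{-1} (m - x)
    coeff = np.linalg.solve(V.T @ Sinv_V, V.T @ Sinv_r)
    return x + V @ coeff

m = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
V = np.array([[0.0], [1.0]])                   # second attribute missing

# The result does not depend on the placeholder stored in the missing slot.
p1 = most_probable_basepoint(np.array([2.0, 0.0]), V, m, Sigma)
p2 = most_probable_basepoint(np.array([2.0, 5.0]), V, m, Sigma)
```

In this canonical example the imputed value 1 equals the Gaussian conditional mean of the second attribute given that the first equals 2, which is consistent with the name of the method.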

Summarizing, our embedding is defined as follows:

Definition 3.4.

A generalized missing data point is embedded in a vector space by

$$S_{x}\to(x,p_{V})\in\mathbb{R}^{N}\times\mathbb{R}^{N\times N},$$

where $S_{x}=x+V$ and $x$ is a basepoint.

Example 3.5.

To illustrate the effect of missing data imputation and transformation, let us consider the whitening operation:

$$\mathrm{Whitening}(x)=\Sigma^{-1/2}(x-m),$$

where $\Sigma$ is the covariance, and $m$ the mean of $X$ . For a generalized missing data the above operation is defined by:

$$\mathrm{Whitening}(x+V)=\Sigma^{-1/2}(x-m)+\Sigma^{-1/2}V.$$

In other words, we map a basepoint in a classical way and transform a subspace $V$ into a linear subspace $\Sigma^{-1/2}V$ . The illustration is given in Figure 2.
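The whitening of a pointed subspace can be sketched as follows, assuming a symmetric positive definite $\Sigma$ whose inverse square root is computed from its eigendecomposition. The names `inv_sqrt` and `whiten_pointed_subspace` and the numerical values are hypothetical.

```python
import numpy as np

def inv_sqrt(Sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    w, U = np.linalg.eigh(Sigma)
    return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

def whiten_pointed_subspace(x, V, m, Sigma):
    """Return the whitened basepoint and an orthonormal basis of Sigma^{-1/2} V."""
    S = inv_sqrt(Sigma)
    Q, _ = np.linalg.qr(S @ V)        # assumes the columns of V are independent
    return S @ (x - m), Q

m = np.zeros(2)
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
S = inv_sqrt(Sigma)
x = np.array([1.0, 0.0])
V = np.array([[0.0], [1.0]])          # second attribute missing
wx, Q = whiten_pointed_subspace(x, V, m, Sigma)
```

After whitening, the direction `Q` is no longer a coordinate axis, so the missing information cannot be described by a flag vector any more, while the projection matrix `Q @ Q.T` still captures it.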

Example 3.6.

In the case of high dimensional data, we sometimes reduce the dimension of the input data space by applying Principal Component Analysis, which is defined by:

$$\mathrm{PCA}(x)=W^{T}(x-m),$$

where $m$ is the mean of the dataset and the $k$ columns of $W$ are the leading eigenvectors of the covariance matrix $\Sigma$. This operation can be extended to the case of generalized missing data by:

$$\mathrm{PCA}(x+V)=W^{T}(x-m)+W^{T}V.$$

An example of the above operation is illustrated in Figure 3.

3.3 Scalar product for SVM kernel

To apply most classification methods it is necessary to define a scalar product (kernel matrix) on the data space. As a natural choice, one could sum the scalar products between basepoints and between embedding matrices, i.e.

$$\langle x+V,y+W\rangle=\langle x,y\rangle+\langle p_{V},p_{W}\rangle.\label{std:product}$$

However, $\|p_{V}\|^{2}=\dim V$, which for a data space of dimension $N$ can be as large as $N$, so the weight of the projection part can dominate the first part of \eqref{std:product} concerning basepoints. Consequently, we decided to introduce an additional parameter which allows the importance of the projection part to be reduced:

Definition 3.7.

Let $D\in[0,1]$ be fixed. As a scalar product between two generalized missing data points we put:

$$\langle x+V,y+W\rangle_{D}=\langle x,y\rangle+D\langle p_{V},p_{W}\rangle.\label{eq:productt}$$

Let us observe that the above parametric scalar product can be implemented by taking the embedding $x+V\to(x,\sqrt{D}p_{V})$ and then using formula \eqref{std:product} for a scalar product.
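The observation above can be checked numerically by flattening the projection matrix into a vector. The helper name `embed` and the numerical values are assumptions made for illustration.

```python
import numpy as np

def embed(x, P, D=1.0):
    """Vector embedding (x, sqrt(D) * p_V) with the matrix flattened row-wise."""
    return np.concatenate([x, np.sqrt(D) * P.ravel()])

x = np.array([1.0, 2.0, 0.0])
y = np.array([3.0, 0.0, 4.0])
PV = np.diag([0.0, 0.0, 1.0])     # projector onto span(e_3)
PW = np.diag([0.0, 1.0, 1.0])     # projector onto span(e_2, e_3)
D = 0.25

# Plain dot product of the embeddings ...
via_embedding = embed(x, PV, D) @ embed(y, PW, D)
# ... versus <x, y> + D <p_V, p_W> with the entrywise matrix scalar product.
direct = x @ y + D * np.sum(PV * PW)
```

Because the embedding is an ordinary vector, it can be passed to any method that accepts vectors or a precomputed kernel matrix.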

Remark 3.8.

Observe that the value of function \eqref{eq:productt} strictly depends on the selection of basepoints, so it is not a well-defined scalar product on the space of classical affine subspaces. Indeed, $x+V$ defines the same affine subspace as $x+v+V$, where $v\in V$, but such shifts may lead to different values of the right hand side of \eqref{eq:productt}. However, it is a well-defined scalar product in the case of pointed affine subspaces, because two different selections of basepoints give different pointed affine subspaces (see Remark 3.2). In consequence, it can be safely used for the generalized missing data points considered in this paper.

The following proposition shows how to calculate a scalar product between matrices defining two orthogonal projections onto linear subspaces.

Proposition 3.9.

Let us consider subspaces

$$V=\mathrm{span}(v_{j}:j\in J),\quad W=\mathrm{span}(w_{k}:k\in K),$$

where $v_{j}$ and $w_{k}$ are orthonormal sequences. If $p_{V},p_{W}$ denote orthogonal projections onto $V,W$ , respectively, then

$$\langle p_{V},p_{W}\rangle=\sum_{j\in J,k\in K}\langle v_{j},w_{k}\rangle^{2}.$$

Proof 3.10.

By the definition of orthogonal projections and the scalar product between matrices, we have

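$$\langle p_{V},p_{W}\rangle=\sum_{j\in J,k\in K}\mathrm{tr}\big((v_{j}v_{j}^{T})^{T}(w_{k}w_{k}^{T})\big).$$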

Making use of $\mathrm{tr}(AB)=\mathrm{tr}(BA)$ , we get

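$$\mathrm{tr}\big((v_{j}v_{j}^{T})^{T}(w_{k}w_{k}^{T})\big)=\mathrm{tr}(v_{j}v_{j}^{T}w_{k}w_{k}^{T})=\mathrm{tr}(v_{j}^{T}w_{k}w_{k}^{T}v_{j})=(v_{j}^{T}w_{k})\cdot(w_{k}^{T}v_{j})=\langle v_{j},w_{k}\rangle^{2}.$$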

Finally,

$$\langle p_{V},p_{W}\rangle=\sum_{j\in J,k\in K}\langle v_{j},w_{k}\rangle^{2}.$$

Concluding, the scalar product between embedding of two generalized missing data points given by Definition 3.7 can be calculated as:

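$$\langle x+V,y+W\rangle_{D}=\langle x,y\rangle+D\sum_{i,j}(p_{V})_{ij}(p_{W})_{ij}=\langle x,y\rangle+D\sum_{j\in J,k\in K}\langle v_{j},w_{k}\rangle^{2},$$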

where $(v_{j})_{j\in J},(w_{k})_{k\in K}$ are orthonormal bases of $V,W$ , respectively. The last expression can be more numerically efficient if the dimension of the subspaces (the number of missing attributes) is much smaller than the dimension of the whole space.
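The two expressions can be compared numerically. In the sketch below the orthonormal bases are stored as matrix columns, so the double sum of squared scalar products is the squared Frobenius norm of $V^{T}W$; the function name `subspace_kernel` and the random test data are assumptions.

```python
import numpy as np

def subspace_kernel(x, V, y, W, D=1.0):
    """<x, y> + D * sum_{j,k} <v_j, w_k>^2 for orthonormal columns of V and W."""
    return float(x @ y + D * np.sum((V.T @ W) ** 2))

rng = np.random.default_rng(0)
N = 6
x, y = rng.standard_normal(N), rng.standard_normal(N)
V, _ = np.linalg.qr(rng.standard_normal((N, 2)))   # orthonormal basis, dim 2
W, _ = np.linalg.qr(rng.standard_normal((N, 3)))   # orthonormal basis, dim 3
D = 0.5

fast = subspace_kernel(x, V, y, W, D)
# Reference value computed from the full N x N projection matrices.
full = x @ y + D * np.sum((V @ V.T) * (W @ W.T))
```

The basis form works with an $n\times k$ product instead of two $N\times N$ matrices, which is the saving the text refers to when few attributes are missing.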

Remark 3.11.

One of typical representations of missing data $(x,J)$ relies on filling unknown attributes and supplying it with a binary flag vector $\mathds{1}_{J}\in\mathbb{R}^{N}$ , in which bit $1$ denotes coordinate belonging to $J$ . This leads to the embedding of the missing data into a vector space given by

$$(x,J)\to(x,\mathds{1}_{J})\in\mathbb{R}^{N}\times\mathbb{R}^{N}.$$

Then, the scalar product of such embedding can be defined by

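$$\langle(x,\mathds{1}_{J}),(y,\mathds{1}_{K})\rangle=\langle x,y\rangle+\langle\mathds{1}_{J},\mathds{1}_{K}\rangle=\langle x,y\rangle+\mathrm{card}(J\cap K).\label{eq:scalar}$$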

It is worth noting that the formula \eqref{eq:scalar} coincides with the scalar product defined for generalized missing data \eqref{std:product} (for $D=1$). Indeed, if $V=\mathrm{span}(e_{j}:j\in J)$ and $W=\mathrm{span}(e_{k}:k\in K)$, for $J,K\subset\{1,\ldots,N\}$, then by Proposition 3.9 we have

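$$\langle p_{V},p_{W}\rangle=\sum_{j\in J,k\in K}\langle e_{j},e_{k}\rangle^{2}=\sum_{l\in J\cap K}\langle e_{l},e_{l}\rangle^{2}=\sum_{l\in J\cap K}1=\mathrm{card}(J\cap K),$$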

which is exactly the second term on the RHS of \eqref{eq:scalar}.

Therefore, our approach generalizes and theoretically justifies the flag approach to missing data analysis. The importance of our construction lies in its generality, which in particular allows for performing typical affine transformations of data. In the case of the flag representation, there is no obvious way to apply such mappings to the flag vector.
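As a small worked check of Remark 3.11 (the numbers are chosen here for illustration and do not come from the paper), take $N=4$, $J=\{1,3\}$ and $K=\{3,4\}$, so that $\mathds{1}_{J}=(1,0,1,0)$ and $\mathds{1}_{K}=(0,0,1,1)$. Then

$$p_{V}=\mathrm{diag}(1,0,1,0),\quad p_{W}=\mathrm{diag}(0,0,1,1),\quad\langle p_{V},p_{W}\rangle=\sum_{i,j}(p_{V})_{ij}(p_{W})_{ij}=1=\mathrm{card}(J\cap K)=\langle\mathds{1}_{J},\mathds{1}_{K}\rangle.$$

Only the shared missing coordinate $3$ contributes, in both the projection form and the flag form.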

4 Experiments

To illustrate our approach we applied it in SVM classification experiments in which the data were whitened before classification. We used examples retrieved from the UCI repository combined with two strategies for attribute removal: random and structural. Finally, we used one real medical dataset, in which features are missing through a real process.

For all cases, the following procedure was applied. First, we filled missing features using one of four strategies mentioned in the paper:

1. Mean: average value of the feature over the training set.
2. Median: median of the feature over the training set.
3. Zero imputation: missing features were filled with zeros.
4. Most probable imputation: as described in section 3.2.

For simplicity, the mean and covariance matrix were estimated from the training set using the norm R package. (Footnote 3: Since the use of the EM method implemented in norm is justified in the missing at random case, one could also estimate the mean and covariance based on existing attributes.)

Next, we performed a whitening of dataset (making use of the parameters returned by norm) based on two approaches:

1. No information: feature vectors with imputed missing attributes were whitened.
2. Subspace: feature vectors with imputed values were joined with corresponding projection matrices and then the entire representations were whitened according to Definition 3.3.

The above scenarios represent classical imputation and our pointed affine subspace approach. We would like to investigate how the information preserved in the subspace influences the classification results.

Finally, we calculated the scalar products (kernel matrices) for such representations of data and trained SVM classifier implemented in libsvm (Chang and Lin, 2011). Missing features of test set instances were filled and transformed based on a training set only.

All experiments used double 5-fold cross validation. More precisely, for every division into train and test sets, the required hyperparameters were tuned using inner 5-fold cross validation applied to the training set. The combination of parameters maximizing the mean accuracy score (on the validation sets) was used to learn a final classifier on the entire training set, while the performance was evaluated on a testing set that was not used during training. The accuracy was averaged over all 5 trials. We tuned the standard margin parameter $C$ as well as the parameter $D$ in the formula of the scalar product for the subspace embedding. We performed a grid search in the following ranges: $C=\{10^{k}\,:\ k=-2,-1,0,1\}$ and $D=\{\frac{1}{2^{k}}\,:\ k=0,1,\ldots,10\}$.
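The double cross validation and the two grids can be sketched in plain Python. This is a schematic outline under stated assumptions, not the authors' pipeline: `fit_and_score` is a hypothetical callback that would impute, whiten, build the kernel, train the SVM on the given training indices and return the accuracy on the evaluation indices, and the fold assignment by shuffling and striding is a choice made for this sketch.

```python
import itertools
import random

C_GRID = [10.0 ** k for k in range(-2, 2)]      # 10^-2, ..., 10^1
D_GRID = [1.0 / 2 ** k for k in range(0, 11)]   # 1, 1/2, ..., 1/1024

def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k disjoint folds."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv(n_samples, fit_and_score, k=5, seed=0):
    """Mean outer-fold accuracy; fit_and_score(train, evaluate, C, D) -> accuracy."""
    rng = random.Random(seed)
    outer = k_folds(range(n_samples), k, rng)
    outer_scores = []
    for i, test_idx in enumerate(outer):
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = k_folds(train_idx, k, rng)

        def inner_mean(params):
            C, D = params
            scores = []
            for v, val_idx in enumerate(inner):
                fit_idx = [j for f, fold in enumerate(inner) if f != v for j in fold]
                scores.append(fit_and_score(fit_idx, val_idx, C, D))
            return sum(scores) / len(scores)

        # Pick the (C, D) pair with the best mean validation accuracy.
        best_C, best_D = max(itertools.product(C_GRID, D_GRID), key=inner_mean)
        # Retrain on the whole outer training set, evaluate on the held-out fold.
        outer_scores.append(fit_and_score(train_idx, test_idx, best_C, best_D))
    return sum(outer_scores) / len(outer_scores)
```

For the "no information" representation the parameter $D$ has no effect, so the same loop could be run with a one-element grid for $D$.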

4.1 UCI datasets

We used three UCI datasets (for datasets with more than two classes we selected the two most numerous classes): breast cancer (BC), ionosphere (IS) and yeast (Y) (Asuncion and Newman, 2007). In the first case, we randomly removed $90\%$ of features. In the second case, we defined a structural process of attribute removal. More precisely, we drew $N$ points $x_{1},\ldots,x_{N}$ from a dataset $X\subset\mathbb{R}^{N}$. Then, for every $x\in X$ we removed its $i$-th attribute with probability $\exp(-t\|x-x_{i}\|_{\Sigma})$, where $\|x\|_{\Sigma}$ denotes the Mahalanobis norm of $x$ with respect to $\Sigma$ and $t>0$ was chosen to remove approximately $90\%$ of attributes.
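The structural removal process can be sketched as follows, assuming the Mahalanobis norm $\|z\|_{\Sigma}=\sqrt{z^{T}\Sigma^{-1}z}$. The function name `structural_missing_mask`, the synthetic data and the value of `t` are hypothetical; the tuning of $t$ to reach roughly $90\%$ removal is not shown.

```python
import numpy as np

def structural_missing_mask(X, anchors, Sigma, t, rng):
    """Boolean mask (True = removed) and the removal probabilities.

    X       : (n, N) data matrix
    anchors : (N, N) array whose i-th row is the point x_i tied to attribute i
    """
    Sinv = np.linalg.inv(Sigma)
    diff = X[:, None, :] - anchors[None, :, :]            # diff[n, i] = X[n] - x_i
    sq = np.einsum('nij,jk,nik->ni', diff, Sinv, diff)    # squared Mahalanobis norms
    dist = np.sqrt(np.maximum(sq, 0.0))                   # guard against round-off
    prob = np.exp(-t * dist)                              # P(remove attribute i of X[n])
    return rng.random(prob.shape) < prob, prob

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Sigma = np.cov(X, rowvar=False)
anchors = X[:3]               # stand-in for N points drawn from the dataset
mask, prob = structural_missing_mask(X, anchors, Sigma, t=0.1, rng=rng)
```

Points close to the anchor $x_{i}$ lose attribute $i$ with high probability, so the missingness pattern carries information about the location of the point.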

The results presented in Table 1 show no clear benefit from identifying absent attributes when the features were missing completely at random: the subspace embedding changes the accuracy by at most 0.02, which is comparable to the reported standard deviations. One can observe that most probable point imputation usually provided the highest accuracy among the imputation strategies.

In the case of structurally missing features, Table 2, the proposed subspace approach gave better classification results for all datasets and for all cases of imputations. Moreover, the most probable point imputation outperformed other strategies of filling missing coordinates on two out of three datasets.

4.2 Medical data

In this application we considered a real angiological dataset acquired from Jagiellonian Center of Experimental Therapeutic containing patients' examinations, http://jcet.eu/new_en/. The goal was to find patients with atherosclerosis. Innovative medical tests are very expensive, time-consuming and in some cases they cannot be successfully completed due to the patient's condition. In consequence, the research database contains many empty cells, which is the effect of a purely structural process. Since some of the parameters are discrete while others are real-valued and presented in different scales, whitening the data is a natural preprocessing step.

The results illustrated in Table 3 partially confirm the hypothesis suggested in the previous experiment. Indeed, the proposed subspace embedding gave higher accuracy for the mean, median and most probable imputation strategies and the same accuracy for zero imputation, but the benefit from its application was not significant. It is difficult to decide which imputation strategy was optimal because all of them provided comparable results.

5 Conclusion

The paper generalized the existing approach of identifying missing attributes with binary flags. To enable appropriate affine transformations of data, we represented incomplete data as pointed affine subspaces and embedded them into a vector space by linking a pointed subspace with a basepoint joined with a corresponding projection matrix. In the same spirit we proposed to select a basepoint as the most probable point from a subspace, which extends the well-known zero imputation strategy. Such a combination provided the best performance in conducted classification experiments in most cases.

Bibliography

Asuncion and Newman (2007). Arthur Asuncion and David J. Newman. UCI Machine Learning Repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
Berg et al. (2005). Alexander C. Berg, Tamara L. Berg, and Jitendra Malik. Shape matching and object recognition using low distortion correspondences. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 26–33. IEEE, 2005.
Burke et al. (1997). Lora E. Burke, Jacqueline M. Dunbar-Jacob, and Martha N. Hill. Compliance with cardiovascular disease prevention strategies: a review of the research. Annals of Behavioral Medicine, 19(3):239–263, 1997.
Chang and Lin (2011). Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.
Chechik et al. (2008). Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1–21, 2008.
Dekel et al. (2010). Ofer Dekel, Ohad Shamir, and Lin Xiao. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149–178, 2010.
Dick et al. (2008). Uwe Dick, Peter Haider, and Tobias Scheffer. Learning from incomplete data with infinite imputations. In Proceedings of the International Conference on Machine Learning, pages 232–239. ACM, 2008.
Ghahramani and Jordan (1994). Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems, pages 120–127. Citeseer, 1994.
