Clustering with Jointly Learned Nonlinear Transforms Over Discriminating Min-Max Similarity/Dissimilarity Assignment

Dimche Kostadinov; Behrooz Razeghi; Taras Holotyak; Slava Voloshynovskiy

arXiv:1901.10760·cs.LG·January 31, 2019

TL;DR

This paper introduces a novel clustering method that jointly learns nonlinear transforms with priors, using a min-max measure for discriminative assignment, demonstrating improved performance on image clustering tasks.

Contribution

It proposes a new clustering framework based on jointly learned nonlinear transforms and a min-max discriminative measure, enhancing clustering accuracy.

Findings

01

Outperforms the compared state-of-the-art clustering methods on COIL and E-YALE-B in both CA and NMI, and ranks second on ORL

02

Demonstrates the effectiveness of jointly learned nonlinear transforms

03

Validates the approach through numerical experiments

Abstract

This paper presents a novel clustering concept that is based on jointly learned nonlinear transforms (NTs) with priors on the information loss and the discrimination. We introduce a clustering principle that is based on the evaluation of a parametric min-max measure for the discriminative prior. The decomposition of the prior measure allows us to break down the assignment into two steps. In the first step, we apply NTs to a data point in order to produce candidate NT representations. In the second step, we perform the actual assignment by evaluating the parametric measure over the candidate NT representations. Numerical experiments on an image clustering task validate the potential of the proposed approach. The evaluation shows advantages in comparison to the state-of-the-art clustering methods.

Tables

Table 1: The computation time per iteration $t$ [sec] of the proposed algorithm, the condition number $\kappa_{n}({\bf A})=\frac{\sigma_{max}}{\sigma_{min}}$ and the expected mutual coherence $\mu({\bf A})$ of the linear map ${\bf A}$.

Dataset	$\kappa_{n}$	$\mu$	$t$ [sec]
COIL	16	.2e-5	46
ORL	21	.3e-5	48
E-YALE-B	31	.1e-5	51
AR	28	.3e-5	69

Table 2: The clustering performance on the COIL, ORL, E-YALE-B and AR databases, evaluated using the Cluster Accuracy (CA) and the Normalized Mutual Information (NMI) metrics.

Metric	COIL	ORL	E-YALE-B	AR
CA (%)	89.2	75.4	96.8	94.8
NMI (%)	91.2	84.1	95.3	94.1

Table 3: Comparison of the Cluster Accuracy (CA, %) between the state-of-the-art methods (Lu et al., 2013), (Zheng et al., 2011), (Yin et al., 2016), (Guo, 2015) and (Kodirov et al., 2016) and the proposed method (*).

Method	COIL	ORL	E-YALE-B
CASS (Lu et al., 2013)	59.1	68.8	81.9
GSC (Zheng et al., 2011)	80.9	61.5	74.2
NSLRR (Yin et al., 2016)	62.8	55.3	/
SDRAM (Guo, 2015)	86.3	70.6	92.3
RGRSC (Kodirov et al., 2016)	88.1	76.3	95.2
(*)	89.2	75.4	96.8

Table 4: Comparison of the Normalized Mutual Information (NMI, %) between the state-of-the-art methods (Lu et al., 2013), (Zheng et al., 2011), (Yin et al., 2016), (Guo, 2015) and (Kodirov et al., 2016) and the proposed method (*).

Method	COIL	ORL	E-YALE-B
CASS (Lu et al., 2013)	64.1	78.1	78.1
GSC (Zheng et al., 2011)	87.5	76.2	75.0
NSLRR (Yin et al., 2016)	75.6	74.5	/
SDRAM (Guo, 2015)	89.1	80.2	89.1
RGRSC (Kodirov et al., 2016)	89.3	86.1	94.2
(*)	91.2	84.1	95.3

Table 5: The k-NN accuracy results using the assigned NT representations and the original data (OD) representation.

Representation	COIL	ORL	E-YALE-B	AR
acc. NT	97.1	96.9	96.8	96.0
acc. OD	94.0	94.5	93.4	91.6

Equations

$\{\hat{\bf Y},\hat{\bf C},\hat{\bm{\theta}}\}=\arg\min_{{\bf Y},{\bf C},\bm{\theta}}\sum_{i=1}^{CK}\left[g({\bf x}_{i},{\bf C}{\bf y}_{i})+\lambda_{0}f_{s}({\bf y}_{i})+\lambda_{1}f_{c}({\bf y}_{i},\bm{\theta})\right]+\lambda_{2}f_{d}({\bf C}),$

$\mathcal{T}_{T}$

$\mathcal{P}_{T}=\dots$

$\mathcal{P}_{c_{1},c_{2}}=\{{\bf A},\bm{\tau}_{c_{1}},\bm{\nu}_{c_{2}}\}$

$\{c_{1},c_{2}\}\in\dots$

$f_{\bm{\theta}}:{\bf y}_{i}=\dots$

$\{\hat{c}_{1}j_{1}(i),\hat{c}_{2}j_{2}(i)\}\simeq\arg\min_{c_{1}\in\mathcal{C}_{d}}\dots\varsigma({\bf y}|_{\{c_{1},c_{2}\}},\bm{\tau}_{c_{1}}),$

${\bf A}{\bf x}_{i}={\bf y}|_{\{c_{1},c_{2}\}}+{\bf z}|_{\{c_{1},c_{2}\}}$

${\bf y}|_{\{c_{1},c_{2}\}}=\dots$

$p({\bf y}_{i}|{\bf x}_{i},{\bf A})=\int_{\bm{\theta}}p({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})d\bm{\theta}.$

$p({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})\propto p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})p({\bf y}_{i},\bm{\theta}|{\bf A}).$

$p(\bm{\theta},{\bf y}_{i}|{\bf A})=p(\bm{\theta},{\bf y}_{i}),$

$\bm{\theta}=\{\bm{\theta}_{1},\bm{\theta}_{2}\}$

dissimilarity: $\bm{\theta}_{1}=\{\bm{\tau}_{1},...,\bm{\tau}_{C_{d}}\}$

similarity: $\bm{\theta}_{2}=\{\bm{\nu}_{1},...,\bm{\nu}_{C_{s}}\}$

$p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})\propto\exp\left[-\frac{1}{\beta_{0}}u_{r}({\bf A}{\bf x}_{i},{\bf y}_{i})-\frac{1}{\beta_{a}}u_{a}({\bf A}{\bf x}_{i},\bm{\theta})\right],$

$u_{r}({\bf A}{\bf x}_{i},{\bf y}_{i})=\|{\bf A}{\bf x}_{i}-{\bf y}_{i}\|_{2}^{2}.$

$u_{a}({\bf A}{\bf x}_{i},\bm{\theta})=\|{\bf A}{\bf x}_{i}-\bm{\tau}_{j_{1}(i)}-\bm{\nu}_{j_{2}(i)}\|_{2}^{2},$

$p(\bm{\theta},{\bf y}_{i})\propto\exp\left[-\frac{1}{\beta_{d}}f_{c}(\bm{\theta},{\bf y}_{i})-\frac{1}{\beta_{E}}u_{p}(\bm{\theta})-\frac{1}{\beta_{1}}\|{\bf y}_{i}\|_{1}\right],$

$f_{c}({\bf y}_{i},\bm{\theta})=\min_{c_{1}\in\mathcal{C}_{d}}\left[\frac{\varrho({\bf y}_{i},\bm{\tau}_{c_{1}})}{\max_{c_{2}\in\mathcal{C}_{s}}\varrho({\bf y}_{i},\bm{\nu}_{c_{2}})}+\varsigma({\bf y}_{i},\bm{\tau}_{c_{1}})\right].$

$u_{p}(\bm{\theta})=\dots$

$\bm{\theta}=\{\bm{\theta}_{1},\bm{\theta}_{2}\}$, while:

$\bm{\theta}_{\setminus c_{1}}=\{\{\bm{\tau}_{1},...,\bm{\tau}_{c_{1}-1},\bm{\tau}_{c_{1}+1},...,\bm{\tau}_{C_{d}}\},\bm{\theta}_{2}\},$

$\bm{\theta}_{\setminus c_{2}}=\{\bm{\theta}_{1},\{\bm{\nu}_{1},...,\bm{\nu}_{c_{2}-1},\bm{\nu}_{c_{2}+1},...,\bm{\nu}_{C_{s}}\}\}.$

$\{\hat{\bf Y},\hat{\bf A},\hat{\bm{\theta}}\}=\arg\min_{{\bf Y},{\bf A},\bm{\theta}}\sum_{i=1}^{CK}\underbrace{\frac{1}{2}\|{\bf A}{\bf x}_{i}-{\bf y}_{i}\|_{2}^{2}+\lambda_{2}u_{a}({\bf A}{\bf x}_{i},\bm{\theta})}_{-\log p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})}+\underbrace{\lambda_{0}f_{c}({\bf y}_{i},\bm{\theta})+\lambda_{1}\|{\bf y}_{i}\|_{1}+\lambda_{E}u_{p}(\bm{\theta})}_{-\log p({\bf y}_{i},\bm{\theta})}+\underbrace{f_{d}({\bf A})}_{-\log p({\bf A})},$

$\hat{\bf Y}=\arg\min_{\bf Y}\sum_{i=1}^{CK}\left[\frac{1}{2}\|{\bf q}_{i}-{\bf y}_{i}\|_{2}^{2}+\lambda_{1}\|{\bf y}_{i}\|_{1}+\lambda_{0}f_{c}({\bf y}_{i},\bm{\theta})\right],$

$\hat{\bf y}_{i}=\arg\min_{{\bf y}_{i}}\dots$

$\arg\min_{c_{1}\in\mathcal{C}_{d},\,c_{2}\in\mathcal{C}_{s}}\dots\lambda_{0}\left(\frac{\varrho({\bf y}_{i},\bm{\tau}_{c_{1}})}{\varrho({\bf y}_{i},\bm{\nu}_{c_{2}})}+\varsigma({\bf y}_{i},\bm{\tau}_{c_{1}})\right)s_{p}(c_{1},c_{2}),$

${\bf y}|_{\{c_{1},c_{2}\}}=\arg\min_{{\bf y}|_{\{c_{1},c_{2}\}}}\left(\frac{1}{2}\|{\bf q}_{i}-{\bf y}|_{\{c_{1},c_{2}\}}\|_{2}^{2}+\lambda_{1}{\bf 1}^{T}|{\bf y}|_{\{c_{1},c_{2}\}}|\right)l_{1}(c_{1},c_{2})+\dots$

Taxonomy

Topics: Face and Expression Recognition · Neural Networks and Applications · Remote-Sensing Image Classification

Full text

Clustering with Jointly Learned Nonlinear Transforms Over

Discriminating Min-Max Similarity/Dissimilarity Assignment

Dimche Kostadinov

Behrooz Razeghi

Taras Holotyak

Slava Voloshynovskiy

Abstract

This paper presents a novel clustering concept that is based on jointly learned nonlinear transforms (NTs) with priors on the information loss and the discrimination. We introduce a clustering principle that is based on the evaluation of a parametric min-max measure for the discriminative prior. The decomposition of the prior measure allows us to break down the assignment into two steps. In the first step, we apply NTs to a data point in order to produce candidate NT representations. In the second step, we perform the actual assignment by evaluating the parametric measure over the candidate NT representations. Numerical experiments on an image clustering task validate the potential of the proposed approach. The evaluation shows advantages in comparison to the state-of-the-art clustering methods.

Machine Learning, ICML

1 Introduction

Clustering is one of the most important unsupervised learning tasks in the areas of signal processing, machine learning, computer vision and artificial intelligence, and it has been extensively studied for decades. Commonly, data clustering algorithms (Cover and Thomas, 2006), (Hoyer and Dayan, 2004), (Guo et al., 2012), (Jiang et al., 2013), (Cai et al., 2014), (Shekhar et al., 2014), (Xu et al., 2005), (Bach and Harchaoui, 2008) and (Krause et al., 2010) address the problem of identifying and describing the underlying clusters that explain the data.

Among the various types of clustering algorithms, the k-means and matrix decomposition based methods are among the most popular and practically useful approaches. Given a data set, in the most common case, the objective of a clustering algorithm is to minimize the intra-cluster cost, i.e., the measured dissimilarity between a cluster and the data points assigned to it, and to maximize the inter-cluster cost, i.e., the measured dissimilarity between a cluster and the data points that do not belong to it. A data factorization/decomposition model (Cover and Thomas, 2006), (Hoyer and Dayan, 2004), (Krause et al., 2010), (Vidal, 2011) with constraints summarizes a general problem formulation that also subsumes this basic case. We express it as follows:

$\{\hat{\bf Y},\hat{\bf C},\hat{\bm{\theta}}\}=\arg\min_{{\bf Y},{\bf C},\bm{\theta}}\sum_{i=1}^{CK}\left[g({\bf x}_{i},{\bf C}{\bf y}_{i})+\lambda_{0}f_{s}({\bf y}_{i})+\lambda_{1}f_{c}({\bf y}_{i},\bm{\theta})\right]+\lambda_{2}f_{d}({\bf C}),$ (1)

where ${\bf C}=[{\bf c}_{1},...,{\bf c}_{C}]\in\Re^{N\times C}$ are the clusters, ${\bf x}_{i}\in\Re^{N}$ is the $i$-th data point, ${\bf y}_{i}\in\Re^{C}$ is its data representation over the clusters, $\bm{\theta}$ are parameters responsible for a task-specific functionality, $g(.,.)$ is the similarity measure between the data point and the representation over the clusters, $f_{c}(.,.)$ and $f_{s}(.)$ are the task-specific and sparsity penalty functions, respectively, $f_{d}(.)$ is a penalty on the cluster properties and $\{\lambda_{0},\lambda_{1},\lambda_{2}\}$ are Lagrangian parameters. The cluster assignment in (1) is based on the synthesis model (Aharon et al., 2006), (R. et al., 2013), where usually ${\bf x}_{i}$ is reconstructed and represented by a sparse linear combination ${\bf y}_{i}$ over the clusters ${\bf C}$ as ${\bf x}_{i}\simeq{\bf C}{\bf y}_{i}$. In essence, the crucial elements behind this clustering principle are the similarity measure $g(.,.)$ and the penalty functions $f_{c}(.,.)$, $f_{s}(.)$ and $f_{d}(.)$, which have a significant role in the estimation of the cluster vectors and impact the resulting cluster assignment. Due to the model used, solving (1) might not only have high computational complexity, but it might also be difficult to model and impose the constraints ($f_{c}(.,.)$, $f_{s}(.)$ and $f_{d}(.)$ in (1)) that are required in order to preserve specific data properties, like structured sparsity (Hoyer and Dayan, 2004), pairwise constraints (Shekhar et al., 2014), data subspace (Elhamifar and Vidal, 2009), (Vidal, 2011), (Lu et al., 2012), graph structure and manifold curvature (Krause et al., 2010) and (Daitch et al., 2009).

On the other hand, besides the synthesis model, the other two commonly used models in the area of signal processing are the analysis model (Rubinstein et al., 2013) and the sparsifying transform model (Rubinstein and Elad, 2014) and (Ravishankar and Bresler, 2014). In the transform model, ${\bf y}_{i}$ is a nonlinear transform (NT) representation that is estimated using a linear mapping ${\bf A}{\bf x}_{i}$, with map ${\bf A}\in\Re^{M\times N}$, followed by an element-wise nonlinearity, and it represents a solution to a constrained projection problem. Under this model, the computational complexity of estimating the representation ${\bf y}_{i}$ is low, but, so far, the model has been neither addressed nor considered as a basis for clustering or for learning discriminative NT representations. In addition, in spite of the fact that an NT model offers a high degree of freedom for modeling a wide class of constraints (many nonlinearities, e.g., ReLU, $p$-norms, elastic net-like, $\frac{\ell_{1}}{\ell_{2}}$-norm ratio, binary encoding, ternary encoding, etc., can be expressed and modeled by a nonlinear transform), a robust assignment cost under the NT model that is based on a parametric measure which jointly takes into account not only the similarity, but also the dissimilarity contribution, has been neither studied nor explored.
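As an illustration of the transform model described above, the following minimal sketch computes a representation as a linear map followed by an element-wise nonlinearity. It assumes NumPy and picks soft thresholding as the nonlinearity, which is the closed-form minimizer of $\frac{1}{2}\|{\bf A}{\bf x}-{\bf y}\|_{2}^{2}+\lambda\|{\bf y}\|_{1}$ over ${\bf y}$. The function name, the toy map `A` and the threshold `lam` are illustrative assumptions and not the transform learned in this paper.

```python
import numpy as np

def soft_threshold_transform(A, x, lam):
    """Sparsifying-transform representation of x (illustrative stand-in).

    Returns the minimizer over y of 0.5 * ||A x - y||_2^2 + lam * ||y||_1,
    i.e. element-wise soft thresholding of the linear map A @ x.
    """
    q = A @ x                                             # linear mapping
    return np.sign(q) * np.maximum(np.abs(q) - lam, 0.0)  # element-wise nonlinearity

# Hypothetical overcomplete map (M = 3 > N = 2) and data point.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([2.0, -0.5])
y = soft_threshold_transform(A, x, lam=1.0)
```

The cost of producing `y` is one matrix-vector product plus an element-wise operation, which is the low complexity referred to in the text.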

1.1 Nonlinear Transform Model, Assignment Principle and Learning Strategy Outline

In this paper, we introduce an assignment based nonlinear transform model for clustering.

Assignment Based NT Model. We address the problem of estimating the parameters that model the probability $p({\bf y}_{i}|{\bf x}_{i},{\bf A})=\int_{\bm{\theta}}p({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})d\bm{\theta}$ of assigning a cluster and a nonlinear transform representation ${\bf y}_{i}\in\Re^{M}$ for input data ${\bf x}_{i}\in\Re^{N}$ by using the parameters $\bm{\theta}$ and ${\bf A}\in\Re^{M\times N}$.

We motivate the use of ${\bf A}$ by the need to extract useful data properties. If $M=N$, a suitable prior allows us to model a metric (linear map) to achieve invariance (Li et al., 2016). In this paper, we model an overcomplete ${\bf A}$ with $M>N$ using the prior $p({\bf A})$. Essentially, with our prior, we introduce redundancy in a constrained way, while we approximately preserve the properties of the original data in order to pronounce discrimination among the assigned NT representations ${\bf y}_{i}$. Nonetheless, we note that even in the case of ${\bf A}={\bf I}$, we can use our model, which then reduces to a nonlinear assignment in the original space of ${\bf x}_{i}$, i.e., $p({\bf y}_{i}|{\bf x}_{i})=\int_{\bm{\theta}}p({\bf y}_{i},\bm{\theta}|{\bf x}_{i})d\bm{\theta}$. Our assignment measure is nonlinear. In general, not all nonlinear functions can highlight relevant data properties that are related to discrimination. We consider a piece-wise linear nonlinearity. In order to address the robustness in the assignment, we explicitly model parameters $\bm{\theta}=\{\bm{\theta}_{1},\bm{\theta}_{2}\}$ related to both the similarity and the dissimilarity contribution. Their role is the discrimination functionality that we address by using a composite min-max assignment measure. In its evaluation, due to the decomposition of the min-max assignment measure, the pair $\{\bm{\tau}_{c_{1}},\bm{\nu}_{c_{2}}\}$ from $\{\bm{\theta}_{1},\bm{\theta}_{2}\}=\{\{\bm{\tau}_{1},...,\bm{\tau}_{C_{d}}\},\{\bm{\nu}_{1},...,\bm{\nu}_{C_{s}}\}\}\in\Re^{M\times(C_{d}+C_{s})}$ has an additional interpretation. It is viewed as an NT-specific parameter, which is used to produce the respective candidate NT representation.

Cluster and NT Representation Assignment. During cluster assignment, instead of description by clusters, we rely on candidate NTs. We estimate a single candidate NT representation as a solution to a direct, constrained projection problem. To attain unique and distinctive patterns in the resulting candidate NT representation, we parameterize the corresponding candidate NT with shared ${\bf A}$ and distinct $\{\bm{\tau}_{c_{1}},\bm{\nu}_{c_{2}}\}$ that are used for the element-wise nonlinearity.

When (1) is used, ${\bf x}_{i}$ has only one representation, which, under a certain similarity score, is usually related to the likelihood of the assignment w.r.t. the clusters. On the other hand, in our model, we apply a number of candidate NTs to a data point ${\bf x}_{i}$, which results in a number of candidate NT representations. Afterwards, our assignment is based on evaluating a min-max similarity/dissimilarity score using all of the candidate NT representations and their corresponding parameters $\{\bm{\tau}_{c_{1}},\bm{\nu}_{c_{2}}\}$. Nonetheless, based on the same assignment score, we describe ${\bf x}_{i}$ by only one candidate NT representation. In fact, in this way, we simultaneously assign both the cluster index and the NT representation ${\bf y}_{i}$ based on the evaluation of the min-max measure.

Learning Strategy. In order to estimate the parameters of our model, we consider $p({\bf Y},{\bf A}|{\bf X})=p({\bf Y}|{\bf X},{\bf A})p({\bf A}|{\bf X})=\prod_{i=1}^{CK}\int_{\bm{\theta}}p({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})d\bm{\theta}p({\bf A}|{\bf x}_{i})$. Its maximization over ${\bf Y},\bm{\theta}$ and ${\bf A}$ is difficult. We therefore use a point-wise approximation to the marginal $\int_{\bm{\theta}}p({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})d\bm{\theta}$, which allows us to derive an efficient learning algorithm.

Compared with the factorization based clustering methods (1), the fundamental difference of our approach is the underlying model. The factorization/decomposition model addresses the joint data reconstruction and cluster estimation with constraints by solving an inverse problem w.r.t. the model ${\bf x}_{i}={\bf C}{\bf y}_{i}+{\bf e}_{i}$, where ${\bf e}_{i}\in\Re^{N}$ is the error vector. In our assignment based nonlinear transform model, we address the joint learning of data projections (NTs) with information loss and discriminative priors by solving direct, constrained projection problems, based on candidate models of the form ${\bf A}{\bf x}_{i}={\bf y}_{i}+{\bf z}_{i}$, where ${\bf z}_{i}\in\Re^{M}$ is the NT error vector.

1.2 Contributions

In the following, we outline our contributions.

(i) We introduce a novel cluster assignment principle that is centered on two elements: (1) joint modeling and learning of nonlinear transforms (NTs) with priors and (2) cluster and NT representation assignment based on a min-max score. To the best of our knowledge, our discriminative assignment principle is the first of its kind that:

(a) Introduces a clustering concept that is based on modeling a direct problem.

(b) Addresses a trade-off between the robustness of the cluster assignment and the compactness of the NT representation by allowing a reduction or an extension of the NT dimensionality while increasing or decreasing the number of the discrimination parameters $\bm{\theta}$.

(c) Offers cluster assignment over a wide class of similarity score functions, including a min-max one, while enabling efficient estimation of the NT representation.

(d) Allows a rejection option and cluster grouping over continuous, discontinuous and overlapping regions in the transform domain.

(ii) We propose an efficient learning strategy in order to estimate the parameters of the NTs. We implement it by an iterative alternating algorithm with three steps. At each step we give an exact and approximate closed form solution.

(iii) We present numerical experiments that validate our model and learning principle on several image data sets. Our preliminary results on an image clustering task demonstrate advantages in comparison to the state-of-the-art methods, w.r.t. the computational efficiency in training and test time and the used clustering performance measures.

2 Related Work

In the following, we describe the related prior work.

K-means, Matrix Factorization Models and Dictionary Learning. Factor analysis (Child, 2006) and matrix factorization (Hoyer and Dayan, 2004) rely on a decomposition into hidden features, without or with constraints. One special case, with only a constraint on the sparsity of the hidden representation that amounts to a “hard” assignment, is the basic k-means algorithm (Cover and Thomas, 2006). When discrimination constraints are present, they act as regularization; in the discriminative dictionary learning methods (Jiang et al., 2013), (Cai et al., 2014) and (Shekhar et al., 2014), they were mainly defined using labels.

Kernel, Subspace and Manifold Based Clustering. Intended to capture the nonlinear structure of the data with outliers and noise, the kernel k-means algorithms (Dhillon et al., 2004) and (Chitta et al., 2011) have been proposed. Also, many subspace clustering methods were proposed (Vidal, 2011), (Ma et al., 2008), (Ma et al., 2007), (Lu et al., 2012), (Elhamifar and Vidal, 2009) and (Bradley and Mangasarian, 2000). Commonly they consist of (i) subspace learning via matrix factorization and (ii) grouping of the data into clusters in the learned subspace. Some authors (Daitch et al., 2009) even include a graph regularization into the subspace clustering.

Discriminative Clustering. In (Xu et al., 2005), clustering with maximum margin constraints was proposed. The authors in (Bach and Harchaoui, 2008) proposed linear clustering based on a linear discriminative cost function with convex relaxation. In (Krause et al., 2010), regularized information maximization was proposed and simultaneous clustering and classifier training was performed. The above methods rely on kernels and incur high computational complexity.

Self-Supervision, Self-Organization and Auto-Encoders. In self-supervised learning (Doersch et al., 2015), (Pathak et al., 2016) the input data determine the labels. In self-organization (Kohonen, 1982), (Vesanto and Alhoniemi, 2000) a neighborhood function is used to preserve the topological properties of the input space. Both of the approaches leverage implicit discrimination using the data. The single layer auto-encoder (Baldi, 2011) and its denoising extension (Vincent et al., 2010) consider robustness to noise and reconstruction. While the idea is to encode and decode the data using a reconstruction loss, an explicit constraint that enforces discrimination is not addressed.

3 Assignment Based NT With Priors

In the following, we introduce our deterministic model and show how $\bm{\tau}_{c_{1}}$ and $\bm{\nu}_{c_{2}}$ are used in the NTs to produce the candidate representations and how we perform cluster and NT representation assignment. Given ${\bf A}$ and ${\bm{\theta}}$ , we model an assignment over $C_{d}C_{s}$ candidate nonlinear transforms:

[TABLE]

which are defined by the corresponding set of parameters:

[TABLE]

All of the candidate nonlinear transforms $\mathcal{T}_{\mathcal{P}_{c_{1},c_{2}}}$ in the set $\mathcal{T}_{T}$ share the linear map ${\bf A}$ and have distinct $\bm{\tau}_{c_{1}}$ and $\bm{\nu}_{c_{2}}$ . A single $\mathcal{T}_{\mathcal{P}_{c_{1},c_{2}}}$ from the set $\mathcal{T}_{T}$ is indexed using the index pair $\{c_{1},c_{2}\}$ or using the single index computed as $c=c_{2}+(c_{1}-1)C_{s}$ , $c_{1}\in\mathcal{C}_{d}$ and $c_{2}\in\mathcal{C}_{s}$ .
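The single-index rule $c=c_{2}+(c_{1}-1)C_{s}$ can be sketched as a pair of helper functions. This is a minimal illustration that assumes 1-based indices $c_{1}\in\{1,...,C_{d}\}$ and $c_{2}\in\{1,...,C_{s}\}$, which is what the formula suggests; the function names are hypothetical.

```python
def pair_to_index(c1, c2, C_s):
    """Map the 1-based index pair {c1, c2} to the single index c = c2 + (c1 - 1) * C_s."""
    return c2 + (c1 - 1) * C_s

def index_to_pair(c, C_s):
    """Invert pair_to_index: recover the 1-based pair (c1, c2) from the single index c."""
    c1 = (c - 1) // C_s + 1
    c2 = (c - 1) % C_s + 1
    return c1, c2

# Enumerate all C_d * C_s candidate transforms for a toy setting (C_d = 2, C_s = 3).
C_d, C_s = 2, 3
indices = [pair_to_index(c1, c2, C_s)
           for c1 in range(1, C_d + 1)
           for c2 in range(1, C_s + 1)]
```

Under this assumption the single index runs over $1,...,C_{d}C_{s}$ without gaps, so either indexing can be used to address a candidate NT.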

A compact description of our assignment model that takes into account $C_{d}C_{s}$ candidate NT representations while evaluating a parametric discrimination score is the following:

[TABLE]

where $\varrho(.)$ and $\varsigma(.)$ are measures, which will be explained in detail in the following subsection. A single candidate nonlinear transform model is defined as follows:

[TABLE]

and $\mathcal{T}_{\mathcal{P}_{c_{1},c_{2}}}({\bf x}_{i}):\Re^{N}\rightarrow\Re^{M}$ is the parametric candidate NT that produces the candidate NT representation ${\bf y}|_{\{c_{1},c_{2}\}}$ , by using the set of parameters $\mathcal{P}_{c_{1},c_{2}}=\{{\bf A},\bm{\tau}_{c_{1}},\bm{\nu}_{c_{2}}\}$ , while ${\bf z}|_{\{c_{1},c_{2}\}}\in\Re^{M}$ is the NT error vector.

3.1 The Probabilistic Assignment Based NT Model

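In probabilistic terms, we use the following model:

$p({\bf y}_{i}|{\bf x}_{i},{\bf A})=\int_{\bm{\theta}}p({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})d\bm{\theta}.$ (7)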

Furthermore, we use the Bayes’ rule, disregard the prior $p({\bf x}_{i}|{\bf A})$ and focus on the proportional relation, i.e.:

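$p({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})\propto p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})p({\bf y}_{i},\bm{\theta}|{\bf A}).$ (8)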

In our model, the probability $p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})$ takes into account the NT error and the discrimination parameter adjustment error, while $p({\bf y}_{i},\bm{\theta}|{\bf A})$ is the parametric discriminative prior, which we simplify by the following assumption:

$p(\bm{\theta},{\bf y}_{i}|{\bf A})=p(\bm{\theta},{\bf y}_{i}),$ (9)

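where we disregard the dependence on ${\bf A}$. We denote the discrimination related parameters as:

$\bm{\theta}=\{\bm{\theta}_{1},\bm{\theta}_{2}\}$, with the dissimilarity parameters $\bm{\theta}_{1}=\{\bm{\tau}_{1},...,\bm{\tau}_{C_{d}}\}$ and the similarity parameters $\bm{\theta}_{2}=\{\bm{\nu}_{1},...,\bm{\nu}_{C_{s}}\}$, (10)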

which are used in our min-max assignment over dissimilarity and similarity contributions w.r.t. all pairs $\bm{\tau}_{c_{1}}$ and $\bm{\nu}_{c_{2}}$ .

3.2 Assignment and Adjustment Modeling

We model the assignment based nonlinear transform together with the discrimination parameters using the r.h.s. of (8).

Using two measures, we define $p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})$ as follows:

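$p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})\propto\exp\left[-\frac{1}{\beta_{0}}u_{r}({\bf A}{\bf x}_{i},{\bf y}_{i})-\frac{1}{\beta_{a}}u_{a}({\bf A}{\bf x}_{i},\bm{\theta})\right],$ (11)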

where $u_{r}({\bf A}{\bf x}_{i},{\bf y}_{i})$ and $u_{a}({\bf A}{\bf x}_{i},\bm{\theta})$ take into account the NT and the discrimination parameter adjustment errors, respectively, while $\beta_{0}$ and $\beta_{a}$ are scaling parameters.

NT Error. Note that in (4), ${\bf y}_{i}$ is evaluated using an assignment over the candidate NT representations ${\bf y}|_{\{c_{1},c_{2}\}}$ that result from applying the nonlinear transform $\mathcal{T}_{\mathcal{P}_{c_{1},c_{2}}}({\bf x}_{i})$ on ${\bf A}{\bf x}_{i}$. Therefore, we can say that the term ${\bf z}|_{\{j_{1}(i),j_{2}(i)\}}={\bf A}{\bf x}_{i}-{\bf y}|_{\{j_{1}(i),j_{2}(i)\}}$, i.e., ${\bf z}_{i}={\bf A}{\bf x}_{i}-{\bf y}_{i}$, is the nonlinear transform error vector that represents the deviation of ${\bf A}{\bf x}_{i}$ from the targeted transform representation ${\bf y}_{i}$. In the simplest form, we assume ${\bf z}_{i}$ to be Gaussian distributed and we model:

$u_{r}({\bf A}{\bf x}_{i},{\bf y}_{i})=\|{\bf A}{\bf x}_{i}-{\bf y}_{i}\|_{2}^{2}.$ (12)

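NT Parameters Adjustment. We assume that the adjustment of any of the NT discrimination parameters is w.r.t. the following measure:

$u_{a}({\bf A}{\bf x}_{i},\bm{\theta})=\|{\bf A}{\bf x}_{i}-\bm{\tau}_{j_{1}(i)}-\bm{\nu}_{j_{2}(i)}\|_{2}^{2},$ (13)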

where $j_{1}(i)$ and $j_{2}(i)$ denote the indexes of the corresponding $\bm{\tau}_{j_{1}(i)}$ and $\bm{\nu}_{j_{2}(i)}$ that are related to the assigned ${\bf y}|_{\{j_{1}(i),j_{2}(i)\}}$ from the set of $C_{d}C_{s}$ candidate representations ${\bf y}|_{\{c_{1},c_{2}\}}$ under model (4) with the criteria (5).

With (13) and the discrimination parameter prior (that we will describe in the next subsection), we assume that the linear transform representation ${\bf A}{\bf x}_{i}$ decomposes into two distinct components, which respectively are related to the dissimilarity and similarity parameters $\bm{\tau}_{j_{1}(i)}$ and $\bm{\nu}_{j_{2}(i)}$ . The prior (13) is crucial for a proper adjustment between the linear mapping and the pairs $\{\bm{\tau}_{j_{1}(i)},\bm{\nu}_{j_{2}(i)}\}$ as well as for enabling the candidate NTs to discriminate in the transform domain based on their respective pair $\{\bm{\tau}_{c_{1}},\bm{\nu}_{c_{2}}\}$ .
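The NT error $u_{r}=\|{\bf A}{\bf x}_{i}-{\bf y}_{i}\|_{2}^{2}$ and the adjustment measure $u_{a}=\|{\bf A}{\bf x}_{i}-\bm{\tau}_{j_{1}(i)}-\bm{\nu}_{j_{2}(i)}\|_{2}^{2}$ can be evaluated directly. The sketch below assumes NumPy and assumes that $p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})$ is the exponential of the two scaled measures, so that its negative logarithm is their weighted sum up to an additive constant. The function names and the toy values are illustrative.

```python
import numpy as np

def u_r(A, x, y):
    """NT error: squared l2 deviation of A x from the representation y."""
    return float(np.sum((A @ x - y) ** 2))

def u_a(A, x, tau, nu):
    """Adjustment error: squared l2 deviation of A x from tau + nu."""
    return float(np.sum((A @ x - tau - nu) ** 2))

def neg_log_likelihood(A, x, y, tau, nu, beta_0, beta_a):
    """-log p(x | y, theta, A), up to an additive constant (assumed exponential form)."""
    return u_r(A, x, y) / beta_0 + u_a(A, x, tau, nu) / beta_a

# Toy values (hypothetical): identity map, one assigned pair (tau, nu).
A = np.eye(2)
x = np.array([1.0, 2.0])
y = np.array([1.0, 1.0])
tau = np.array([1.0, 0.0])
nu = np.array([0.0, 1.0])
nll = neg_log_likelihood(A, x, y, tau, nu, beta_0=2.0, beta_a=4.0)
```

The second term is small when ${\bf A}{\bf x}_{i}$ is close to the sum of its assigned pair, which is the decomposition into a dissimilarity and a similarity component described above.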

3.3 Priors Modeling

A prior $p({\bf A})$ is used to allow adequate regularization of the coherence and conditioning on the transform matrix ${\bf A}$ , whereas the joint modeling of the $C_{s}C_{d}$ NTs is enabled by using the prior $p(\bm{\theta},{\bf y}_{i})$ .

Minimum Information Loss Prior

By the term “minimum information loss” we mean that the linear map ${\bf A}$ approximately preserves the data properties in the transform space. To simplify, we assume that $p({\bf A}|{\bf X})=p({\bf A})$ and define our prior $p({\bf A})$ as $p({\bf A})\propto\exp(-f_{d}({\bf A}))$, where $f_{d}({\bf A})=(\frac{1}{\beta_{3}}\|{\bf A}\|_{F}^{2}+\frac{1}{\beta_{4}}\|{\bf A}{\bf A}^{T}-{\bf I}\|_{F}^{2}-\frac{1}{\beta_{5}}\log|\det{\bf A}^{T}{\bf A}|)$ (Ravishankar and Bresler, 2014) and (Kostadinov et al., 2018). Under this prior measure, we essentially express our notion of “information loss” by constraining the conditioning and the expected coherence of ${\bf A}$.
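The penalty $f_{d}({\bf A})$ can be evaluated term by term. The following is a minimal NumPy sketch under the definition given above; the function name and the values of $\beta_{3},\beta_{4},\beta_{5}$ are illustrative assumptions. The log-determinant is computed with `np.linalg.slogdet` for numerical stability, and `np.linalg.cond` returns the ratio of the largest to the smallest singular value.

```python
import numpy as np

def f_d(A, beta_3, beta_4, beta_5):
    """Penalty f_d(A) = ||A||_F^2 / b3 + ||A A^T - I||_F^2 / b4 - log|det(A^T A)| / b5."""
    M = A.shape[0]
    frob = np.sum(A ** 2)                              # ||A||_F^2
    ortho = np.sum((A @ A.T - np.eye(M)) ** 2)         # ||A A^T - I||_F^2
    _, log_abs_det = np.linalg.slogdet(A.T @ A)        # log |det(A^T A)|
    return float(frob / beta_3 + ortho / beta_4 - log_abs_det / beta_5)

# For the identity map only the Frobenius term is nonzero.
penalty_identity = f_d(np.eye(3), beta_3=2.0, beta_4=1.0, beta_5=1.0)

# Hypothetical overcomplete map (M = 3 > N = 2) and its condition number.
A_over = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
penalty_over = f_d(A_over, beta_3=2.0, beta_4=1.0, beta_5=1.0)
kappa = np.linalg.cond(A_over)                         # sigma_max / sigma_min
```

The log-determinant term grows without bound as ${\bf A}^{T}{\bf A}$ approaches singularity, which is how the penalty discourages badly conditioned maps.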

Discrimination Prior

We model a discrimination prior as:

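$p(\bm{\theta},{\bf y}_{i})\propto\exp\left[-\frac{1}{\beta_{d}}f_{c}(\bm{\theta},{\bf y}_{i})-\frac{1}{\beta_{E}}u_{p}(\bm{\theta})-\frac{1}{\beta_{1}}\|{\bf y}_{i}\|_{1}\right],$ (14)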

where $f_{c}(\bm{\theta},{\bf y}_{i})$ and $u_{p}(\bm{\theta})$ are measures related to the NT representation and the NT parameters, respectively, that have a discriminative role, while $\|{\bf y}_{i}\|_{1}$ is our sparsity measure and $\beta_{1},\beta_{d}$ and $\beta_{E}$ are scaling parameters.

$-$ Compositional Min-max Discrimination Measure. To define $f_{c}(\bm{\theta},{\bf y}_{i})$ we assume that:

(i) The relation between $\bm{\theta}$ and ${\bf y}_{i}$ is determined on the vector support intersection between ${\bf y}_{i}$, $\bm{\tau}_{c_{1}}$ and $\bm{\nu}_{c_{2}}$.

(ii) The min-max description is decomposable w.r.t. $\bm{\tau}_{c_{1}}$ and $\bm{\nu}_{c_{2}}$.

(iii) The support intersection relation is specified based on two measures defined on the support intersection.

We define the two measures $\varrho$ and $\varsigma$ as $\varrho({\bf y}_{i},{\bf y}_{j})=\|{\bf y}_{i}^{-}\odot{\bf y}_{j}^{-}\|_{1}+\|{\bf y}_{i}^{+}\odot{\bf y}_{j}^{+}\|_{1}$ and $\varsigma({\bf y}_{i},{\bf y}_{j})=\|{\bf y}_{i}\odot{\bf y}_{j}\|_{2}^{2}$, where ${\bf y}_{i}={\bf y}_{i}^{+}-{\bf y}_{i}^{-}$, ${\bf y}_{j}={\bf y}_{j}^{+}-{\bf y}_{j}^{-}$, ${\bf y}_{i}^{+}=\max({\bf y}_{i},{\bf 0})$ and ${\bf y}_{i}^{-}=\max(-{\bf y}_{i},{\bf 0})$. The measure $\varrho({\bf y}_{i},{\bf y}_{j})$ represents our similarity score (when ${\bf y}_{i}^{T}{\bf y}_{j}$ is considered, $\varrho({\bf y}_{i},{\bf y}_{j})$ captures the contribution for the similarity, whereas $\|{\bf y}_{i}^{+}\odot{\bf y}_{j}^{-}\|_{1}+\|{\bf y}_{i}^{-}\odot{\bf y}_{j}^{+}\|_{1}$ captures the contribution for the dissimilarity between the vectors ${\bf y}_{i}$ and ${\bf y}_{j}$). On the other hand, $\varsigma$ measures only the strength on the support intersection. We use these measures to allow a discrimination constraint without any explicit assumption about the space/manifold in the transform domain.
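The two measures translate directly into element-wise operations. The sketch below assumes NumPy; the function names `rho` and `varsigma` and the toy vectors are illustrative. It also checks numerically that the inner product splits into the similarity part $\varrho$ minus the dissimilarity part.

```python
import numpy as np

def rho(y_i, y_j):
    """Similarity score: l1 norm of the products of same-sign parts."""
    pos = np.maximum(y_i, 0.0) * np.maximum(y_j, 0.0)     # y_i^+ (.) y_j^+
    neg = np.maximum(-y_i, 0.0) * np.maximum(-y_j, 0.0)   # y_i^- (.) y_j^-
    return float(pos.sum() + neg.sum())                   # entries are nonnegative, so sum = l1 norm

def dissimilarity(y_i, y_j):
    """Opposite-sign contribution: ||y_i^+ (.) y_j^-||_1 + ||y_i^- (.) y_j^+||_1."""
    a = np.maximum(y_i, 0.0) * np.maximum(-y_j, 0.0)
    b = np.maximum(-y_i, 0.0) * np.maximum(y_j, 0.0)
    return float(a.sum() + b.sum())

def varsigma(y_i, y_j):
    """Strength on the support intersection: ||y_i (.) y_j||_2^2."""
    return float(np.sum((y_i * y_j) ** 2))

y_i = np.array([1.0, -2.0, 0.0, 3.0])
y_j = np.array([2.0, -1.0, 5.0, -1.0])
similarity_part = rho(y_i, y_j)
dissimilarity_part = dissimilarity(y_i, y_j)
strength = varsigma(y_i, y_j)
```

Both measures vanish wherever the supports of the two vectors do not intersect, so they depend only on the shared nonzero entries.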

Based on the above assumptions (i), (ii) and (iii), $f_{c}({\bf y}_{i},\bm{\theta})$ is defined as follows:

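$f_{c}({\bf y}_{i},\bm{\theta})=\min_{c_{1}\in\mathcal{C}_{d}}\left[\frac{\varrho({\bf y}_{i},\bm{\tau}_{c_{1}})}{\max_{c_{2}\in\mathcal{C}_{s}}\varrho({\bf y}_{i},\bm{\nu}_{c_{2}})}+\varsigma({\bf y}_{i},\bm{\tau}_{c_{1}})\right].$ (15)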

The measure (15) ensures that ${\bf y}_{i}$ in the transform domain will be located at the point where:

(a) The similarity contribution w.r.t. $\bm{\tau}_{c_{1}}$ is the smallest measured w.r.t. $\varrho(.)$.

(b) The strength of the support intersection w.r.t. $\bm{\tau}_{c_{1}}$ is the smallest measured w.r.t. $\varsigma(.)$.

(c) The similarity contribution w.r.t. $\bm{\nu}_{c_{2}}$ is the largest measured w.r.t. $\varrho(.)$, i.e., smallest w.r.t. $\frac{1}{\varrho(.)}$.
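The min-max measure can be sketched as a minimum over the dissimilarity parameters of a ratio whose denominator is a maximum over the similarity parameters, i.e., $f_{c}({\bf y}_{i},\bm{\theta})=\min_{c_{1}}[\varrho({\bf y}_{i},\bm{\tau}_{c_{1}})/\max_{c_{2}}\varrho({\bf y}_{i},\bm{\nu}_{c_{2}})+\varsigma({\bf y}_{i},\bm{\tau}_{c_{1}})]$. The code below assumes NumPy; the function names, the toy parameters and the small constant `eps`, which guards against a zero denominator, are assumptions of this illustration and are not part of the definition.

```python
import numpy as np

def rho(y_i, y_j):
    """Similarity score on same-sign entries."""
    pos = np.maximum(y_i, 0.0) * np.maximum(y_j, 0.0)
    neg = np.maximum(-y_i, 0.0) * np.maximum(-y_j, 0.0)
    return float(pos.sum() + neg.sum())

def varsigma(y_i, y_j):
    """Squared l2 strength on the support intersection."""
    return float(np.sum((y_i * y_j) ** 2))

def f_c(y, taus, nus, eps=1e-12):
    """Min over tau of [rho(y, tau) / max over nu of rho(y, nu) + varsigma(y, tau)]."""
    max_similarity = max(rho(y, nu) for nu in nus)        # largest similarity to any nu
    return min(rho(y, tau) / (max_similarity + eps) + varsigma(y, tau)
               for tau in taus)

# Toy parameters (hypothetical): two dissimilarity and two similarity vectors.
y = np.array([1.0, -1.0])
taus = [np.array([1.0, 1.0]), np.array([-1.0, 1.0])]
nus = [np.array([2.0, -2.0]), np.array([0.0, 1.0])]
value = f_c(y, taus, nus)
```

The value is small when ${\bf y}$ shares little same-sign mass with the best dissimilarity vector and a lot with at least one similarity vector, matching properties (a) to (c).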

$-$ Discrimination Parameters Prior Measure. The measure $u_{p}(\bm{\theta})$ is defined as:

[TABLE]

The advantage of using (16) is that: (i) it allows a non-uniform cover of the transform space in an arbitrarily coarse or dense way, (ii) it makes it possible to represent a wide range of transform space regions, including non-continuous, continuous and overlapping regions and (iii) at the same time it enables $\bm{\theta}$ to be described and concentrated on the most important part of the transform space related to discrimination.

4 Problem Formulation and Learning Algorithm

Minimizing the exact negative logarithm of our learning model $p({\bf Y},{\bf A}|{\bf X})=p({\bf Y}|{\bf X},{\bf A})p({\bf A}|{\bf X})=\prod_{i=1}^{CK}\left[\int_{\bm{\theta}}p({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})d\bm{\theta}\right]p({\bf A}|{\bf x}_{i})$ over ${\bf Y},\bm{\theta}$ and ${\bf A}$ is difficult, since we have to integrate in order to compute the marginal and the partition function of the prior (14).

4.1 Problem Formulation

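Instead of minimizing the exact negative logarithm of the marginal $\int_{\bm{\theta}_{est}}p({\bf y}_{i},\bm{\theta}_{est}|{\bf x}_{i},{\bf A})d\bm{\theta}_{est}$, we consider minimizing the negative logarithm of its maximum point-wise estimate, i.e., $\int_{\bm{\theta}_{est}}p({\bf y}_{i},\bm{\theta}_{est}|{\bf x}_{i},{\bf A})d\bm{\theta}_{est}\leq Dp({\bf y}_{i},\bm{\theta}|{\bf x}_{i},{\bf A})$, where we assume that $\bm{\theta}$ are the parameters for which $p({\bf y}_{i},\bm{\theta}_{est}|{\bf x}_{i},{\bf A})$ has the maximum value and $D$ is a constant. Furthermore, we use the proportional relation (8) and, by disregarding the partition function related to the prior (14), we end up with the following problem formulation:

$\{\hat{\bf Y},\hat{\bf A},\hat{\bm{\theta}}\}=\arg\min_{{\bf Y},{\bf A},\bm{\theta}}\sum_{i=1}^{CK}\underbrace{\frac{1}{2}\|{\bf A}{\bf x}_{i}-{\bf y}_{i}\|_{2}^{2}+\lambda_{2}u_{a}({\bf A}{\bf x}_{i},\bm{\theta})}_{-\log p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})}+\underbrace{\lambda_{0}f_{c}({\bf y}_{i},\bm{\theta})+\lambda_{1}\|{\bf y}_{i}\|_{1}+\lambda_{E}u_{p}(\bm{\theta})}_{-\log p({\bf y}_{i},\bm{\theta})}+\underbrace{f_{d}({\bf A})}_{-\log p({\bf A})},$ (17)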

where $\{2,\lambda_{0},\lambda_{1},\lambda_{2},\lambda_{E}\}$ are parameters inversely proportional to $\{\beta_{0},\beta_{d},\beta_{1},\beta_{a},\beta_{E}\}$ .

4.2 The Learning Algorithm

Note that solving (17) jointly over ${\bf A}$ , $\bm{\theta}$ and ${\bf Y}$ is again challenging. Alternatively, the solution of (17) per any of the variables ${\bf A},\bm{\theta}$ and ${\bf Y}$ can be seen as an integrated marginal maximization (IMM) of $p({\bf Y},{\bf A}|{\bf X})=p({\bf Y}|{\bf X},{\bf A})p({\bf A}|{\bf X})$ that is approximated by $\prod_{i=1}^{CK}p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})p({\bf y}_{i},\bm{\theta})p({\bf A}|{\bf x}_{i})$ , which is equivalent to:

1) Approximately maximizing with $p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})$ and the prior $p(\bm{\theta},{\bf y}_{i})=p(\bm{\theta}|{\bf y}_{i})p({\bf y}_{i})$ over ${\bf y}_{i}$ .

2) Approximately maximizing with $\prod_{i=1}^{CK}p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})$ and the prior $p(\bm{\theta},{\bf y}_{i})=p({\bf y}_{i}|\bm{\theta})p(\bm{\theta})$ over $\bm{\theta}$ .

3) Approximately maximizing with $\prod_{i=1}^{CK}p({\bf x}_{i}|{\bf y}_{i},\bm{\theta},{\bf A})$ and the prior $p({\bf A})=p({\bf A}|{\bf x}_{i})$ over ${\bf A}$ .

In this sense, based on the IMM principle, we propose an iterative, alternating algorithm that has three stages: (i) cluster and NT representation ${\bf y}_{i}$ assignment, (ii) discrimination parameters ${\bm{\theta}}$ update and (iii) linear map ${\bf A}$ update.
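The control flow of the three stages can be sketched as a plain alternating loop. This is only a structural skeleton: `assign_step`, `theta_step` and `map_step` are hypothetical caller-supplied callables standing in for Stage 1, Stage 2 and Stage 3, and none of them implements the solutions derived in this section.

```python
def alternating_learning(X, A, theta, assign_step, theta_step,
                         map_step=None, n_iter=100):
    """Hypothetical skeleton of a three-stage alternating loop.

    The three callables are placeholders for the per-stage solvers;
    they are assumptions of this sketch, not the paper's solutions.
    """
    Y = None
    for _ in range(n_iter):
        # Stage 1: cluster and NT representation assignment.
        Y = assign_step(X, A, theta)
        # Stage 2: discrimination parameters update.
        theta = theta_step(X, A, Y, theta)
        # Stage 3: linear map update (skipped when no map_step is given,
        # e.g. when the linear map is kept fixed).
        if map_step is not None:
            A = map_step(X, Y, theta, A)
    return Y, theta, A
```

Passing `map_step=None` keeps the linear map fixed, so the loop alternates only between the first two stages.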

Stage 1: Cluster and NT Representation Assignment

Given the data samples ${\bf X}$ , the current estimate of ${\bf A}$ , $\bm{\theta}$ and ${\bf Q}={\bf A}{\bf X}=[{\bf q}_{1},...,{\bf q}_{CK}]$ (note that if ${\bf A}={\bf I}$ then ${\bf Q}={\bf X}$ ), the NT representations estimation problem is formulated as:

[TABLE]

where $\{\lambda_{0},\lambda_{1}\}$ are inversely proportional to the scaling parameters $\{\beta_{0},\beta_{1}\}$ . Furthermore, given ${\bf q}_{i}$ , and $\bm{\theta}$ , for any ${\bf y}_{i}$ , (18) reduces to a constrained projection problem:

[TABLE]

where we derived the last expression by moving the minimization outwards from $f_{c}({\bf y}_{i},\bm{\theta})\!\!=\!\!\min_{\begin{smallmatrix}c_{1}\in\mathcal{C}_{d}\\ c_{2}\in\mathcal{C}_{s}\end{smallmatrix}}$ $\left[\right.\frac{\varrho({\bf y}_{i},\bm{\tau}_{c_{1}})}{\varrho({\bf y}_{i},\bm{\nu}_{c_{2}})}+\varsigma({\bf y}_{i},\bm{\tau}_{c_{1}})\left.\right]$ , while $|{\bf y}_{i}|$ is a vector whose elements are the absolute values of the elements in ${\bf y}_{i}$ , thus ${\bf 1}^{T}|{\bf y}|=$ $\|{\bf y}\|_{1}$ .

In the following, we propose a solution to (19), which consists of two steps: (i) candidate NT representations estimation and (ii) cluster index and representation assignment.

$-$ * Candidate NT Representations Estimation * Assuming that per each pair $\{\bm{\tau}_{c_{1}},\bm{\nu}_{c_{2}}\}$ , $\{c_{1},c_{2}\}\in\{\mathcal{C}_{d}\times\mathcal{C}_{s}\}$ , $\varrho(.,\bm{\nu}_{c_{2}})\neq 0$ , then the problem related to candidate NT representation estimation considers only the cost $\left[l_{1}(c_{1},c_{2})+\lambda_{0}s_{P}(c_{1},c_{2})\right]$ from (19) and is defined as:

[TABLE]

The closed form solution to (20) is:

[TABLE]

where $\odot$ and $\oslash$ are the Hadamard product and division, while:

[TABLE]

The variable $e_{c_{1},c_{2}}$ is defined as:

[TABLE]

where $c_{s}=0$ , ${\bf v}_{k,c_{2}}={\bf v}_{c_{2}}\oslash{\bf k}_{c_{1}},{\bf g}_{k,c_{1}}={\bf g}_{c_{1}}\oslash{\bf k}_{c_{1}},|{\bf q}_{k,i}|=|{\bf q}_{i}|\oslash{\bf k}_{c_{1}}$ and $h_{c_{1},c_{2}}$ is a solution to a quartic polynomial (Appendix A).
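Expressions of this kind are built from element-wise operations. The sketch below illustrates only the generic pattern of an element-wise magnitude thresholding followed by an element-wise (Hadamard) division. The vectors `t` and `k` are hypothetical stand-ins for a thresholding and a normalization vector, and the function is not the exact expression (21), which additionally depends on $e_{c_{1},c_{2}}$ and $h_{c_{1},c_{2}}$ .

```python
import numpy as np

def threshold_and_normalize(q, t, k):
    """Generic pattern: sign(q) * max(|q| - t, 0) / k, all element-wise.

    q: vector to be transformed
    t: non-negative thresholding vector (assumption of this sketch)
    k: non-zero normalization vector (assumption of this sketch)
    """
    q = np.asarray(q, dtype=float)
    # Shrink magnitudes by t and clip at zero, which produces sparsity.
    shrunk = np.maximum(np.abs(q) - t, 0.0)
    # The Hadamard product restores the signs; the Hadamard division rescales.
    return np.sign(q) * shrunk / k
```

Entries of `q` whose magnitude does not exceed the corresponding threshold are set to zero, which is how this type of operation yields a sparse representation.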

$-$ * Assignment *

This step consists of two parts.

Part $1$

Given all ${\bf y}|_{\{{c_{1}},{c_{2}}\}}$ , $\{c_{1},c_{2}\}\in\mathcal{C}_{d}\times\mathcal{C}_{s}$ , the first part evaluates a score related to $f_{c}({\bf y}_{i},\bm{\theta})$ as follows:

[TABLE]

Part $2$

In the second part, we assume that the costs $l_{1}(c_{1},c_{2})$ in the respective subproblems (20) across all of the estimated ${\bf y}|_{\{c_{1},c_{2}\}}$ are approximately equal, i.e.:

[TABLE]

which is a reasonable assumption when the sparsity level $\lambda_{1}$ is the same for all ${\bf y}|_{\{c_{1},c_{2}\}},c_{1}\in\mathcal{C}_{d},c_{2}\in\mathcal{C}_{s}$ . Therefore, we disregard $l_{1}(c_{1},c_{2})$ and based on the score (24), we assign the cluster index and the NT representation ${\bf y}_{i}$ as follows:

[TABLE]

where the evaluation w.r.t. $f_{c}({\bf y}_{i},\bm{\theta})$ reduces to computing a minimum score over $s_{P}(.)$ as in (26).
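The selection of the minimum score over all candidate pairs can be sketched as follows. The array `scores` is a hypothetical container holding one already computed score per pair $\{c_{1},c_{2}\}$ ; how each score is computed is given by (24) and is not reproduced here.

```python
import numpy as np

def assign_min_score(scores):
    """Return the index pair (j1, j2) with the smallest score.

    scores[c1, c2] is assumed to hold the score of the candidate
    representation obtained with the pair (tau_c1, nu_c2).
    """
    scores = np.asarray(scores, dtype=float)
    # argmin works on the flattened array; unravel_index maps it back to 2-D.
    j1, j2 = np.unravel_index(np.argmin(scores), scores.shape)
    return int(j1), int(j2)
```

The returned pair identifies which candidate representation ${\bf y}|_{\{c_{1},c_{2}\}}$ is kept as ${\bf y}_{i}$ .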

Stage 2: Parameters $\bm{\theta}$ Update

Given the estimated NT representations ${\bf Y}=[{\bf y}_{1},..,{\bf y}_{CK}]$ , the linear map ${\bf A}$ and ${\bf Q}={\bf A}{\bf X}$ , the problem related to the update of the parameters $\bm{\theta}$ reduces to the following form:

[TABLE]

where $\lambda_{E}$ is inversely proportional to $\beta_{E}$ and $u_{p}(\bm{\theta})$ is the measure described in section 3.3. Note that in the cluster and NT representation assignment step (Stage 1, part 2 of our algorithm), for each ${\bf y}_{i}$ the corresponding $\bm{\tau}_{c_{1}}$ and $\bm{\nu}_{c_{2}}$ are known $(A_{ss}):\{{\bf y}_{i},\{{\bf y}|_{\{j_{1}(i),j_{2}(i)\}},\bm{\tau}_{j_{1}(i)},\bm{\nu}_{j_{2}(i)}\}\}$ . Therefore, at this stage, we do not evaluate the terms $f_{c}({\bf y}_{i},\bm{\theta})$ w.r.t. $\bm{\theta}$ . Instead, we use the already evaluated scores based on the assignment w.r.t. ${\bf y}_{i}$ .

In the following, we present the problems related to update of the parameters $\bm{\tau}_{c_{1}}$ and $\bm{\nu}_{c_{2}}$ and comment on the solutions, which represent a slight extension to the previous one.

$-$ * Update Per Single $\bm{\tau}_{c_{1}}$ * Given ${\bf Q}={\bf A}{\bf X}$ , ${\bf Y}$ , ${\bm{\theta}_{\setminus c_{1}}}$ and using $(A_{ss})$ , problem (27), per ${\bm{\tau}_{c_{1}}}$ reduces to:

[TABLE]

The solution for (28) is similar to the solution given by (21), (24) and (26). The difference from (21) lies in the candidate NT estimation part, where both the thresholding and the normalization vectors have additional terms (we give the exact expression and proof in Appendix B.1).

$-$ * Update Per Single $\bm{\nu}_{c_{2}}$ * Given ${\bf Q}={\bf A}{\bf X}$ , ${\bf Y}$ , $\bm{\theta}_{\setminus c_{2}}$ and using $(A_{ss})$ , problem (27), per ${\bm{\nu}_{c_{2}}}$ reduces to:

[TABLE]

In this update, (29) is solved iteratively, where per each iteration the solution for the candidate NT representation is similar to the solution for (21), but $\bm{\nu}_{c_{2}}$ is estimated using different thresholding and normalization vectors (for the exact expression and proof please see Appendix B.2).

Stage 3: Linear Map ${\bf A}$ Update

Given the data samples ${\bf X}$ , the corresponding transform representations ${\bf Y}$ and the discrimination parameters $\bm{\theta}$ , the problem related to the estimation of the linear map ${\bf A}$ , reduces to:

[TABLE]

where $\|{\bf A}{\bf X}-{\bf Y}_{T}\|_{F}^{2}=\frac{1}{2}\|{\bf A}{\bf X}-{\bf Y}\|_{F}^{2}+\frac{1}{2}\sum_{i=1}^{CK}\|{\bf A}{\bf x}_{i}-\bm{\tau}_{j_{1}(i)}-{\bm{\nu}_{j_{2}(i)}}\|_{2}^{2}$ and ${\bf Y}_{T}$ is expressed as:

[TABLE]

while $\bm{\tau}_{j_{1}(i)}$ and $\bm{\nu}_{j_{2}(i)}$ denote the corresponding $\bm{\tau}_{c_{1}}$ and $\bm{\nu}_{c_{2}}$ that appear in the NT, which is used to estimate ${\bf y}_{i}$ , $\forall i\in\{1,...,CK\}$ and the parameters $\{\lambda_{2},\lambda_{3},\lambda_{4}\}$ are inversely proportional to the scaling parameters $\{\beta_{3},\beta_{4},\beta_{5}\}$ . We solve (30) using an approximate closed form solution as proposed in (Kostadinov et al., 2018).

We point out that when ${\bf A}={\bf I}$ , then this stage is omitted and our learning algorithm reduces to alternating between cluster and representation ${\bf Y}$ assignment and update of $\bm{\theta}$ .

5 Approach Evaluation

This section evaluates the advantages and the potential of the proposed algorithm and compares its clustering performance to the state-of-the-art methods.

5.1 Data, Setup and Measures

Data Sets

The data sets used are E-YALE-B (Georghiades et al., 2001), AR (Martínez and Benavente, 1998), ORL (Samaria et al., 1994) and COIL (Nene et al., 1996). All the images from the respective data sets were downscaled to resolutions $21\times 21$ , $32\times 28$ , $24\times 24$ and $20\times 25$ , respectively, and normalized to unit variance.

Algorithm and Clustering Set Up

The used setup is described in the following text.

$-$ * On-Line Version* An on-line variant is used for the update of ${\bf A}$ w.r.t. a subset of the available training set. It has the following form ${\bf A}^{t+1}={\bf A}^{t}-\rho({\bf A}^{t}-\hat{{\bf A}})$ where $\hat{{\bf A}}$ and ${\bf A}^{t}$ are the solutions in the transform update step at iterations $t+1$ and $t$ , which is equivalent to having the additional constraint ${\|{\bf A}^{t}-\hat{{\bf A}}\|_{F}^{2}}$ in the related problem. The batch size used is equal to $87\%,85\%,90\%$ and $87\%$ of the total amount of the available training data from the respective datasets E-YALE-B, AR, ORL and COIL.
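The on-line update rule above can be sketched in a few lines. The function name and the argument names are illustrative assumptions; `A_hat` stands for the solution of the transform update step on the current batch, which is computed elsewhere.

```python
import numpy as np

def online_map_update(A_t, A_hat, rho):
    """One on-line step: A^{t+1} = A^t - rho * (A^t - A_hat)."""
    A_t = np.asarray(A_t, dtype=float)
    A_hat = np.asarray(A_hat, dtype=float)
    # rho = 0 keeps the previous map, rho = 1 jumps to the batch solution.
    return A_t - rho * (A_t - A_hat)
```

For $0<\rho<1$ the result is a convex combination $(1-\rho){\bf A}^{t}+\rho\hat{{\bf A}}$ of the previous map and the new batch solution.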

$-$ * Clustering Setup, Cluster Index and NT Estimation * We assume that the number of clusters $C$ per database is known. We set the number of parameters that are related to dissimilarity $\bm{\tau}_{c_{1}},c_{1}\in\{1,...,C_{d}\}$ to be close to the number of actual clusters $C$ , i.e., $C_{d}=C$ and we set the number of parameters $\bm{\nu}_{c_{2}},c_{2}\in\{1,...,C_{s}\}$ related to similarity to be small, i.e., $C_{s}$ is small. The cluster index $c$ and the NT are estimated based on the minimum score of the discriminative functional measure as explained in Section 4.2. As an evaluation metric for the clustering performance we use the cluster accuracy (CA) and the normalized mutual information (NMI) (Cai et al., 2011).
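For reference, the cluster accuracy can be sketched as the best agreement between predicted cluster indices and ground-truth labels over all one-to-one relabelings. The version below finds the best relabeling by brute force over permutations, which is an assumption made to keep the example dependency-free and is practical only for a handful of clusters; an assignment solver would normally replace the permutation loop.

```python
from itertools import permutations

def cluster_accuracy(true_labels, pred_labels):
    """Best-match accuracy between predicted clusters and true labels.

    Brute force over one-to-one mappings; suitable for small examples only.
    """
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so that surplus predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best_hits = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)
```

Because the score is maximized over relabelings, it does not depend on how the clusters happen to be numbered.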

$-$ * Algorithm Set-up* The parameters are set to $\lambda_{0}=\lambda_{1}=0.03$ , $\lambda_{E}=0.001$ and $\lambda_{2}=\lambda_{3}=\lambda_{4}=16$ , and the transform dimension is $M=2100$ . The algorithm is initialized with ${\bf A}$ and $\bm{\theta}$ having i.i.d. Gaussian (zero mean, unit variance) entries and is terminated after the $100$ th iteration. The results are obtained as the average of $5$ runs.

5.2 Numerical Experiments

Summary

Our experiments consist of three parts.

$-$ * NT Properties* In the first series of the experiments, we investigate the properties of the proposed algorithm. We measure the run time $t$ of the proposed algorithm, the conditioning number ${\kappa}_{n}({\bf A})=\frac{\sigma_{max}}{\sigma_{min}}$ ( $\sigma_{min}$ and $\sigma_{max}$ are the smallest and the largest singular values of ${\bf A}$ , respectively) and the expected mutual coherence $\mu({\bf A})$ as in (Kostadinov et al., 2018) of the shared linear map ${\bf A}$ in the learned NTs.
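The two quantities measured on ${\bf A}$ can be sketched as below. The conditioning number follows the ratio $\frac{\sigma_{max}}{\sigma_{min}}$ given above. The coherence function is an assumption of this sketch: it reports the maximum and the mean absolute normalized inner product between distinct rows of ${\bf A}$ , whereas the exact definition of the expected mutual coherence is the one in (Kostadinov et al., 2018) and is not reproduced here.

```python
import numpy as np

def condition_number(A):
    """Ratio of the largest to the smallest singular value of A."""
    s = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    return s.max() / s.min()

def row_coherence(A):
    """Max and mean |<a_i, a_j>| over distinct unit-normalized rows of A.

    Illustrative stand-in only; rows are assumed to be non-zero.
    """
    A = np.asarray(A, dtype=float)
    R = A / np.linalg.norm(A, axis=1, keepdims=True)
    G = np.abs(R @ R.T)
    # Keep only the off-diagonal entries of the Gram matrix.
    off_diag = G[~np.eye(G.shape[0], dtype=bool)]
    return off_diag.max(), off_diag.mean()
```

A conditioning number close to $1$ and coherence values close to $0$ indicate a well-conditioned map with nearly orthogonal rows.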

$-$ * Clustering and k-NN Classification Performance* In the second part, we measure the performance across all databases and report the CA and NMI. We also split every database into a training and a test set and learn NTs with the proposed algorithm on the training set. We use the learned NTs to assign a representation to the test data and then perform a k-NN (Cover and Thomas, 2006) search using the test NT representations on the training NT representations.
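The k-NN step on the NT representations can be sketched as follows. Several choices here are assumptions of the sketch rather than details stated in the text: the Euclidean distance, the majority vote, the value of `k`, and the convention that each row (not column) holds one representation.

```python
import numpy as np

def knn_predict(train_Y, train_labels, test_Y, k=1):
    """Label each test representation by a majority vote of its k nearest
    training representations (squared Euclidean distance, rows = samples)."""
    train_Y = np.asarray(train_Y, dtype=float)
    test_Y = np.asarray(test_Y, dtype=float)
    train_labels = np.asarray(train_labels)
    # Pairwise squared distances, shape (n_test, n_train).
    diff = test_Y[:, None, :] - train_Y[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    preds = []
    for row in nearest:
        values, counts = np.unique(train_labels[row], return_counts=True)
        preds.append(values[np.argmax(counts)])  # ties go to the smallest label
    return np.array(preds)
```

The same function applied to the original data instead of the NT representations gives the kind of baseline against which such a representation can be compared.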

$-$ * Proposed Method vs State-Of-The-Art* This part compares the proposed method w.r.t. results reported by five state-of-the-art methods: CASS (Lu et al., 2013), GSC (Zheng et al., 2011), NSLRR (Yin et al., 2016), SDRAM (Guo, 2015) and RGRSC (Kodirov et al., 2016).

Evaluation Results

We show the results in Tables 1, 2, 3, 4 and 5.

$-$ NT Properties As shown in Table 1, the learned NTs for all the data sets have relatively low computational time per iteration. All linear maps in the NTs have good conditioning numbers and low expected coherence.

$-$ Clustering Performance The results of the clustering performance over the databases E-YALE-B (Georghiades et al., 2001), AR (Martínez and Benavente, 1998), ORL (Samaria et al., 1994) and COIL (Nene et al., 1996) are shown in Table 2. We see that both the CA and the NMI measures have high values. The highest performance is reported on the E-YALE-B (Georghiades et al., 2001) database, where the CA and NMI are $96.8\%$ and $95.3\%$ , respectively.

$-$ Proposed vs State-Of-The-Art Clustering The results are shown in Tables 3 and 4. As we see, the proposed algorithm outperforms the state-of-the-art methods CASS (Lu et al., 2013), GSC (Zheng et al., 2011), NSLRR (Yin et al., 2016), SDRAM (Guo, 2015) and RGRSC (Kodirov et al., 2016). The highest gains in CA and NMI w.r.t. the state-of-the-art are $1.6\%$ and $1.9\%$ , achieved on the E-YALE-B (Georghiades et al., 2001) and the COIL (Nene et al., 1996) databases, respectively.

$-$ k-NN Classification Performance The results of the k-NN performance on all databases are shown in Table 5. As a baseline we use k-NN on the original data and report improvements of $3.1\%$ , $2.4\%$ , $3.3\%$ and $4.4\%$ over the baseline results for the respective databases.

6 Conclusion

In this paper, we modeled assignment-based NTs with priors. A novel clustering concept was introduced where we (i) jointly learn the NTs with priors and (ii) assign the cluster and the NT representation based on maximum likelihood over a functional measure. Given the observed data, an empirical approximation to the maximum likelihood of the model gives the corresponding problem formulation. We proposed an efficient solution for learning the model parameters by a low complexity iterative alternating algorithm.

The proposed algorithm was evaluated on publicly available databases. The preliminary results showed promising performance. In a clustering regime w.r.t. the used CA and NMI measures, the algorithm gives improvements compared to the state-of-the-art methods. In unsupervised k-NN classification regime, it demonstrated high classification accuracy.

Bibliography


Aharon et al. (2006) M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Trans. Sig. Proc., 54(11):4311–4322, November 2006.
Bach and Harchaoui (2008) Francis R Bach and Zaïd Harchaoui. Diffrac: a discriminative and flexible framework for clustering. In Advances in Neural Information Processing Systems, pages 49–56, 2008.
Baldi (2011) Pierre Baldi. Autoencoders, unsupervised learning and deep architectures. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop - Volume 27, UTLW’11, pages 37–50. JMLR.org, 2011.
Bradley and Mangasarian (2000) Paul S Bradley and Olvi L Mangasarian. K-plane clustering. Journal of Global Optimization, 16(1):23–32, 2000.
Cai et al. (2011) Deng Cai, Xiaofei He, Jiawei Han, and Thomas S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1548–1560, August 2011.
Cai et al. (2014) Sijia Cai, Wangmeng Zuo, Lei Zhang, Xiangchu Feng, and Ping Wang. Support vector guided dictionary learning. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV, pages 624–639, 2014.
Child (2006) D. Child. The Essentials of Factor Analysis. Bloomsbury Academic, 2006.
Chitta et al. (2011) Radha Chitta, Rong Jin, Timothy C. Havens, and Anil K. Jain. Approximate kernel k-means: Solution to large scale kernel clustering. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, pages 895–903, New York, NY, USA, 2011. ACM.