Clustering with Jointly Learned Nonlinear Transforms Over Discriminating Min-Max Similarity/Dissimilarity Assignment
Dimche Kostadinov, Behrooz Razeghi, Taras Holotyak, Slava Voloshynovskiy

TL;DR
This paper introduces a novel clustering method that jointly learns nonlinear transforms with priors, using a min-max measure for discriminative assignment, demonstrating improved performance on image clustering tasks.
Contribution
It proposes a new clustering framework based on jointly learned nonlinear transforms and a min-max discriminative measure, enhancing clustering accuracy.
Findings
Outperforms state-of-the-art clustering methods on image datasets
Demonstrates the effectiveness of jointly learned nonlinear transforms
Validates the approach through numerical experiments
Abstract
This paper presents a novel clustering concept that is based on jointly learned nonlinear transforms (NTs) with priors on the information loss and the discrimination. We introduce a clustering principle that is based on the evaluation of a parametric min-max measure for the discriminative prior. The decomposition of the prior measure allows us to break down the assignment into two steps. In the first step, we apply NTs to a data point in order to produce candidate NT representations. In the second step, we perform the actual assignment by evaluating the parametric measure over the candidate NT representations. Numerical experiments on an image clustering task validate the potential of the proposed approach. The evaluation shows advantages in comparison to the state-of-the-art clustering methods.
| COIL | ORL | E-YALE-B | AR |
|---|---|---|---|
| 16, .2e-5, 46 | 21, .3e-5, 48 | 31, .1e-5, 51 | 28, .3e-5, 69 |

| | COIL | ORL | E-YALE-B | AR |
|---|---|---|---|---|
| CA | | | | |
| NMI | | | | |

| | COIL | ORL | E-YALE-B | AR |
|---|---|---|---|---|
| acc. NT | | | | |
| acc. OD | | | | |
Machine Learning, ICML
1 Introduction
Clustering is one of the most important unsupervised learning tasks in the areas of signal processing, machine learning, computer vision and artificial intelligence, and it has been extensively studied for decades. Commonly, data clustering algorithms (Cover and Thomas, 2006), (Hoyer and Dayan, 2004), (Guo et al., 2012), (Jiang et al., 2013), (Cai et al., 2014), (Shekhar et al., 2014), (Xu et al., 2005), (Bach and Harchaoui, 2008) and (Krause et al., 2010) address the problem of identifying and describing the underlying clusters that explain the data.
Among the various types of clustering algorithms, the k-means and matrix decomposition based methods are among the most popular and practically useful approaches. Given a data set, in the most common case, the objective of a clustering algorithm is to minimize the intra-cluster cost, i.e., the dissimilarity measured between a cluster and the data points assigned to it, and to maximize the inter-cluster cost, i.e., the dissimilarity measured between a cluster and the data points that do not belong to it. A data factorization/decomposition model (Cover and Thomas, 2006) (Hoyer and Dayan, 2004) (Krause et al., 2010) (Vidal, 2011) with constraints summarizes a general problem formulation that also subsumes the previously explained basic case. We express it in the following:
[TABLE]
where are the clusters, is the -th data point, is its data representation over the clusters, are parameters responsible for a task-specific functionality, is the similarity measure between the data point and the representation over the clusters, and are the task-specific and sparsity penalty functions, respectively, is a penalty on the cluster properties and are Lagrangian parameters. The cluster assignment in (1) is based on the synthesis model (Aharon et al., 2006), (R. et al., 2013), where usually is reconstructed and represented by a sparse linear combination over the clusters as . In essence, the crucial elements behind this clustering principle are the measure used for similarity and the penalty functions , and , which play a significant role in the estimation of the cluster vectors and impact the resulting cluster assignment. Due to the used model, the solution to (1) might not only have high computational complexity, but there might also be difficulties in modeling and imposing constraints (, and in (1)) that are required in order to preserve specific data properties, like structured sparsity (Hoyer and Dayan, 2004), pairwise constraints (Shekhar et al., 2014), data subspace (Elhamifar and Vidal, 2009), (Vidal, 2011), (Lu et al., 2012), graph structure and manifold curvature (Krause et al., 2010) and (Daitch et al., 2009).
On the other hand, besides the synthesis model, the other two commonly used models in the area of signal processing are the analysis model (Rubinstein et al., 2013) and the sparsifying transform model (Rubinstein and Elad, 2014) and (Ravishankar and Bresler, 2014). In the transform model, is a nonlinear transform (NT) representation that is estimated using a linear mapping , with map , which is then followed by an element-wise nonlinearity, and it represents a solution to a constrained projection problem. Under this model, the computational complexity for estimating the representation is low but, so far, the model has been neither addressed nor considered as a basis for clustering or for learning discriminative NT representations. In addition, despite the fact that an NT model offers a high degree of freedom for modeling a wide class of constraints, a robust assignment cost under the NT model, based on a parametric measure that jointly takes into account not only the similarity but also the dissimilarity contribution, has been neither studied nor explored. (Many nonlinearities, e.g., ReLU, -norms, elastic net-like, -norm ratio, binary encoding, ternary encoding, etc., can be expressed and modeled by a nonlinear transform.)
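The transform model described above (a linear map followed by an element-wise nonlinearity) can be sketched in a few lines. This is only a minimal illustration, assuming a soft-thresholding nonlinearity with a hypothetical threshold `tau`; the names `A`, `x`, `tau` and `nt_representation` are illustrative choices and are not taken from the paper.

```python
import numpy as np

def nt_representation(A, x, tau):
    """Linear map followed by element-wise soft-thresholding.

    A   : (M, N) linear map (illustrative name)
    x   : (N,) data point
    tau : scalar or (M,) non-negative threshold (assumed parameter)
    """
    q = A @ x  # linear transform representation
    # Element-wise shrinkage: small components are set to zero,
    # the remaining ones are moved towards zero by tau.
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

# Overcomplete map: transform dimension M = 3 is larger than data dimension N = 2.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
x = np.array([2.0, 0.5])
y = nt_representation(A, x, tau=1.0)
```

Estimating the representation costs one matrix-vector product plus an element-wise operation, which is why the computational complexity is low. Replacing the shrinkage line with `np.maximum(q - tau, 0.0)` would give a ReLU-type nonlinearity instead.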
1.1 Nonlinear Transform Model, Assignment Principle and Learning Strategy Outline
In this paper, we introduce an assignment based nonlinear transform model for clustering.
Assignment Based NT Model We address the problem of estimating the parameters that model the probability of assigning a cluster and a nonlinear transform representation for input data by using the parameters and .
We motivate the use of , in order to extract useful data properties. If , a suitable prior allows us to model a metric (linear map) to achieve invariance (Li et al., 2016). In this paper, we model overcomplete with using the prior . Essentially, with our prior, we introduce redundancy in a constrained way, while we approximately preserve the properties of the original data in order to pronounce discrimination among the assigned NT representations . Nonetheless, we note that even in the case of , we can use our model, which reduces to a nonlinear assignment in the original space of , i.e., . Our assignment measure is nonlinear. In general, not all nonlinear functions can highlight relevant data properties that are related to discrimination. We consider a piecewise linear nonlinearity. In order to address the robustness in the assignment, we explicitly model parameters related to both the similarity and the dissimilarity contribution. Their role is the discrimination functionality that we address by using a composite min-max assignment measure. In its evaluation, due to the decomposition of the min-max assignment measure, the pair from has an additional interpretation. It is viewed as an NT-specific parameter, which is used to produce the respective candidate NT representation.
Cluster and NT Representation Assignment During cluster assignment, instead of description by clusters, we rely on candidate NTs. We estimate a single candidate NT representation as a solution to a direct, constrained projection problem. To attain unique and distinctive patterns in the resulting candidate NT representation, we parameterize the corresponding candidate NT with shared and distinct that are used for the element-wise nonlinearity.
When (1) is used, has only one representation that usually under certain similarity score is related to the likelihood of the assignment w.r.t. clusters. On the other hand, in our model, we apply a number of candidate NTs on a data point , which results in a number of candidate NT representations. Afterwards, our assignment is based on evaluating a min-max similarity/dissimilarity score using all of the candidate NT representations and their corresponding parameters . Nonetheless, based on the same assignment score, we describe by only one candidate NT representation. In fact, in this way, we simultaneously assign both the cluster index and the NT representation based on the evaluation of the min-max measure.
Learning Strategy In order to estimate the parameters of our model, we consider . Its maximization over and is difficult. We address a point-wise approximation to the marginal , which allows us to derive an efficient learning algorithm.
Compared with the factorization based clustering methods (1), the fundamental difference of our approach is the used model. The factorization/decomposition model addresses the joint data reconstruction and cluster estimation with constraints by solving an inverse problem, w.r.t. the model , where is the error vector. In our assignment based nonlinear transform model, we address joint learning of data projections (NTs) with information loss and discriminative priors, by solving direct, constrained projection problems, based on the candidate models in the form , where is the NT error vector.
1.2 Contributions
In the following, we outline our contributions.
(i) We introduce a novel cluster assignment principle that is centered on two elements: (1) joint modeling and learning of nonlinear transforms (NTs) with priors and (2) cluster and NT representation assignment based on a min-max score. To the best of our knowledge, our discriminative assignment principle is the first of its kind that:
- (a) Introduces a clustering concept that is based on modeling a direct problem
- (b) Addresses a trade-off between robustness in the cluster assignment and the NT representation compactness by allowing reduction or extension of the NT dimensionality while increasing or decreasing the number of the discrimination parameters
- (c) Offers cluster assignment over a wide class of similarity score functions, including a min-max score, while enabling efficient estimation of the NT representation
- (d) Allows a rejection option and cluster grouping over continuous, discontinuous and overlapping regions in the transform domain.
(ii) We propose an efficient learning strategy in order to estimate the parameters of the NTs. We implement it by an iterative alternating algorithm with three steps. At each step, we give an exact or approximate closed-form solution.
(iii) We present numerical experiments that validate our model and learning principle on several image data sets. Our preliminary results on an image clustering task demonstrate advantages in comparison to the state-of-the-art methods, w.r.t. the computational efficiency in training and test time and the used clustering performance measures.
2 Related Work
In the following, we describe the related prior work.
K-means, Matrix Factorization Models and Dictionary Learning Factor analysis (Child, 2006) and matrix factorization (Hoyer and Dayan, 2004) rely on decomposition on hidden features without or with constraints. One special case with only a constraint on the sparsity of the hidden representation, which is considered as a "hard" assignment, is the basic k-means (Cover and Thomas, 2006) algorithm. When discrimination constraints are present, they act as regularization, which was mainly defined using labels in the discriminative dictionary learning methods (Jiang et al., 2013), (Cai et al., 2014) and (Shekhar et al., 2014).
Kernel, Subspace and Manifold Based Clustering Intended to capture the nonlinear structure of the data with outliers and noise, the kernel k-means algorithms (Dhillon et al., 2004) and (Chitta et al., 2011) have been proposed. Also, many subspace clustering methods were proposed (Vidal, 2011), (Ma et al., 2008), (Ma et al., 2007), (Lu et al., 2012), (Elhamifar and Vidal, 2009) and (Bradley and Mangasarian, 2000). Commonly they consist of (i) subspace learning via matrix factorization and (ii) grouping of the data into clusters in the learned subspace. Some authors (Daitch et al., 2009) even include a graph regularization into the subspace clustering.
Discriminative Clustering In (Xu et al., 2005) clustering with maximum margin constraints was proposed. The authors in (Bach and Harchaoui, 2008) proposed linear clustering based on a linear discriminative cost function with convex relaxation. In (Krause et al., 2010) regularized information maximization was proposed and simultaneous clustering and classifier training was performed. The above methods rely on kernels and incur high computational complexity.
Self-Supervision, Self-Organization and Auto-Encoders In self-supervised learning (Doersch et al., 2015), (Pathak et al., 2016) the input data determine the labels. In self-organization (Kohonen, 1982), (Vesanto and Alhoniemi, 2000) a neighborhood function is used to preserve the topological properties of the input space. Both of the approaches leverage implicit discrimination using the data. The single layer auto-encoder (Baldi, 2011) and its denoising extension (Vincent et al., 2010) consider robustness to noise and reconstruction. While the idea is to encode and decode the data using a reconstruction loss, an explicit constraint that enforces discrimination is not addressed.
3 Assignment Based NT With Priors
In the following, we introduce our deterministic model and show how and are used in the NTs to produce the candidate representations and how we perform cluster and NT representation assignment. Given and , we model an assignment over candidate nonlinear transforms:
[TABLE]
which are defined by the corresponding set of parameters:
[TABLE]
All of the candidate nonlinear transforms in the set share the linear map and have distinct and . A single from the set is indexed using the index pair or using the single index computed as , and .
A compact description of our assignment model that takes into account candidate NT representations while evaluating a parametric discrimination score is the following:
[TABLE]
[TABLE]
where and are measures, which will be explained in detail in the following subsection. A single candidate nonlinear transform model is defined as follows:
[TABLE]
and is the parametric candidate NT that produces the candidate NT representation , by using the set of parameters , while is the NT error vector.
3.1 The Probabilistic Assignment Based NT Model
In probability, we use the following model:
[TABLE]
Furthermore, we use the Bayes’ rule, disregard the prior and focus on the proportional relation, i.e.:
[TABLE]
In our model, the probability takes into account the NT error and the discrimination parameter adjustment error, while is the parametric discriminative prior, which we simplify by the following assumption:
[TABLE]
where we disregard the dependences on . We denote the discrimination related parameters as:
[TABLE]
which are used in our min-max assignment over dissimilarity and similarity contributions w.r.t. all pairs and .
3.2 Assignment and Adjustment Modeling
We model the assignment based nonlinear transform together with the discrimination parameters using the r.h.s. of (8).
Using two measures, we define as follows:
[TABLE]
where and take into account the NT and the discrimination parameter adjustment errors, respectively, while and are scaling parameters.
NT Error Note that in (4), is evaluated using an assignment over the candidate NT representations that result from applying the nonlinear transform on . Therefore, we can say that the term , i.e., is the nonlinear transform error vector that represents the deviation of from the targeted transform representation . In the simplest form, we assume to be Gaussian distributed and we model:
[TABLE]
NT Parameters Adjustment We assume that the adjustment of any of the NT discrimination parameters is w.r.t. the following measure:
[TABLE]
where and denote the indexes of the corresponding and that are related to the assigned from the set of candidate representations under model (4) with the criteria (5).
With (13) and the discrimination parameter prior (that we will describe in the next subsection), we assume that the linear transform representation decomposes into two distinct components, which respectively are related to the dissimilarity and similarity parameters and . The prior (13) is crucial for a proper adjustment between the linear mapping and the pairs as well as for enabling the candidate NTs to discriminate in the transform domain based on their respective pair .
3.3 Priors Modeling
A prior is used to allow adequate regularization of the coherence and conditioning on the transform matrix , whereas the joint modeling of the NTs is enabled by using the prior .
Minimum Information Loss Prior
By the term "minimum information loss" we mean that the linear map approximately preserves the data properties in the transform space. In order to simplify, we assume that and define our prior as , where (Ravishankar and Bresler, 2014) and (Kostadinov et al., 2018). Under this prior measure, we essentially relate our notion of "information loss" to constraining the conditioning and the expected coherence of .
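To make the idea of constraining the conditioning and the coherence of the linear map more concrete, the sketch below evaluates one penalty of this general kind: a Frobenius norm term, a term that penalizes the deviation of the Gram matrix from the identity, and a negative log-determinant term (the Frobenius and log-determinant terms are familiar from sparsifying transform learning). This is an assumed form for illustration only and is not claimed to be the exact prior of the paper; the weights `lam_f`, `lam_g`, `lam_d` and the function name `info_loss_penalty` are hypothetical.

```python
import numpy as np

def info_loss_penalty(A, lam_f=1.0, lam_g=1.0, lam_d=1.0):
    """Assumed regularizer on a linear map A of shape (M, N) with M >= N.

    Small values indicate a well-conditioned map with weakly
    correlated rows; the weights are hypothetical.
    """
    M = A.shape[0]
    frob = np.sum(A ** 2)                           # scale of the map
    gram_dev = np.sum((A @ A.T - np.eye(M)) ** 2)   # row correlation / coherence
    _, logabsdet = np.linalg.slogdet(A.T @ A)       # near-singular maps give a very negative value
    return lam_f * frob + lam_g * gram_dev - lam_d * logabsdet

well_conditioned = np.eye(2)
badly_conditioned = np.diag([1.0, 1e-3])
p_good = info_loss_penalty(well_conditioned)
p_bad = info_loss_penalty(badly_conditioned)
```

In this toy comparison the nearly singular map receives a much larger penalty than the identity map, which is the qualitative behavior one would expect from a prior that discourages information loss.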
Discrimination Prior
We model a discrimination prior as:
[TABLE]
where and are NT representation and NT parameters related measures, respectively, that have discrimination role, while is our sparsity measure and and are scaling parameters.
Compositional Min-max Discrimination Measure To define we assume that:
- (i) The relation between and is determined on the *vector support intersection* between , and
- (ii) The min-max description is decomposable w.r.t. and
- (iii) The support intersection relation is specified based on two measures defined on the support intersection.
We define the two measures and as and , where ,, and . The measure represents our similarity score. (When is considered, captures the contribution for the similarity, whereas captures the contribution for the dissimilarity between the vectors and .) On the other hand, measures only the strength on the support intersection. We use these measures to allow a discrimination constraint without any explicit assumption about the space/manifold in the transform domain.
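As an illustration of measures defined on the support intersection of two vectors, the sketch below assumes definitions of the kind used in related nonlinear transform learning work: the similarity score sums the element-wise products of the same-sign parts of the two vectors, and the strength measure is the squared l2 norm of their element-wise product. Both definitions, as well as the names `split_signs`, `similarity` and `strength`, are assumptions made for this example.

```python
import numpy as np

def split_signs(v):
    """Return the positive part and the (non-negative) negative part of v."""
    return np.maximum(v, 0.0), np.maximum(-v, 0.0)

def similarity(a, b):
    """Assumed similarity score: same-sign overlap on the shared support."""
    a_pos, a_neg = split_signs(a)
    b_pos, b_neg = split_signs(b)
    return np.sum(a_pos * b_pos) + np.sum(a_neg * b_neg)

def strength(a, b):
    """Assumed strength measure on the support intersection (sign-agnostic)."""
    return np.sum((a * b) ** 2)

a = np.array([1.0, -2.0, 0.0, 3.0])
b = np.array([2.0, 1.0, 5.0, -1.0])
sim_ab = similarity(a, b)   # only the first component has matching signs
str_ab = strength(a, b)     # every component where both vectors are non-zero contributes
```

Components where either vector is zero (the third one here) contribute to neither measure, so both depend only on the support intersection and make no assumption about the geometry of the transform domain.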
Based on the above assumptions (i), (ii) and (iii), is defined as follows:
[TABLE]
The measure (15) ensures that in the transform domain will be located at the point where:
- (a) The similarity contribution w.r.t. is the smallest measured w.r.t.
- (b) The strength of the support intersection w.r.t. is the smallest measured w.r.t.
- (c) The similarity contribution w.r.t. is the largest measured w.r.t. , i.e., smallest w.r.t. .
Discrimination Parameters Prior Measure The measure is defined as:
[TABLE]
[TABLE]
The advantage of using (3.3) is that: (i) it allows a non-uniform cover of the transform space in an arbitrarily coarse or dense way, (ii) it makes it possible to represent a wide range of transform space regions, including non-continuous, continuous and overlapping regions and (iii) at the same time it enables to be described and concentrated on the most important part of the transform space related to discrimination.
4 Problem Formulation and Learning Algorithm
Minimizing the exact negative logarithm of our learning model over and is difficult since we have to integrate in order to compute the marginal and the partitioning function of the prior (14).
4.1 Problem Formulation
Instead of minimizing the exact negative logarithm of the marginal , we consider minimizing the negative logarithm of its maximum point-wise estimate, i.e., , where we assume that are the parameters for which has the maximum value and is a constant. Furthermore, we use the proportional relation (8) and by disregarding the partitioning function related to the prior (14), we end up with the following problem formulation:
[TABLE]
where are parameters inversely proportional to .
4.2 The Learning Algorithm
Note that solving (17) jointly over , and is again challenging. Alternatively, the solution of (17) per any of the variables and can be seen as an integrated marginal maximization (IMM) of that is approximated by , which is equivalent to:
1) Approximately maximizing with and the prior over
2) Approximately maximizing with and the prior over
3) Approximately maximizing with and the prior over .
In this sense, based on the IMM principle, we propose an iterative, alternating algorithm that has three stages: (i) cluster and NT representation assignment, (ii) discrimination parameters update and (iii) linear map update.
Stage 1: Cluster and NT Representation Assignment
Given the data samples , the current estimate of , and (note that if then ), the NT representations estimation problem is formulated as:
[TABLE]
where are inversely proportional to the scaling parameters . Furthermore, given , and , for any , (18) reduces to a constrained projection problem:
[TABLE]
where we derived the last expression by moving the minimization outwards from , while is a vector whose elements are the absolute values of the elements in , thus .
In the following, we propose a solution to (19), which consists of two steps: (i) candidate NT representations estimation and (ii) cluster index and representation assignment.
*Candidate NT Representations Estimation* Assuming that per each pair ,, , then the problem related to candidate NT representation estimation considers only the cost from (19) and is defined as:
[TABLE]
[TABLE]
[TABLE]
The closed form solution to (20) is:
[TABLE]
where and are the Hadamard product and division, while:
[TABLE]
The variable is defined as:
[TABLE]
where , and is a solution to a quartic polynomial (Appendix A).
*Assignment*
This step consists of two parts.
Part 1
Given all , , the first part evaluates a score related to as follows:
[TABLE]
Part 2
In the second part, we assume that the costs in the respective subproblems (20) across all of the estimated are approximately equal, i.e.:
[TABLE]
which is a reasonable assumption when the sparsity level is the same for all . Therefore, we disregard and based on the score (24), we assign the cluster index and the NT representation as follows:
[TABLE]
where the evaluation w.r.t. reduces to computing a minimum score over as in (26).
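The two-step structure of this stage (produce one candidate NT representation per parameter pair, then keep the pair with the minimum score) can be sketched as follows. The candidate parameterization and the score used here are simplifying assumptions chosen for illustration and are not the closed-form expressions of the paper; `candidate`, `score`, `assign`, `taus`, `nus` and `lam` are hypothetical names.

```python
import numpy as np

def similarity(a, b):
    # Assumed same-sign overlap on the support intersection.
    return (np.sum(np.maximum(a, 0) * np.maximum(b, 0))
            + np.sum(np.maximum(-a, 0) * np.maximum(-b, 0)))

def strength(a, b):
    # Assumed sign-agnostic strength on the support intersection.
    return np.sum((a * b) ** 2)

def candidate(q, tau, nu, lam=0.5):
    # Assumed thresholding rule: a strong dissimilarity parameter raises
    # the threshold, a strong similarity parameter lowers it.
    t = np.maximum(lam * (np.abs(tau) - np.abs(nu)), 0.0)
    return np.sign(q) * np.maximum(np.abs(q) - t, 0.0)

def score(y, tau, nu):
    # Assumed score: low overlap with tau and high overlap with nu is good.
    return similarity(y, tau) + strength(y, tau) - similarity(y, nu)

def assign(A, x, taus, nus):
    """Return (score, c1, c2, representation) of the best parameter pair."""
    q = A @ x
    best = (np.inf, None, None, None)
    for c1, tau in enumerate(taus):          # step 1: one candidate per pair
        for c2, nu in enumerate(nus):
            y = candidate(q, tau, nu)
            s = score(y, tau, nu)
            if s < best[0]:                  # step 2: keep the minimum score
                best = (s, c1, c2, y)
    return best

A = np.eye(2)
x = np.array([2.0, 0.0])
taus = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # dissimilarity parameters
nus = [np.array([1.0, 0.0])]                          # similarity parameters
s, c1, c2, y = assign(A, x, taus, nus)
```

In this toy case the data point is assigned to the pair whose dissimilarity parameter it does not overlap and whose similarity parameter it overlaps most. The cluster index and the representation are obtained together from the same minimum, mirroring the joint assignment described above.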
Stage 2: Parameters Update
Given the estimated NT representations , the linear map and , the problem related to update of the parameters reduces to the following form:
[TABLE]
where is inversely proportional to and is the measure described in Section 3.3. Note that in the cluster and NT representation assignment step (Stage 1, Part 2 of our algorithm), for each the corresponding and are known. Therefore, at this stage, we do not evaluate the terms w.r.t. . Instead, we use the already evaluated scores based on the assignment w.r.t. .
In the following, we present the problems related to update of the parameters and and comment on the solutions, which represent a slight extension to the previous one.
* Update Per Single * Given , , and using , problem (27), per reduces to:
[TABLE]
The solution for (28) is similar to the solution given by (21), (24) and (26). That is, compared to (21), in the part of the solution of (28) that is related to candidate NT estimation, both the respective thresholding and normalization vectors have additional terms (we give the exact expression and proof in Appendix B.1).
* Update Per Single * Given , , and using , problem (27), per reduces to:
[TABLE]
In this update, (29) is solved iteratively, where per each iteration the solution for the candidate NT representation is similar to the solution for (21), but is estimated using different thresholding and normalization vectors (for the exact expression and proof please see Appendix B.2).
Stage 3: Linear Map Update
Given the data samples , the corresponding transform representations and the discrimination parameters , the problem related to the estimation of the linear map , reduces to:
[TABLE]
where and is expressed as:
[TABLE]
while and denote the corresponding and that appear in the NT, which is used to estimate , and the parameters are inversely proportional to the scaling parameters . We solve (30) using an approximate closed form solution as proposed in (Kostadinov et al., 2018).
We point out that when , then this stage is omitted and our learning algorithm reduces to alternating between cluster and representation assignment and update of .
5 Approach Evaluation
This section evaluates the advantages and the potential of the proposed algorithm and compares its clustering performance to the state-of-the-art methods.
5.1 Data, Setup and Measures
Data Sets
The used data sets are E-YALE-B (Georghiades et al., 2001), AR (Martínez and Benavente, 1998), ORL (Samaria et al., 1994) and COIL (Nene et al., 1996). All the images from the respective data sets were downscaled to resolutions , , and , respectively, and normalized to unit variance.
Algorithm and Clustering Set Up
The used setup is described in the following text.
*On-Line Version* An on-line variant is used for the update of w.r.t. a subset of the available training set. It has the following form where and are the solutions in the transform update step at iterations and , which is equivalent to having the additional constraint in the related problem. The used batch size is equal to and of the total amount of the available training data from the respective data sets E-YALE-B, AR, ORL and COIL.
*Clustering Setup, Cluster Index and NT Estimation* We assume that the number of clusters per database is known. We set the number of parameters that are related to dissimilarity to be close to the number of actual clusters , i.e., and we set the number of parameters related to similarity to be small, i.e., is small. The cluster index and the NT are estimated based on the minimum score of the discriminative functional measure as explained in Section 4.2. As evaluation metrics for the clustering performance we use the cluster accuracy (CA) and the normalized mutual information (NMI) (Cai et al., 2011).
*Algorithm Set-up* The parameters , , , the transform dimension is . The algorithm is initialized with and having i.i.d. Gaussian (zero mean, unit variance) entries and is terminated after the th iteration. The results are obtained as the average of runs.
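The two clustering measures named in the setup above, CA and NMI, can be computed as in the following sketch. It assumes that CA is the accuracy under the best one-to-one relabeling of the predicted clusters and that NMI is normalized by the larger of the two label entropies; other normalizations (for example the arithmetic or geometric mean of the entropies) are also in common use, so the choice here is an assumption. The function names are illustrative.

```python
from collections import Counter
from itertools import permutations
from math import log

def cluster_accuracy(true, pred):
    """Accuracy under the best one-to-one mapping of predicted to true labels."""
    t_ids, p_ids = sorted(set(true)), sorted(set(pred))
    # Pad with None so that every predicted cluster can be mapped somewhere.
    t_ids = t_ids + [None] * max(0, len(p_ids) - len(t_ids))
    pairs = Counter(zip(pred, true))
    best = max(sum(pairs[(p, t)] for p, t in zip(p_ids, perm))
               for perm in permutations(t_ids, len(p_ids)))
    return best / len(true)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(true, pred):
    """Mutual information divided by the larger label entropy (assumed convention)."""
    n = len(true)
    ct, cp, joint = Counter(true), Counter(pred), Counter(zip(true, pred))
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = max(entropy(true), entropy(pred))
    return mi / denom if denom > 0 else 1.0

true = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]      # same partition, different label names
imperfect = [0, 0, 0, 1, 2, 2]    # one point placed in the wrong cluster
ca_perfect, nmi_perfect = cluster_accuracy(true, perfect), nmi(true, perfect)
ca_imperfect, nmi_imperfect = cluster_accuracy(true, imperfect), nmi(true, imperfect)
```

Both measures are invariant to how the clusters are named. The exhaustive search over relabelings is only practical for a small number of clusters; for larger problems the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) is the usual replacement.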
5.2 Numerical Experiments
Summary
Our experiments consist of three parts.
*NT Properties* In the first series of experiments, we investigate the properties of the proposed algorithm. We measure the run time of the proposed algorithm, the conditioning number ( and are the smallest and the largest singular values of , respectively) and the expected mutual coherence as in (Kostadinov et al., 2018) of the shared linear map in the learned NTs.
*Clustering and k-NN Classification Performance* In the second part, we measure the performance across all databases and report the CA and NMI. We also split every database into training and test sets and learn NTs with the proposed algorithm on the training set. We use the learned NTs to assign a representation for the test data and then perform a k-NN (Cover and Thomas, 2006) search using the test NT representation on the training NT representation.
*Proposed Method vs State-Of-The-Art* This part compares the proposed method w.r.t. results reported by five state-of-the-art methods: CASS (Lu et al., 2013), GSC (Zheng et al., 2011), NSLRR (Yin et al., 2016), SDRAM (Guo, 2015) and RGRSC (Kodirov et al., 2016).
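For the NT properties measured in the first part of the experiments, the sketch below shows how the conditioning number and a coherence statistic of a linear map could be computed. The conditioning number is taken as the ratio of the largest to the smallest singular value. The coherence computation is an assumption: it uses the absolute normalized inner products between distinct rows of the map and reports their maximum and their mean, which may differ from the exact definition of the expected mutual coherence in (Kostadinov et al., 2018). The function names are illustrative.

```python
import numpy as np

def condition_number(A):
    """Ratio of the largest to the smallest singular value of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return s.max() / s.min()

def coherence_stats(A):
    """Maximum and mean absolute correlation between distinct rows of A.

    Treating the rows as the transform atoms and using the mean as the
    'expected' coherence are assumptions made for this sketch.
    """
    rows = A / np.linalg.norm(A, axis=1, keepdims=True)
    G = np.abs(rows @ rows.T)                       # normalized Gram matrix
    off_diag = G[~np.eye(len(G), dtype=bool)]       # drop self-correlations
    return off_diag.max(), off_diag.mean()

B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy overcomplete map
kappa = condition_number(B)
mu_max, mu_mean = coherence_stats(B)
```

A conditioning number close to one and a small coherence indicate that the map neither collapses directions of the data nor duplicates information across its rows.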
Evaluation Results
We show the results in Tables 1, 2, 3, 4 and 5.
NT Properties As shown in Table 1, the learned NTs for all the data sets have relatively low computational time per iteration. All linear maps in the NTs have good conditioning numbers and low expected coherence.
Clustering Performance The results of the clustering performance over the databases E-YALE-B (Georghiades et al., 2001), AR (Martínez and Benavente, 1998), ORL (Samaria et al., 1994) and COIL (Nene et al., 1996) are shown in Table 2. We see that both the CA and the NMI measures have high values. The highest performance is reported on the E-YALE-B (Georghiades et al., 2001) database, where the CA and NMI are and , respectively.
Proposed vs State-Of-The-Art Clustering The results are shown in Tables 3 and 4. As we see, the proposed algorithm outperforms the state-of-the-art methods CASS (Lu et al., 2013), GSC (Zheng et al., 2011), NSLRR (Yin et al., 2016), SDRAM (Guo, 2015) and RGRSC (Kodirov et al., 2016). The highest gains in CA and NMI w.r.t. the state-of-the-art are and , respectively, achieved on the E-YALE-B (Georghiades et al., 2001) and the COIL (Nene et al., 1996) databases, respectively.
k-NN Classification Performance The results of the k-NN performance on all databases are shown in Table 5. As a baseline we use k-NN on the original data and report improvements of , , and over the baseline results for the respective databases.
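The k-NN evaluation on the NT representations can be sketched as below. The sketch assumes the Euclidean distance and a majority vote over the `k` nearest training representations; the distance, the value of `k` and the name `knn_predict` are assumptions for illustration. The same function applied to the original data would give the baseline.

```python
import numpy as np
from collections import Counter

def knn_predict(train_Y, train_labels, test_Y, k=1):
    """Label each test representation by a majority vote of its k nearest
    training representations (squared Euclidean distance, assumed)."""
    diff = test_Y[:, None, :] - train_Y[None, :, :]
    dist = (diff ** 2).sum(axis=2)                  # shape (n_test, n_train)
    nearest = np.argsort(dist, axis=1)[:, :k]       # indices of the k closest
    return [Counter(train_labels[j] for j in row).most_common(1)[0][0]
            for row in nearest]

# Toy representations standing in for NT representations of train/test data.
train_Y = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
train_labels = [0, 0, 1, 1]
test_Y = np.array([[0.2, 0.1], [4.8, 5.4]])
pred_1nn = knn_predict(train_Y, train_labels, test_Y, k=1)
pred_3nn = knn_predict(train_Y, train_labels, test_Y, k=3)
```

Because the labels are only used in the vote and never in learning the transforms, this evaluation measures how well the unsupervised representations separate the classes.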
6 Conclusion
In this paper, we modeled an assignment based NT with priors. A novel clustering concept was introduced where we (i) jointly learn the NTs with priors and (ii) assign the cluster and the NT representation based on maximum likelihood over a functional measure. Given the observed data, an empirical approximation to the maximum likelihood of the model gives the corresponding problem formulation. We proposed an efficient solution for learning the model parameters by a low complexity iterative alternating algorithm.
The proposed algorithm was evaluated on publicly available databases. The preliminary results showed promising performance. In a clustering regime, w.r.t. the used CA and NMI measures, the algorithm gives improvements compared to the state-of-the-art methods. In an unsupervised k-NN classification regime, it demonstrated high classification accuracy.
- Aharon et al. (2006): M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Trans. Sig. Proc., 54(11):4311–4322, November 2006.
- Bach and Harchaoui (2008): Francis R. Bach and Zaïd Harchaoui. Diffrac: a discriminative and flexible framework for clustering. In Advances in Neural Information Processing Systems, pages 49–56, 2008.
- Baldi (2011): Pierre Baldi. Autoencoders, unsupervised learning and deep architectures. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop - Volume 27, UTLW'11, pages 37–50. JMLR.org, 2011.
- Bradley and Mangasarian (2000): Paul S. Bradley and Olvi L. Mangasarian. K-plane clustering. Journal of Global Optimization, 16(1):23–32, 2000.
- Cai et al. (2011): Deng Cai, Xiaofei He, Jiawei Han, and Thomas S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1548–1560, August 2011.
- Cai et al. (2014): Sijia Cai, Wangmeng Zuo, Lei Zhang, Xiangchu Feng, and Ping Wang. Support vector guided dictionary learning. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV, pages 624–639, 2014.
- Child (2006): D. Child. The Essentials of Factor Analysis. Bloomsbury Academic, 2006.
- Chitta et al. (2011): Radha Chitta, Rong Jin, Timothy C. Havens, and Anil K. Jain. Approximate kernel k-means: Solution to large scale kernel clustering. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 895–903, New York, NY, USA, 2011. ACM.
