Goodness-of-fit Test for Latent Block Models
Chihiro Watanabe, Taiji Suzuki

TL;DR
This paper introduces a new statistical goodness-of-fit test for latent block models, enabling validation of cluster numbers in relational data matrices, extending prior methods limited to symmetric stochastic block models.
Contribution
The study develops the first goodness-of-fit test for latent block models, applicable to non-symmetric matrices with separate row and column clusters, using random matrix theory.
Findings
Test statistic exhibits expected asymptotic behavior
Method accurately determines the correct number of clusters
Effective in various simulated data scenarios
Abstract
Latent block models are used for probabilistic biclustering, which is shown to be an effective method for analyzing various relational data sets. However, there has been no statistical test method for determining the row and column cluster numbers of latent block models. Recent studies have constructed statistical-test-based methods for stochastic block models, which assume that the observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns. In this study, we developed a new goodness-of-fit test for latent block models to test whether an observed data matrix fits a given set of row and column cluster numbers, or it consists of more clusters in at least one direction of the row and the column. To construct the test method, we used a result from the random matrix theory for a sample covariance matrix. We experimentally demonstrated the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRandom Matrices and Applications · Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference
Goodness-of-fit Test for Latent Block Models
Chihiro Watanabe [email protected] Graduate School of Information Science Technology, The University of Tokyo, Tokyo, Japan
Taiji Suzuki [email protected] Graduate School of Information Science Technology, The University of Tokyo, Tokyo, Japan
Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan
Abstract
Latent block models are used for probabilistic biclustering, which is shown to be an effective method for analyzing various relational data sets. However, there has been no statistical test method for determining the row and column cluster numbers of latent block models. Recent studies have constructed statistical-test-based methods for stochastic block models, which assume that the observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns. In this study, we developed a new goodness-of-fit test for latent block models to test whether an observed data matrix fits a given set of row and column cluster numbers, or it consists of more clusters in at least one direction of the row and the column. To construct the test method, we used a result from the random matrix theory for a sample covariance matrix. We experimentally demonstrated the effectiveness of the proposed method by showing the asymptotic behavior of the test statistic and measuring the test accuracy.
1 Introduction
Block modeling [18, 2] is known to be effective in representing various relational data sets, such as the data sets of movie ratings [44], customer-product transactions [44], congressional voting [27], document-word relationships [12], and gene expressions [40]. Latent block models or LBMs [17] are used for probabilistic biclustering of such relational data matrices, where rows and columns represent different objects. For instance, suppose that a matrix represents the relationship between users and movies, where entry is the rating of the th movie by the th user. In LBMs, we assume a regular-grid block structure behind the observed matrix ; i.e., both rows (users) and columns (movies) of matrix are simultaneously decomposed into latent clusters. A block is defined as a combination of row and column clusters, and entries of the same block in matrix are supposed to be i.i.d. random variables.
An open problem in using LBMs is that there has been no statistical criterion for determining the numbers of row and column clusters. Recently, statistical-test-based approaches [5, 29, 21] have been proposed for estimating the cluster number of stochastic block models (SBMs) [19]. SBMs are similar to LBMs in the sense that they assume a block structure behind an observed matrix; however, they are based on different assumptions from LBMs that an observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns [33]. In regard to the LBM setting, no statistical method has been constructed to determine row and column cluster numbers.
Aside from the test-based methods, several model selection approaches have been proposed based on cross-validation [8] or an information criterion [26, 38, 27]. However, these approaches have several limitations. (1) First, they cannot provide knowledge about the reliability of the result besides the finally estimated cluster numbers. Rather than minimizing the generalization error, in some cases, it is more appropriate to provide a probabilistic guarantee in reliability for the purpose of knowledge discovery. (2) Second, both the cross-validation-based and information-criterion-based methods depend on the clustering algorithm used. For instance, we can employ the Bayesian information criterion (BIC) for estimating the marginal likelihood only if the Fisher information matrix of the model is regular, which is not the case for block models. Constructing an information criterion that estimates the expectation of the generalization error for a wider class of models is generally difficult. (3) Finally, the above methods require relatively large computational complexity. Computation of an information criterion requires the process of approximating the posterior distribution by the Markov chain Monte Carlo (MCMC) method, and cross-validation requires the iterative calculation of the test error with different sets of partitions of the training and test data sets.
In this study, we proposed a new statistical test method for LBMs. To construct a hypothesis test with a theoretical guarantee, we used a result from random matrix theory. Recent studies on random matrix theory have revealed the asymptotic behavior of singular values of an random matrix [16, 45, 52, 3, 22, 23, 46, 37, 39, 4, 13]. Here, we assume that each entry of matrix , which is given by (which is computed by the original matrix , its block-wise mean and standard deviation ) follows a distribution with a sub-exponential decay. From the result in [39], the normalized maximum eigenvalue of converges in law to the Tracy-Widom distribution with index , under the above sub-exponential condition. Based on this result, we constructed a goodness-of-fit test for a given set of row and column cluster numbers of an LBM, using the maximum singular value of matrix , which is an estimator of the matrix . We proved that under the null hypothesis (i.e., observed matrix consists of a given set of row and column cluster numbers), the proposed test statistic converges in law to the Tracy-Widom distribution with index (Theorem 4.1). We also showed that under the alternative hypothesis, test statistic increases in proportion to with a high probability, where is a number proportional to the matrix size (Theorems 4.2 and 4.3).
The proposed method solves the limitations of other model selection approaches. (1) Our statistical test method enables us to obtain knowledge about the reliability of the test results. When testing a given set of row and column cluster numbers, we can explicitly set the probability of Type I error (or false positive) as a significance level . (2) Unlike the other model selection methods, the proposed method does not depend on the clustering algorithm as long as it satisfies the consistency condition (Section 2). It only uses the output of a clustering algorithm to test a given set of cluster numbers; there is no need to modify the test method according to the clustering algorithm. (3) The proposed test method requires relatively small computational complexity. It does not require the MCMC procedure or partitioning into the training and test data sets. For these reasons, the proposed test-based method can be widely used for the purpose of knowledge discovery.
The next sections consist of the detailed explanation of the proposed test method for LBMs. In Section 2, we describe the proposed goodness-of-fit test and its theoretical guarantee with the assumptions required for the problem setting. Next, we briefly review the related works and their differences from the proposed method in Section 3. The main results are presented in Section 4, where we prove the asymptotic properties of the proposed test statistic. In Section 5, we experimentally demonstrate the effectiveness of the proposed test method by showing the asymptotic behavior of the test statistic and calculating the test accuracy. We discuss the results and limitations of the proposed method in Section 6 and conclude the paper in Section 7.
2 Problem setting and statistical model for goodness-of-fit test for latent block models
Let be an observed matrix. We assume that each entry of matrix is independently generated, given its row and column clusters. Let be the null set of cluster numbers for rows and columns of an observed matrix , which is unknown in advance. We denote the cluster indices of the th row and the th column of matrix as and , respectively. We assume that each entry of matrix is independently subject to a distribution with block-wise mean and block-wise standard deviation :
[TABLE]
where and , respectively, are the mean and the positive standard deviation of entries in the th null block under the null hypothesis.
In this paper, we propose a goodness-of-fit test for selecting the cluster numbers from observed matrix . In such a test, we test whether is equal to a given set of cluster numbers or at least one of the given row and column cluster numbers or is smaller than the null cluster numbers or . In other words, the null (N) and alternative (A) hypotheses are given by
[TABLE]
By sequentially testing the cluster numbers in the following order (Figure 1), we can select the cluster numbers of a given observed matrix .
Test . 2. 2.
Test . 3. 3.
Test . 4. 4.
5. 5.
Test . Let be the row and column cluster numbers where the null hypothesis is accepted and holds. The selected set of cluster numbers is .
Based on the above sequentially ordered test, selection of the cluster numbers requires tests at most.
Assumptions.
Throughout this paper, we make the following assumptions to derive the test statistics:
- (i).
We assume that a distribution of , which is given by as in (4) later, has a sub-exponential decay. That is, there exists some such that for , . From this assumption, note that for any , the th moment of a random variable is finite (i.e., ). 2. (ii).
We denote the number of rows and columns of matrix as and , respectively. We assume that both and increase in proportion to some sufficiently large number (i.e., ). 3. (iii).
Let and , respectively, be the minimum row and column cluster numbers to represent the block structure of observed matrix under the null hypothesis. We assume that both and are finite constants that do not increase with the matrix sizes and . We also assume that the minimum row and column sizes of a block in the null block structure, which we denote as and , respectively, satisfy and , where we used the following notation:
[TABLE]
In other words, we assume that with high probability, there is no “too small” block in matrix . 4. (iv).
If the given set of cluster numbers is equal to the null cluster numbers , then we call it a realizable case. Otherwise, we call it an unrealizable case ( or ). In Section 4, we see that Theorems 4.2 and 4.3 guarantee the behavior of the test statistic in unrealizable cases. For now, there is no way to detect the cases where or holds, and to cope with such settings is beyond the scope of this paper. 5. (v).
In the realizable case, we assume that a clustering algorithm is consistent, that is, the probability that it outputs the correct block structure converges to , in the limit of . By using this assumption, the proposed method does not depend on a specific clustering algorithm. Several clustering algorithms including [15, 1, 7] have been proven to be consistent.
3 Relation to existing works
In this section, we briefly review the related works and explain the differences between them and the proposed method.
3.1 Model selection for block models
Statistical-test-based methods (for SBM)
Recently, several methods have been proposed for testing the properties of a given observed matrix in relation to SBMs [5, 29, 24, 21, 53]. Particularly, the methods proposed in [5, 29, 21] have enabled us to estimate the number of blocks for SBMs. However, these methods differ from ours in the problem setting; they can be applied only to an SBM setting, where an observed matrix is a square symmetric matrix, and the cluster assignments are the same for rows and columns. There has been no method for estimating the block number for LBMs, where rows and columns (not necessarily square) of an observed matrix are simultaneously decomposed into clusters.
Cross-validation-based methods
Cross-validation is a widely used method for model selection, where a data set is first split into training and test data sets, and then the best model with the minimum test error is determined. Recently, cross-validation methods for matrix data have been proposed [11, 30, 25, 8] to determine the number of clusters in network data. Although the purpose of these methods and our method is similar, these methods differ from ours in that their target is the network data, where the observed matrix is square and its rows and columns represent the same node sets. Thus, the block structure is symmetric regardless of whether the network itself is directed or undirected). Moreover, unlike a statistical test, these methods cannot provide quantitative knowledge about the reliability of the selected model. Furthermore, the computational cost of cross-validation is generally high because it requires the iterative calculation of the test error with different data set partitions.
Information-criterion-based methods
Another approach for determining the number of blocks in a matrix is to estimate the generalization error or marginal likelihood by some information criterion for given sets of block numbers. By using such information criteria, we can select a model in a statistically meaningful (non-heuristic) way. In regard to block models, many variants of BIC have been proposed [26, 38, 27, 20, 34]. Unlike our test-based method, which only requires a clustering algorithm to satisfy the consistency condition (Section 2), an information criterion for a theoretical guarantee should be carefully chosen according to the given clustering algorithm. For instance, BIC can be employed for estimating the marginal likelihood only if the Fisher information matrix of the model is regular, which is not the case for block models.
To solve this problem, as an alternative criterion to BIC, the integrated completed likelihood (ICL) criterion has been used in many studies for estimating the number of blocks in LBMs [31, 51, 10]. In ICL, we first derive a marginal likelihood for a given set of an observed matrix and block assignments and then substitute the set of estimated block assignments to approximate the marginal likelihood. However, since ICL is computed based on a single estimator of block assignments, there is no guarantee for the goodness of the approximation of marginal likelihood.
Similar to cross-validation-based methods, information-criterion-based methods cannot provide a probabilistic guarantee for the reliability of the selected model, which is a disadvantage for the purpose of knowledge discovery. The computational cost also becomes a problem because the computation of an information criterion requires the process of approximating the posterior distribution by MCMC.
Other model selection methods
Aside from the information criteria, several studies have proposed to determine the number of blocks in LBMs based on the co-clustering adjusted rand index [42], the extended modularity for biclustering [28], or the expected posterior loss for a given loss function [41]. Another approach is to define the posterior distribution not only on cluster assignments of rows and columns but also on row and column cluster numbers [50, 36]. Unlike the model selection approaches, such nonparametric Bayesian methods can estimate the distribution of the block numbers. The best-fitted number of the blocks can be determined based on the posterior distribution (e.g., we can choose a MAP estimator [36]). However, in this case, the computational cost of MCMC is higher than that of the information-criterion-based methods because it requires a large number of iterations to approximate the posterior distribution both on the block assignments and the number of blocks.
4 Test statistic for determining the set of cluster numbers
To derive the test statistic for the proposed goodness-of-fit test, we first normalize each entry of an observed matrix by subtracting and dividing it by , where and , respectively, are the block-wise mean and standard deviation in (2):
[TABLE]
By definition, each entry of matrix in (4) independently follows a distribution with zero mean and standard deviation of one. Therefore, according to the result in [39], if and in the limit of , the scaled maximum eigenvalue of matrix converges in law to the Tracy-Widom distribution with index () in the limit of :
[TABLE]
where is the maximum eigenvalue of matrix and
[TABLE]
In most cases, the null cluster numbers and the null cluster assignments and are unknown in advance. Therefore, we can only estimate the block structure based on the observed matrix and the given cluster numbers. Let be the given set of row and column cluster numbers, and and , respectively, be the estimated cluster assignments for rows and columns. Based on such an estimated block structure , we estimate the block-wise mean and standard deviation by
[TABLE]
where is the set of row indices of matrix that are assigned to the th cluster, and is the set of column indices of matrix that are assigned to the th cluster:
[TABLE]
The consistency assumption (v) guarantees that if , the probability that the cluster assignments and are correct converges to in the limit of .
We define an estimator of normalized matrix in (4) based on the estimated block-wise mean and standard deviation in (4):
[TABLE]
The test statistic for the proposed goodness-of-fit test is given by the scaled maximum eigenvalue of matrix :
[TABLE]
where is the maximum eigenvalue of matrix , and and are given by (6).
Based on the following results in Theorems 4.1, 4.2 and 4.3, we propose a one-sided goodness-of-fit test for a given set of cluster numbers at the significance level of by using the test statistic :
[TABLE]
where is the upper quantile of the Tracy-Widom distribution with index . By applying the sequentially ordered test that we explained in Section 2 based on the above rejection rule (11), we can select a set of row and column cluster numbers for a given observed matrix .
In the proof of Theorem 4.1, we also use the following notations. Let and , respectively, be the sample mean and standard deviation of all the entries in the th null block in matrix . Based on such notations, we define the sample mean matrix and standard deviation matrix for the correct block structure, and matrix by:
[TABLE]
Theorem 4.1** (Realizable case).**
We assume that the following condition holds: , in the limit of . Under the consistency assumption (v) for the clustering algorithm, if ,
[TABLE]
in the limit of , where is defined as in (10).
Proof.
We denote the operator norm by ,
[TABLE]
First of all, we derive the difference between () and (), which have been defined in (2) and (4). Since the number of entries in the block is proportional to by the assumption (iii), converges to from the central limit theorem. Therefore, from Prokhorov’s theorem [48], we have
[TABLE]
Also, the following equation holds (The proof is given in Appendix A):
[TABLE]
From here, we derive the difference between the maximum eigenvalue of matrix and the maximum eigenvalue of matrix , where the definitions of matrices and have been given in (4) and (4), respectively. From (5), we have . Therefore, the largest singular value of matrix , which is equal to , is in the order of .
By the subadditivity of the operator norm, we have
[TABLE]
Let , , , and , respectively, be the th null blocks of matrices , , , and . We also denote the row and column sizes of the th null block as and , respectively. From the definitions in (4) and (4), we have
[TABLE]
Combining this with (15), (16) and the fact that the Frobenius norm upper bounds the operator norm, we have
[TABLE]
Therefore, since the operator norm of a matrix is not larger than the sum of the operator norms of all of its blocks and the number of blocks are finite constants, we have
[TABLE]
By combining this with (17), we obtain
[TABLE]
Next, we consider the joint probability of the event that the clustering algorithm outputs the correct block structure (i.e., ) and the event that holds. Such a joint probability satisfies the following inequality:
[TABLE]
where is the complement of event . The consistency assumption (v) guarantees that if , converges to [math] in the limit of . By combining this fact with (21), we obtain
[TABLE]
which results in
[TABLE]
By using the above results, we can prove that the following equation holds for all (The proof is given in Appendix B):
[TABLE]
From (5), (25), and Slutsky’s theorem, by setting ,
[TABLE]
This is equivalent to the statement of Theorem 4.1. ∎
Theorem 4.2** (Unrealizable case, lower bound).**
Suppose or .
[TABLE]
where is defined as in (10).
Proof.
Let be a matrix that consists of the estimated block structure and whose entries are the population block-wise means, which can be calculated using (see also Figure 2).
To derive the difference between matrices and , we first focus on the relationship between matrices and . In the unrealizable case (i.e., or ), we can assume without loss of generality.
Let be the number of rows in the th null row cluster. For all the null row cluster indices , at least one estimated row cluster contains or more rows that are assigned to the th row cluster in the null block structure (otherwise, the total number of rows in the th null row cluster is smaller than ). Since , at least one estimated block contains two or more sets of rows whose null row clusters are mutually different, and both of which have the row sizes of at least , where is the minimum row size of a block in the null block structure. By the same reasoning, for all the null column cluster indices , at least one estimated column cluster contains or more columns that are assigned to the th column cluster in the null block structure, where is the number of rows in the th null column cluster. By combining these facts, there exists at least one estimated block that contains two or more submatrices, both of which have the sizes of at least and whose null blocks are mutually different.
Let and be such submatrices, whose null block-wise mean are and , respectively. We can assume without the loss of generality. In matrix , which has the estimated block structure, both of and have the same values . Here, holds if , and otherwise. Therefore, for any , there exists at least one submatrix (which is either or ) with a size of at least , where all the entries are (which is either or ) in matrix and
[TABLE]
Let be the row and column cluster indices of the estimated block which contains the above submatrix . We denote the row and column sizes of the th estimated block as and , respectively. Let , , , and , respectively, be the th estimated block of , , , and . We define . In regard to the difference between matrices and (both of which have the estimated block structure), we have
[TABLE]
where , , and , respectively, are the th null blocks of matrices , , and , and and . To derive the final equation in (4), we used the assumption that and the fact that is equal to the largest singular value of , which is from (5).
Let be the event that holds. For all , , and , the following inequality holds:
[TABLE]
By combining (4) and (30), we obtain
[TABLE]
From now on, we denote the row and column sizes of submatrix , respectively, by and . Let , , , , , and , respectively, be the submatrices of matrices , , , , , and with the same row and column indices as submatrix . We also denote the constant entries of the submatrices of and with the same row and column indices as submatrix , respectively, as and . From the definition (9) and since the operator norm of a submatrix is not larger than that of the original matrix, we have
[TABLE]
First, the order of the estimated standard deviation is given by
[TABLE]
The proof of (33) is in Appendix C.
The only non-zero (and thus, the largest) singular value of matrix is . Since the largest singular value of a matrix is equal to its operator norm, we have
[TABLE]
Therefore, by combining this fact with (4), if the statement of event holds, the following inequality also holds:
[TABLE]
which results in that , where .
Also, from (5), we have . By substituting this fact, (33), and (35) into (4), we finally obtain
[TABLE]
Here, is equal to the maximum eigenvalue of matrix , and the test statistic is . Using the definition (6), we obtain and
[TABLE]
where we used the definitions and .
By combining these results and (36), we obtain
[TABLE]
which concludes the proof. ∎
Theorem 4.3** (Unrealizable case, upper bound).**
Suppose or . Then,
[TABLE]
where is defined as in (10).
Proof.
We define , , and as in Theorem 4.2. Let , , and , respectively, be the th estimated blocks of matrices , , and . We denote the row and column sizes of the th estimated block as and , respectively. Since the operator norm of a matrix is not larger than the sum of the operator norms of all its blocks, we have
[TABLE]
The test statistic is , where . Based on the same discussion as in Theorem 4.2, and (4) hold. Consequently, we obtain , which concludes the proof. ∎
5 Experiments
5.1 Realizable case: convergence of test statistic in law to Tracy-Widom distribution
First of all, we checked the convergence of the proposed test statistic in law to the Tracy-Widom distribution with index , under the realizable setting, which has been stated in Theorem 4.1, by using synthetic data that were generated based on three types of distributions:
- •
Gaussian Latent Block Model: The observed matrices were generated whose entries in the th block follows the normal distribution . In the Gaussian LBM setting, we used the following null model and parameters:
[TABLE]
- •
Bernoulli Latent Block Model The observed matrices were generated whose entries in the th block follows the normal distribution . In the Bernoulli LBM setting, we used the following null model and parameters:
[TABLE]
- •
Poisson Latent Block Model The observed matrices were generated whose entries in the th block follows the normal distribution . In the Poisson LBM setting, we used the following null model and parameters:
[TABLE]
Based on the above Latent Block Model, we randomly generated observed matrices, estimated their block structures based on the Ward’s hierarchical clustering algorithm [49], and computed the test statistic . With respect to the matrix size, we tried the following settings: , . When generating an observed matrix, the null cluster of each row was randomly chosen from the discrete uniform distribution on . Similarly, the null cluster of each column was randomly chosen from the discrete uniform distribution on .
Figures 5, 5, and 5, respectively, show the Q-Q plots of the test statistic and the distribution in the settings of Gaussian, Bernoulli, and Poisson settings. Each plotted point corresponds to a sample of test statistic , and the horizontal and vertical lines, respectively, show its theoretical and sample quantiles. These figures show that the test statistic converged in law to the distribution.
Figure 7 shows the ratios of the trials where , , and for the above three distributional settings, where is the upper quantile of the distribution. We used the approximated values , , and , according to Table in [47]. From Figure 7, we see that the tail probability of the test statistic also converged to those of the distributions for all of the three distributional settings.
We also plotted the results of the Kolmogorov-Smirnov test [9] for the test statistic in Figure 7. We tested whether the distribution of is the distribution or not based on the test statistic , where is the maximum absolute difference between the empirical distribution function of and the cumulative distribution function of the distribution, and is the sample size, which is set at in this experiment. Figure 7 shows the convergence of the proposed test statistic in law to the distribution under the realizable setting.
5.2 Unrealizable case: asymptotic behavior of test statistic
Next, we checked the asymptotic behavior of the proposed test statistic under the unrealizable setting, which has been stated in Theorems 4.2 and 4.3, by using synthetic data that were generated based on the same three types of distributions as in Section 5.1. By combining Theorems 4.2 and 4.3, we obtain the following theorem:
Theorem 5.1** (Unrealizable case, two-sided bound).**
Suppose or . Then,
[TABLE]
In other words, with high probability, the proposed test statistic increases in proportion to in the limit of , since we have assumed that .
With respect to the null models and parameters, we used the same settings as in Section 5.1 for all of the three distributional settings (i.e., Gaussian, Bernoulli, and Poisson LBMs). Based on such settings, we randomly generated observed matrices, estimated their block structures based on the Ward’s hierarchical clustering algorithm [49], and computed the test statistic . With respect to the matrix size, we tried the following settings: , . When generating an observed matrix, the null cluster of each row was randomly chosen from the discrete uniform distribution on . Similarly, the null cluster of each column was randomly chosen from the discrete uniform distribution on .
Figures 9 and 9 show the asymptotic behavior of the proposed test statistic under the unrealizable setting. As shown in Theorem 5.1, we see that increases in proportion to , where .
5.3 Accuracy of the proposed goodness-of-fit test
Finally, we evaluated the proposed goodness-of-fit test in terms of its accuracy. By using synthetic data that were generated based on the same three types of distributions as in Section 5.1, we checked the ratio of trials where the selected set of cluster numbers is equal to the null one . Here, we set the null set of cluster numbers at . For each distributional setting (i.e., Gaussian, Bernoulli, and Poisson LBMs), we tried settings with respect to the block-wise mean . The concrete settings were as follows:
- •
Gaussian Latent Block Model: We used the following parameters:
[TABLE]
- •
Bernoulli Latent Block Model We used the following parameters:
[TABLE]
- •
Poisson Latent Block Model We used the following parameters:
[TABLE]
With respect to the matrix size, we tried the following settings for each distributional setting and for each setting of : , . When generating an observed matrix, the null cluster of each row was randomly chosen from the discrete uniform distribution on . Similarly, the null cluster of each column was randomly chosen from the discrete uniform distribution on . In each of (Gaussian, Bernoulli, or Poisson LBM) (for the setting of ) (for the setting of matrix size) settings, we generated observed matrices and applied the proposed sequential goodness-of-fit test, until the null hypothesis was accepted. For each observed matrix, we estimated their block structures based on the Ward’s hierarchical clustering algorithm [49] under each setting of a hypothetical set of cluster numbers , computed the test statistic , and performed the proposed test for the given cluster numbers using a significance level of . Figures 12, 12, 12, respectively, show the examples of generated observed matrices of Gaussian, Bernoulli, and Poisson LBMs.
Figure 13 shows the accuracy of the proposed test under different settings of block-wise mean . From Figure 13, we see that the test accuracy increases with matrix size for a fixed block-wise mean , and that it decreases with the smaller differences between the block-wise means for a fixed matrix size .
Comparison to the integrated completed likelihood (ICL)
We also checked the difference in the behavior of the proposed test and the ICL. For the Bernoulli LBM, we can compute the asymptotic ICL [26] by assuming the following model:
[TABLE]
where represents a probability density, and and are the hyperparameters.
From Lemma 4.2 in [26], for an estimated block structure , the resulting asymptotic ICL is given by
[TABLE]
The proof of (5.3) is given in Appendix D.
To check the accuracy of the proposed test and the ICL, we generated synthetic binary data matrices based on the Bernoulli distribution as in Section 5.3, and checked the ratio of trials where the selected set of cluster numbers is equal to the null one . We set the null set of cluster numbers at , and tried the following five settings with respect to the block-wise mean .
[TABLE]
With respect to the matrix size, we tried the following five settings for each setting of : , . The null block of each element was chosen in the same way as in Section 5.3. In each of (for the setting of ) (for the setting of matrix size) settings, we generated observed matrices, and applied the proposed test using a significance level of and the model selection based on the ICL. Unlike the proposed sequential test, which stopped if the null hypothesis was accepted, the ICL was computed for all the sets of cluster numbers from to and then the optimal setting was selected that achieved the maximum ICL. For each setting, we estimated the block structure of an observed matrix based on the Ward’s hierarchical clustering algorithm [49].
Figure 14 shows the accuracy of the proposed test and the model selection based on the ICL. Although the purpose of the proposed test is not to achieve high accuracy in model selection, in some cases with small differences between the block-wise means , it achieved better performance than the ICL. With larger difference between , the ICL performed better than the proposed test in terms of model selection. Figures 15 and 16, respectively, show the ratios of the trials where each set of cluster numbers was selected by the proposed test and the ICL. From Figures 15 and 16, we see that in most cases (e.g., for all ), the ICL tended to select smaller sets of cluster numbers than the proposed test.
5.4 Real data analysis: Congressional Voting Records Data Set
We also checked the result when we applied the proposed test to 1984 United States Congressional Voting Records Database from UCI Machine Learning Repository [14]. The original data set contains three types of votes (“yea,” “nay,” and unknown) for the pairs of a congressman and an attribute. We treated unknown as “nay,” as in [51]. The number of instances or congressmen and that of attributes are and , respectively. Based on this data set, we defined a binary matrix , where the elements of one and zero, respectively, correspond to “yea” and “nay.”
As in Section 5.3, we applied the proposed sequential tests using a significance level of , until the null hypothesis was accepted. We also computed the ICL for each setting of a hypothetical set of cluster numbers , and selected one with the largest ICL. For each setting of a hypothetical set of cluster numbers , we estimated the block structure based on the Ward’s hierarchical clustering algorithm [49].
As a result, the sets of cluster numbers and were selected by the proposed test and the ICL, respectively. Figure 17 shows the observed data matrix and its estimated block structures with the selected sets of cluster numbers. From Figure 17, we see that a finer block structure was accepted by the proposed test than the ICL, particularly for the row (i.e., congressman) cluster assignments. As for the column (i.e., attribute) cluster assignments, “anti-satellite-test-ban,” “aid-to-nicaraguan-contras,” and “mx-missile” were assigned into the same cluster in the selected block structure of the ICL, whereas the proposed test distinguished the first two attributes from the last one. Figure 18 shows the -value of the proposed test and the ICL for each setting of a hypothetical set of cluster numbers until the null hypothesis was accepted.
6 Discussion
In this section, we discuss the proposed test method in terms of the test statistic and the conditions for the generative model.
With respect to the asymptotic behavior, the proposed test has a favorable property in terms of the power. From Theorem 5.1, under the alternative hypothesis, the test statistic increases in proportion to with high probability, where . In other words, the probability that the test makes a type II error (i.e., ) converges to zero in the limit of . Based on this fact, in the asymptotic sense, we do not need to consider the correction for the multiple comparison when applying the proposed sequential testing. However, it has not been shown what occurs in the non-asymptotic setting. In general, practical data matrices have finite sizes, where there has been shown no theoretical guarantee like Theorems 4.1, 4.2, and 4.3. On the other hand, for a Gaussian case (i.e., each entry of a matrix independently follows ), the following statement holds [32]: Suppose and in the limit of . Then, for any , there exists such that when and is even, for all ,
[TABLE]
where is defined as in (5) and is a continuous and non-increasing function. From the above inequality (51), if the clustering algorithm outputs the correct block assignments, the convergence rate of the normalized maximum eigenvalue of matrix (where is defined as in (4)) to the Tracy-Widom distribution with index is . However, since the distribution of is unknown in the case where the correct block assignment is not obtained, the convergence rate of is also unknown. Deriving the convergence rate of by considering the above discussion is a future research topic.
In regard to the conditions for using the proposed test method, our proposed test is applicable to a wide range of practical distributional settings (e.g., Bernoulli distribution for binary data matrices and Poisson distribution for sparse ones). Nevertheless, it still requires some assumptions for the latent block structure of an observed matrix. For instance, the row and column cluster numbers should be constants that do not increase with the matrix sizes and . Also, there should be no too small block (i.e., and ). In some practical cases, where it is more appropriate to assume that the number of blocks increases with the matrix size, it will be useful to construct a test which does not require the above conditions. As for the sub-exponential condition, Ding and Yang [13] have shown more relaxed sufficient condition for the scaled maximum eigenvalue to converge in law to distribution. However, the delocalization property of an eigenvector of matrix [6], which we used in Appendix B to prove our main result, has not been derived in the form as in the sub-exponential case [6]. If (74) in Appendix B is shown in the above more general case, it would also be possible to extend our proposed test to such a case. Furthermore, there are proposed variants of latent block models with which we assume different block structures from a regular grid [43, 35]. To construct test methods for the above settings is an important topic for future research.
7 Conclusion
Latent block models are effective tools for biclustering, where rows and columns of an observed matrix are simultaneously decomposed into clusters. Such a bicluster structure appears in various types of relational data, such as the customer-product transaction data or and the document-word relationship data. One open problem in using latent block models is that there has been no statistical test method for determining the number of blocks. In this study, we developed a goodness-of-fit test for latent block models based on a result from the random matrix theory. By defining the test statistic based on the estimators of the block-wise means and standard deviations, we have derived its asymptotic behavior in both realizable (i.e., ) and unrealizable (i.e., or ) cases. Particularly, it has been shown that the test statistic converges in law to Tracy-Widom distribution with index in the realizable case. Based on these results, it was made possible to test whether the given observed matrix had latent blocks or more ones. In the experiments, we showed the validity of the proposed test method in terms of both the asymptotic behavior of the test statistic and the test accuracy by using synthetic data matrices with ground truth block structures.
Acknowledgments
TS was partially supported by JSPS KAKENHI (18K19793, 18H03201, and 20H00576), Japan Digital Design, and JST CREST.
Appendix A Proof of .
Let and , respectively, be the row and column sizes of the th null block, and , and , respectively, be the th null blocks of matrices , and . Here, we prove the following lemma:
Lemma A1**.**
Under the assumption that the fourth moment of the noise is bounded (),
[TABLE]
where .
Proof.
From the above definition of , we have
[TABLE]
To derive the second equation, we used the fact that . Therefore, the following inequality holds:
[TABLE]
The first term in (54) is given by
[TABLE]
where we defined that . Note that is independent. The expectation and the variance of are given by
[TABLE]
From (A), we have
[TABLE]
From (A), (A), and Chebyshev’s inequality, for all ,
[TABLE]
Therefore, we have
[TABLE]
On the other hand, the second term in (54) is given by
[TABLE]
From (A), we have
[TABLE]
where . Here, , and . By substituting this into (61), we obtain
[TABLE]
From Markov’s inequality and (62),
[TABLE]
Therefore, we have
[TABLE]
By combining (54), (59), and (64),
[TABLE]
The difference between and is given by
[TABLE]
Here, from (65), is bounded in probability. Therefore, converges in probability to . By combining this fact with (65) and (66), we finally obtain
[TABLE]
which concludes the proof. ∎
Appendix B Proof of in realizable case
We first derive the relationship between the maximum eigenvalues of matrices and in Lemma B1 and B2.
Lemma B1**.**
Let and , respectively, be the maximum eigenvalues of matrices and (i.e., and , respectively). Then, for all , the following equation holds:
[TABLE]
Proof.
Let and , respectively, be the normalized eigenvectors of and , corresponding to the maximum eigenvalues and :
[TABLE]
Since is the largest singular value of matrix , we have
[TABLE]
We also define the following matrix for each th block:
[TABLE]
where and , respectively, are the row and column sizes of the th null block.
Let , , and , respectively be matrices whose th null blocks are , and and whose all the other entries are zero. As shown in Figure 19, we define matrix as .
From (70), we have
[TABLE]
where is the maximum eigenvalue of matrix .
From now on, we prove that and . We use the following notations:
[TABLE]
where is the th entry of vector . Note that the th block of matrix , the th block of matrix , and the th block of vector are given by , , and , respectively. Let be a vector whose entries in the th column cluster are and whose all the other entries are zero. Here, from Theorem 2.17 in [6], each th eigenvector of matrix has a delocalization property, that is, for any constant vector that satisfies ,
[TABLE]
From Theorem 2.20 in [6], (74) holds uniformly in and .
Note that this theorem holds in our case, where we assume that and that each entry of is independently generated from a distribution with zero mean and unit variance that satisfies the sub-exponential condition. Therefore, we have for all . Since and , , for all by definition, the following equation holds:
[TABLE]
Similarly, th block of matrix is , and its all the other entries are zero, which results in that . Therefore, we have
[TABLE]
Moreover, from Lemma A1, we have
[TABLE]
By substituting (B), (B), and (77) into (B) and by setting , we have
[TABLE]
which concludes the proof. ∎
Lemma B2**.**
Let and , respectively, be the maximum eigenvalues of matrices and . Then, the following equation holds:
[TABLE]
Proof.
We use the same notations as in Lemmas B1. Let and , respectively, be the sets of the eigenvalues and the corresponding normalized eigenvectors (i.e., for all ) of matrix , where and . We also define that . Note that from (77). Let be a subvector of in the th column cluster.
Since is a symmetric matrix, its eigenvectors form an orthonormal system, and thus there exists a unique set of coefficients that satisfies
[TABLE]
where
[TABLE]
Therefore, the following equation holds:
[TABLE]
As for the last term in (B), the following equation holds:
[TABLE]
where is a vector whose elements in the th column cluster is and whose all the other elements are zero. In the last equation in (B), we used the delocalization property of , which are eigenvectors of matrix . By substituting (84) into (B) and using the fact that and the assumption that and are fixed constants, we have
[TABLE]
Here, by definition in (B), the following equation holds:
[TABLE]
The third term in (B) can be upper bounded as follows:
[TABLE]
where is a vector whose elements in the th row cluster is and whose all the other elements are zero. Here we used the delocalization property of , which are eigenvectors of matrix .
The fourth term in (B) can also be upper bounded as follows:
[TABLE]
By substituting (B) and (B) into (B), we have
[TABLE]
In the last equation of (B), we used the assumption that and are fixed constants.
Let be a normalized eigenvalue of matrix . Note that in (B) is the number of normalized eigenvalues that satisfy . We also define the following variables:
[TABLE]
From (4.1) of [39], holds for some constant , where . Since holds for any , we have for any .
Since follows the Marcenko–Pastur distribution, whose probability density function is given by
[TABLE]
by setting , we have
[TABLE]
From (B) and the fact that for any , by setting , the following equation holds:
[TABLE]
From (3.7) of [39], the difference between and is given by
[TABLE]
Therefore, by setting , we have
[TABLE]
By assumption in (B) that , the following equation holds for all :
[TABLE]
Here, we consider the following two patterns: (a) If , from (B), we have
[TABLE]
(b) If , we have and thus
[TABLE]
By assumption in (B) that , we have and thus (97) holds.
In summary, (97) always holds. By combining this fact and (85), we have
[TABLE]
By setting , we finally obtain
[TABLE]
which concludes the proof. ∎
Lemma B3**.**
Let and , respectively, be the maximum eigenvalues of matrices and (i.e., and , respectively). Then, for all ,
[TABLE]
where is defined as in (6).
Proof.
From Lemma B1 and B2, we have already shown that the following equation holds for all :
[TABLE]
We consider the joint probability of the event that the clustering algorithm outputs the correct block structure (i.e., ) and the event that holds. Such a joint probability satisfies the following inequality:
[TABLE]
where is the complement of event . The consistency assumption (v) guarantees that if , converges to [math] in the limit of . By combining this fact with (102), we obtain
[TABLE]
which results in (101). ∎
Appendix C Proof of in unrealizable case
Proof.
Throughout the proof, we use the following notations:
- •
, , and , respectively, are the th null blocks of matrices , , and .
- •
, , and , respectively, are the th estimated blocks of matrices , , and .
- •
We denote the row and column sizes of the th estimated block as and , respectively.
- •
is the set of row and column cluster indices of submatrix in the estimated block structure.
As for the order of the estimated standard deviation , we have . Note that the block size of submatrix is at least . Therefore, we have
[TABLE]
Here, for all , independently follows the same distribution, and . We also have , since we have assumed that from the sub-exponential assumption. Therefore, from the central limit theorem and Prokhorov’s theorem [48], we have . In other words, the following equation holds: . Based on this result, we obtain
[TABLE]
Furthermore, we have
[TABLE]
Here, to derive the last inequality in (C), we used the assumption that (4) holds for the block with the maximum difference between and . In the final equation, we used the fact that is bounded by a finite constant.
By combining (C), (C), and (C), we obtain . ∎
Appendix D Proof of the asymptotic ICL in the Bernoulli case
Proof.
From Lemma 4.2 in [26], the resulting asymptotic ICL is given by
[TABLE]
In regard to the first term in (D), we consider the following optimization problem:
[TABLE]
The above problem is solved with the Lagrangian undetermined multiplier method, which employs
[TABLE]
By substituting (D) into (111), we have
[TABLE]
for all . In regard to and , from the conditions in (D), and hold and thus we finally have
[TABLE]
We can easily check that the solutions of (D) and (113) satisfy all the conditions in (D).
Finally, by substituting the above results into (D), we have
[TABLE]
Note that we have defined as in (4). ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] B. P. W. Ames. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming , 147(1):429–465, 2014.
- 2[2] P. Arabie, S. A. Boorman, and P. R. Levitt. Constructing blockmodels: How and why. Journal of Mathematical Psychology , 17(1):21–63, 1978.
- 3[3] Z. D. Bai and Y. Q. Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability , 21(3):1275–1294, 1993.
- 4[4] Z. Bao, G. Pan, and W. Zhou. Universality for the largest eigenvalue of sample covariance matrices with general population. The Annals of Statistics , 43(1):382–421, 2015.
- 5[5] P. J. Bickel and P. Sarkar. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 78(1):253–273, 2016.
- 6[6] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin. On the principal components of sample covariance matrices. Probability Theory and Related Fields , 164:459–552, 2016.
- 7[7] V. Brault and A. Channarond. Fast and consistent algorithm for the latent block model. ar Xiv:1610.09005, 2016.
- 8[8] K. Chen and J. Lei. Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association , 113(521):241–251, 2018.
