Goodness-of-fit Test for Latent Block Models

Chihiro Watanabe; Taiji Suzuki

arXiv:1906.03886·stat.ML·September 18, 2020


TL;DR

This paper introduces a new statistical goodness-of-fit test for latent block models, enabling validation of cluster numbers in relational data matrices, extending prior methods limited to symmetric stochastic block models.

Contribution

The study develops the first goodness-of-fit test for latent block models, applicable to non-symmetric matrices with separate row and column clusters, using random matrix theory.

Findings

1. Test statistic exhibits expected asymptotic behavior

2. Method accurately determines the correct number of clusters

3. Effective in various simulated data scenarios

Abstract

Latent block models are used for probabilistic biclustering, which is shown to be an effective method for analyzing various relational data sets. However, there has been no statistical test method for determining the row and column cluster numbers of latent block models. Recent studies have constructed statistical-test-based methods for stochastic block models, which assume that the observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns. In this study, we developed a new goodness-of-fit test for latent block models to test whether an observed data matrix fits a given set of row and column cluster numbers, or it consists of more clusters in at least one direction of the row and the column. To construct the test method, we used a result from the random matrix theory for a sample covariance matrix. We experimentally demonstrated the…

Equations

$$P=(P_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad P_{ij}=B_{g^{(1)}_{i}g^{(2)}_{j}}.$$

$$\sigma=(\sigma_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\sigma_{ij}=S_{g^{(1)}_{i}g^{(2)}_{j}}.$$

$$A=(A_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\mathbb{E}[A_{ij}]=P_{ij},\quad\mathbb{E}[(A_{ij}-P_{ij})^{2}]=\sigma_{ij}^{2},$$

$$\mathrm{(N)}:(K,H)=(K_{0},H_{0}),\quad\mathrm{(A)}:K>K_{0}\ \text{or}\ H>H_{0}.$$

$$X=\Omega_{p}(f(m))\iff\forall\epsilon>0,\ \exists C>0,M>0,\ \forall m\geq M,\ \mathrm{Pr}\left(Cf(m)\leq X\right)\geq 1-\epsilon.$$

$$Z=(Z_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad Z_{ij}=\frac{A_{ij}-P_{ij}}{\sigma_{ij}}.$$

$$T^{*}=\frac{\lambda_{1}-a}{b},\quad T^{*}\rightsquigarrow TW_{1}\ \text{(Convergence in law)},$$

$$a=(\sqrt{n}+\sqrt{p})^{2},\quad b=(\sqrt{n}+\sqrt{p})\left(\frac{1}{\sqrt{n}}+\frac{1}{\sqrt{p}}\right)^{\frac{1}{3}}.$$

$$\hat{B}=(\hat{B}_{kh})_{1\leq k\leq K_{0},1\leq h\leq H_{0}},\quad\hat{B}_{kh}=\frac{1}{|I_{k}||J_{h}|}\sum_{i\in I_{k},j\in J_{h}}A_{ij},$$

$$\hat{P}=(\hat{P}_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\hat{P}_{ij}=\hat{B}_{\hat{g}^{(1)}_{i}\hat{g}^{(2)}_{j}},$$

$$\hat{S}=(\hat{S}_{kh})_{1\leq k\leq K_{0},1\leq h\leq H_{0}},\quad\hat{S}_{kh}=\sqrt{\frac{1}{|I_{k}||J_{h}|}\sum_{i\in I_{k},j\in J_{h}}(A_{ij}-\hat{P}_{ij})^{2}},$$

$$\hat{\sigma}=(\hat{\sigma}_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\hat{\sigma}_{ij}=\hat{S}_{\hat{g}^{(1)}_{i}\hat{g}^{(2)}_{j}},$$

$$I_{k}=\{i:\hat{g}^{(1)}_{i}=k\},\quad J_{h}=\{j:\hat{g}^{(2)}_{j}=h\}.$$

$$\hat{Z}=(\hat{Z}_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\hat{Z}_{ij}=\frac{A_{ij}-\hat{P}_{ij}}{\hat{\sigma}_{ij}}.$$

$$T=\frac{\hat{\lambda}_{1}-a}{b},$$

$$\text{Reject null hypothesis}\ \left((K,H)=(K_{0},H_{0})\right),\ \text{if}\ T\geq t(\alpha),$$

$$\tilde{P}=(\tilde{P}_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\tilde{P}_{ij}=\tilde{B}_{g^{(1)}_{i}g^{(2)}_{j}},$$

$$\tilde{\sigma}=(\tilde{\sigma}_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\tilde{\sigma}_{ij}=\tilde{S}_{g^{(1)}_{i}g^{(2)}_{j}},$$

$$\tilde{Z}=(\tilde{Z}_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\tilde{Z}_{ij}=\frac{A_{ij}-\tilde{P}_{ij}}{\tilde{\sigma}_{ij}}.$$

$$T\rightsquigarrow TW_{1}\ \text{(Convergence in law)},$$

$$\|A\|_{\mathrm{op}}=\sup_{u\in\mathbb{R}^{p}}\frac{\|Au\|}{\|u\|}.$$

$$\tilde{B}_{kh}=B_{kh}+O_{p}\left(\frac{1}{m}\right).$$

$$\tilde{S}_{kh}=S_{kh}+O_{p}\left(\frac{1}{m}\right).$$

$$\left|\|Z\|_{\mathrm{op}}-\|\tilde{Z}\|_{\mathrm{op}}\right|\leq\|Z-\tilde{Z}\|_{\mathrm{op}}.$$

$$Z^{(k,h)}=\frac{A^{(k,h)}-P^{(k,h)}}{S_{kh}},\quad\tilde{Z}^{(k,h)}=\frac{A^{(k,h)}-\tilde{P}^{(k,h)}}{\tilde{S}_{kh}}.$$

$$\begin{aligned}
\|Z^{(k,h)}-\tilde{Z}^{(k,h)}\|_{\mathrm{op}}&=\left\|\frac{A^{(k,h)}-P^{(k,h)}}{S_{kh}}-\frac{A^{(k,h)}-\tilde{P}^{(k,h)}}{\tilde{S}_{kh}}\right\|_{\mathrm{op}}\\
&=\left\|\frac{A^{(k,h)}-P^{(k,h)}}{S_{kh}}-\frac{A^{(k,h)}-P^{(k,h)}}{\tilde{S}_{kh}}+\frac{A^{(k,h)}-P^{(k,h)}}{\tilde{S}_{kh}}-\frac{A^{(k,h)}-\tilde{P}^{(k,h)}}{\tilde{S}_{kh}}\right\|_{\mathrm{op}}\\
&\leq\left\|\frac{A^{(k,h)}-P^{(k,h)}}{S_{kh}}-\frac{A^{(k,h)}-P^{(k,h)}}{\tilde{S}_{kh}}\right\|_{\mathrm{op}}+\left\|\frac{A^{(k,h)}-P^{(k,h)}}{\tilde{S}_{kh}}-\frac{A^{(k,h)}-\tilde{P}^{(k,h)}}{\tilde{S}_{kh}}\right\|_{\mathrm{op}}\\
&=\frac{|\tilde{S}_{kh}-S_{kh}|}{S_{kh}\tilde{S}_{kh}}\|A^{(k,h)}-P^{(k,h)}\|_{\mathrm{op}}+\frac{1}{\tilde{S}_{kh}}\|P^{(k,h)}-\tilde{P}^{(k,h)}\|_{\mathrm{op}}\\
&\leq\frac{|\tilde{S}_{kh}-S_{kh}|}{S_{kh}\tilde{S}_{kh}}\|A^{(k,h)}-P^{(k,h)}\|_{\mathrm{op}}+\frac{1}{\tilde{S}_{kh}}\|P^{(k,h)}-\tilde{P}^{(k,h)}\|_{\mathrm{F}}\\
&=\frac{|\tilde{S}_{kh}-S_{kh}|}{\tilde{S}_{kh}}\|Z^{(k,h)}\|_{\mathrm{op}}+\frac{1}{\tilde{S}_{kh}}\sqrt{n_{k}p_{h}}\left|B_{kh}-\tilde{B}_{kh}\right|\\
&=\frac{O_{p}(1/m)}{S_{kh}+O_{p}(1/m)}\|Z^{(k,h)}\|_{\mathrm{op}}+\frac{O_{p}(1/m)}{S_{kh}+O_{p}(1/m)}\sqrt{n_{k}p_{h}}\quad(\because(15),(16))\\
&=\frac{O_{p}(1/m)}{S_{kh}+O_{p}(1/m)}O_{p}(\sqrt{m})+\frac{O_{p}(1/m)}{S_{kh}+O_{p}(1/m)}\sqrt{n_{k}p_{h}}\quad(\because(5))\\
&=O_{p}\left(\frac{1}{\sqrt{m}}\right)+O_{p}(1)=O_{p}(1).
\end{aligned}$$

$$\|Z-\tilde{Z}\|_{\mathrm{op}}$$

$$\left|\|Z\|_{\mathrm{op}}-\|\tilde{Z}\|_{\mathrm{op}}\right|=O_{p}(1).$$

$$\mathrm{Pr}\left(\mathcal{F}_{m}\cap\mathcal{G}_{m,C}\right)\geq 1-\mathrm{Pr}\left(\mathcal{F}^{\mathrm{C}}_{m}\right)-\mathrm{Pr}\left(\mathcal{G}^{\mathrm{C}}_{m,C}\right),$$


Taxonomy

Topics: Random Matrices and Applications · Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference

Full text

Goodness-of-fit Test for Latent Block Models

Chihiro Watanabe, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan

Taiji Suzuki, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan; Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan

Abstract

Latent block models are used for probabilistic biclustering, which has been shown to be an effective method for analyzing various relational data sets. However, there has been no statistical test method for determining the row and column cluster numbers of latent block models. Recent studies have constructed statistical-test-based methods for stochastic block models, which assume that the observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns. In this study, we developed a new goodness-of-fit test for latent block models to test whether an observed data matrix fits a given set of row and column cluster numbers, or whether it consists of more clusters in at least one of the row and column directions. To construct the test method, we used a result from random matrix theory for a sample covariance matrix. We experimentally demonstrated the effectiveness of the proposed method by showing the asymptotic behavior of the test statistic and measuring the test accuracy.

1 Introduction

Block modeling [18, 2] is known to be effective in representing various relational data sets, such as the data sets of movie ratings [44], customer-product transactions [44], congressional voting [27], document-word relationships [12], and gene expressions [40]. Latent block models or LBMs [17] are used for probabilistic biclustering of such relational data matrices, where rows and columns represent different objects. For instance, suppose that a matrix $A=(A_{ij})_{1\leq i\leq n,1\leq j\leq p}\in\mathbb{R}^{n\times p}$ represents the relationship between users and movies, where entry $A_{ij}$ is the rating of the $j$ th movie by the $i$ th user. In LBMs, we assume a regular-grid block structure behind the observed matrix $A$ ; i.e., both rows (users) and columns (movies) of matrix $A$ are simultaneously decomposed into latent clusters. A block is defined as a combination of row and column clusters, and entries of the same block in matrix $A$ are supposed to be i.i.d. random variables.

An open problem in using LBMs is that there has been no statistical criterion for determining the numbers of row and column clusters. Recently, statistical-test-based approaches [5, 29, 21] have been proposed for estimating the cluster number of stochastic block models (SBMs) [19]. SBMs are similar to LBMs in the sense that they assume a block structure behind an observed matrix; however, unlike LBMs, they assume that the observed matrix is square and symmetric and that the cluster assignments are the same for rows and columns [33]. For the LBM setting, no statistical method has been constructed to determine the row and column cluster numbers.

Aside from the test-based methods, several model selection approaches have been proposed based on cross-validation [8] or an information criterion [26, 38, 27]. However, these approaches have several limitations. (1) First, they cannot provide knowledge about the reliability of the result besides the finally estimated cluster numbers. Rather than minimizing the generalization error, in some cases, it is more appropriate to provide a probabilistic guarantee in reliability for the purpose of knowledge discovery. (2) Second, both the cross-validation-based and information-criterion-based methods depend on the clustering algorithm used. For instance, we can employ the Bayesian information criterion (BIC) for estimating the marginal likelihood only if the Fisher information matrix of the model is regular, which is not the case for block models. Constructing an information criterion that estimates the expectation of the generalization error for a wider class of models is generally difficult. (3) Finally, the above methods require relatively large computational complexity. Computation of an information criterion requires the process of approximating the posterior distribution by the Markov chain Monte Carlo (MCMC) method, and cross-validation requires the iterative calculation of the test error with different sets of partitions of the training and test data sets.

In this study, we proposed a new statistical test method for LBMs. To construct a hypothesis test with a theoretical guarantee, we used a result from random matrix theory. Recent studies on random matrix theory have revealed the asymptotic behavior of singular values of an $n\times p$ random matrix [16, 45, 52, 3, 22, 23, 46, 37, 39, 4, 13]. Here, we assume that each entry $Z_{ij}$ of matrix $Z$ , which is given by $Z_{ij}=(A_{ij}-P_{ij})/\sigma_{ij}$ (which is computed by the original matrix $A$ , its block-wise mean $P$ and standard deviation $\sigma$ ) follows a distribution with a sub-exponential decay. From the result in [39], the normalized maximum eigenvalue of $Z^{\top}Z$ converges in law to the Tracy-Widom distribution with index $1$ , under the above sub-exponential condition. Based on this result, we constructed a goodness-of-fit test for a given set of row and column cluster numbers of an LBM, using the maximum singular value of matrix $\hat{Z}$ , which is an estimator of the matrix $Z$ . We proved that under the null hypothesis (i.e., observed matrix $A$ consists of a given set of row and column cluster numbers), the proposed test statistic $T$ converges in law to the Tracy-Widom distribution with index $1$ (Theorem 4.1). We also showed that under the alternative hypothesis, test statistic $T$ increases in proportion to $m^{\frac{5}{3}}$ with a high probability, where $m$ is a number proportional to the matrix size (Theorems 4.2 and 4.3).

The proposed method addresses the limitations of the other model selection approaches. (1) Our statistical test method enables us to obtain knowledge about the reliability of the test results. When testing a given set of row and column cluster numbers, we can explicitly set the probability of Type I error (or false positive) as a significance level $\alpha$. (2) Unlike the other model selection methods, the proposed method does not depend on the clustering algorithm as long as it satisfies the consistency condition (Section 2). It only uses the output of a clustering algorithm to test a given set of cluster numbers; there is no need to modify the test method according to the clustering algorithm. (3) The proposed test method has a relatively low computational cost. It does not require the MCMC procedure or partitioning into the training and test data sets. For these reasons, the proposed test-based method can be widely used for the purpose of knowledge discovery.

The following sections explain the proposed test method for LBMs in detail. In Section 2, we describe the proposed goodness-of-fit test and its theoretical guarantee with the assumptions required for the problem setting. Next, we briefly review the related works and their differences from the proposed method in Section 3. The main results are presented in Section 4, where we prove the asymptotic properties of the proposed test statistic. In Section 5, we experimentally demonstrate the effectiveness of the proposed test method by showing the asymptotic behavior of the test statistic and calculating the test accuracy. We discuss the results and limitations of the proposed method in Section 6 and conclude the paper in Section 7.

2 Problem setting and statistical model for goodness-of-fit test for latent block models

Let $A\in\mathbb{R}^{n\times p}$ be an $n\times p$ observed matrix. We assume that each entry of matrix $A$ is independently generated, given its row and column clusters. Let $(K,H)$ be the null set of cluster numbers for rows and columns of an observed matrix $A$ , which is unknown in advance. We denote the cluster indices of the $i$ th row and the $j$ th column of matrix $A$ as $g^{(1)}_{i}\in\{1,\dots,K\}$ and $g^{(2)}_{j}\in\{1,\dots,H\}$ , respectively. We assume that each entry of matrix $A$ is independently subject to a distribution with block-wise mean $P$ and block-wise standard deviation $\sigma$ :

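$$\begin{aligned}
P&=(P_{ij})_{1\leq i\leq n,1\leq j\leq p},&P_{ij}&=B_{g^{(1)}_{i}g^{(2)}_{j}},\\
\sigma&=(\sigma_{ij})_{1\leq i\leq n,1\leq j\leq p},&\sigma_{ij}&=S_{g^{(1)}_{i}g^{(2)}_{j}},\\
A&=(A_{ij})_{1\leq i\leq n,1\leq j\leq p},&\mathbb{E}[A_{ij}]&=P_{ij},\quad\mathbb{E}[(A_{ij}-P_{ij})^{2}]=\sigma_{ij}^{2},
\end{aligned}$$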

where $B_{kh}$ and $S_{kh}>0$ , respectively, are the mean and the positive standard deviation of entries in the $(k,h)$ th null block under the null hypothesis.
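As an illustration of this generative model, the following minimal sketch draws an observed matrix from given cluster assignments and block parameters. The Gaussian entry distribution is an assumption made only for this example (the model above fixes the block-wise mean and standard deviation, not the distribution family), and the function name `sample_lbm` and the 0-based labels are hypothetical.

```python
import random

def sample_lbm(row_labels, col_labels, B, S, rng):
    """Draw A with independent entries of mean B[k][h] and std S[k][h].

    row_labels[i] and col_labels[j] are 0-based cluster indices.
    Gaussian entries are an assumption of this sketch, not of the model.
    """
    return [
        [rng.gauss(B[k][h], S[k][h]) for h in col_labels]
        for k in row_labels
    ]

rng = random.Random(0)
row_labels = [0] * 3 + [1] * 2          # n = 5 rows, K = 2 row clusters
col_labels = [0] * 2 + [1] * 2 + [2]    # p = 5 columns, H = 3 column clusters
B = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]  # block-wise means
S = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]  # block-wise standard deviations (> 0)
A = sample_lbm(row_labels, col_labels, B, S, rng)
```

Entries that share both a row cluster and a column cluster are i.i.d. draws, which is the regular-grid block structure assumed by the model.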

In this paper, we propose a goodness-of-fit test for selecting the cluster numbers $(K,H)$ from observed matrix $A$ . In such a test, we test whether $(K,H)$ is equal to a given set of cluster numbers $(K_{0},H_{0})$ or at least one of the given row and column cluster numbers $K_{0}$ or $H_{0}$ is smaller than the null cluster numbers $K$ or $H$ . In other words, the null (N) and alternative (A) hypotheses are given by

$$\mathrm{(N)}:(K,H)=(K_{0},H_{0}),\quad\mathrm{(A)}:K>K_{0}\ \text{or}\ H>H_{0}.$$

By sequentially testing the cluster numbers in the following order (Figure 1), we can select the cluster numbers of a given observed matrix $A$ .

1. Test $(K_{0},H_{0})=(1,1)$.

2. Test $(K_{0},H_{0})=(1,2),(2,1)$.

3. Test $(K_{0},H_{0})=(1,3),(2,2),(3,1)$.

4. $\cdots$

5. Test $(K_{0},H_{0})=(1,L),(2,L-1),\dots,(L,1)$. Let $(\hat{K},\hat{H})$ be the row and column cluster numbers where the null hypothesis is accepted and $\hat{K}+\hat{H}=L+1$ holds. The selected set of cluster numbers is $(\hat{K},\hat{H})$.

Based on the above sequentially ordered test, selection of the cluster numbers requires $(K+H)(K+H-1)/2$ tests at most.
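The test order and the bound on the number of tests can be checked with a short sketch. The helper names `candidates_at_level` and `max_number_of_tests` are hypothetical; the only assumption is that step $L$ tests every pair with $K_{0}+H_{0}=L+1$, as in the list above.

```python
def candidates_at_level(level):
    """Candidate pairs (K0, H0) tested at a given step: K0 + H0 = level + 1."""
    return [(k0, level + 1 - k0) for k0 in range(1, level + 1)]

def max_number_of_tests(K, H):
    """Count the tests run up to and including the step that contains (K, H)."""
    last_level = K + H - 1
    return sum(len(candidates_at_level(level)) for level in range(1, last_level + 1))

print(candidates_at_level(3))     # [(1, 3), (2, 2), (3, 1)]
print(max_number_of_tests(2, 3))  # 10, i.e. (K + H)(K + H - 1) / 2 with K + H = 5
```

Step $L$ contains $L$ candidates, so summing $1+2+\dots+(K+H-1)$ gives the stated bound.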

Assumptions.

Throughout this paper, we make the following assumptions to derive the test statistics:

(i) We assume that a distribution of $Z_{ij}$, which is given by $Z_{ij}=(A_{ij}-P_{ij})/\sigma_{ij}$ as in (4) later, has a sub-exponential decay. That is, there exists some $\vartheta>0$ such that for $x>1$, $\mathrm{Pr}\left(\left|Z_{ij}\right|>x\right)\leq\vartheta^{-1}\exp(-x^{\vartheta})$. From this assumption, note that for any $n^{\mathrm{M}}\in\mathbb{N}$, the $n^{\mathrm{M}}$th moment of a random variable $Z_{ij}$ is finite (i.e., $\mathbb{E}[Z_{ij}^{n^{\mathrm{M}}}]<\infty$).

(ii) We denote the number of rows and columns of matrix $A$ as $n$ and $p$, respectively. We assume that both $n$ and $p$ increase in proportion to some sufficiently large number $m$ (i.e., $n,p\propto m$).

(iii) Let $K$ and $H$, respectively, be the minimum row and column cluster numbers to represent the block structure of observed matrix $A$ under the null hypothesis. We assume that both $K$ and $H$ are finite constants that do not increase with the matrix sizes $n$ and $p$. We also assume that the minimum row and column sizes of a block in the null block structure, which we denote as $n_{\mathrm{min}}$ and $p_{\mathrm{min}}$, respectively, satisfy $n_{\mathrm{min}}=\Omega_{p}(m)$ and $p_{\mathrm{min}}=\Omega_{p}(m)$, where we used the following notation:

$$X=\Omega_{p}(f(m))\iff\forall\epsilon>0,\ \exists C>0,M>0,\ \forall m\geq M,\ \mathrm{Pr}\left(Cf(m)\leq X\right)\geq 1-\epsilon.$$

In other words, we assume that with high probability, there is no “too small” block in matrix $A$.

(iv) If the given set of cluster numbers $(K_{0},H_{0})$ is equal to the null cluster numbers $(K,H)$, then we call it a realizable case. Otherwise, we call it an unrealizable case ($K>K_{0}$ or $H>H_{0}$). In Section 4, we see that Theorems 4.2 and 4.3 guarantee the behavior of the test statistic $T$ in unrealizable cases. For now, there is no way to detect the cases where $(K<K_{0})\cap(H\leq H_{0})$ or $(K\leq K_{0})\cap(H<H_{0})$ holds, and to cope with such settings is beyond the scope of this paper.

(v) In the realizable case, we assume that a clustering algorithm is consistent, that is, the probability that it outputs the correct block structure converges to $1$, in the limit of $m\to\infty$. By using this assumption, the proposed method does not depend on a specific clustering algorithm. Several clustering algorithms including [15, 1, 7] have been proven to be consistent.
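As a quick illustration of assumption (i), the sketch below checks numerically that a standard Gaussian entry satisfies the tail bound on a grid of points in $(1,10]$. The choice $\vartheta=1$ and the grid are illustrative assumptions, not values taken from the paper.

```python
import math

def gaussian_two_sided_tail(x):
    """Pr(|Z| > x) for a standard normal Z, via the complementary error function."""
    return math.erfc(x / math.sqrt(2.0))

def sub_exponential_bound(x, theta):
    """Right-hand side of assumption (i): theta^{-1} * exp(-x^theta)."""
    return math.exp(-(x ** theta)) / theta

theta = 1.0  # illustrative choice, not a value from the paper
grid = [1.0 + 0.01 * step for step in range(1, 901)]  # x in (1, 10]
holds = all(
    gaussian_two_sided_tail(x) <= sub_exponential_bound(x, theta) for x in grid
)
print(holds)
```

A grid check of this kind is only a sanity check; it does not replace an analytic verification of the tail condition for a given entry distribution.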

3 Relation to existing works

In this section, we briefly review the related works and explain the differences between them and the proposed method.

3.1 Model selection for block models

Statistical-test-based methods (for SBM)

Recently, several methods have been proposed for testing the properties of a given observed matrix in relation to SBMs [5, 29, 24, 21, 53]. Particularly, the methods proposed in [5, 29, 21] have enabled us to estimate the number of blocks for SBMs. However, these methods differ from ours in the problem setting; they can be applied only to an SBM setting, where an observed matrix is a square symmetric matrix, and the cluster assignments are the same for rows and columns. There has been no method for estimating the block number for LBMs, where rows and columns (not necessarily square) of an observed matrix are simultaneously decomposed into clusters.

Cross-validation-based methods

Cross-validation is a widely used method for model selection, where a data set is first split into training and test data sets, and then the best model with the minimum test error is determined. Recently, cross-validation methods for matrix data have been proposed [11, 30, 25, 8] to determine the number of clusters in network data. Although the purpose of these methods and our method is similar, these methods differ from ours in that their target is the network data, where the observed matrix is square and its rows and columns represent the same node sets (thus, the block structure is symmetric regardless of whether the network itself is directed or undirected). Moreover, unlike a statistical test, these methods cannot provide quantitative knowledge about the reliability of the selected model. Furthermore, the computational cost of cross-validation is generally high because it requires the iterative calculation of the test error with different data set partitions.

Information-criterion-based methods

Another approach for determining the number of blocks in a matrix is to estimate the generalization error or marginal likelihood by some information criterion for given sets of block numbers. By using such information criteria, we can select a model in a statistically meaningful (non-heuristic) way. In regard to block models, many variants of BIC have been proposed [26, 38, 27, 20, 34]. Unlike our test-based method, which only requires a clustering algorithm to satisfy the consistency condition (Section 2), an information criterion for a theoretical guarantee should be carefully chosen according to the given clustering algorithm. For instance, BIC can be employed for estimating the marginal likelihood only if the Fisher information matrix of the model is regular, which is not the case for block models.

To solve this problem, as an alternative criterion to BIC, the integrated completed likelihood (ICL) criterion has been used in many studies for estimating the number of blocks in LBMs [31, 51, 10]. In ICL, we first derive a marginal likelihood for a given set of an observed matrix and block assignments and then substitute the set of estimated block assignments to approximate the marginal likelihood. However, since ICL is computed based on a single estimator of block assignments, there is no guarantee for the goodness of the approximation of marginal likelihood.

Similar to cross-validation-based methods, information-criterion-based methods cannot provide a probabilistic guarantee for the reliability of the selected model, which is a disadvantage for the purpose of knowledge discovery. The computational cost also becomes a problem because the computation of an information criterion requires the process of approximating the posterior distribution by MCMC.

Other model selection methods

Aside from the information criteria, several studies have proposed to determine the number of blocks in LBMs based on the co-clustering adjusted Rand index [42], the extended modularity for biclustering [28], or the expected posterior loss for a given loss function [41]. Another approach is to define the posterior distribution not only on cluster assignments of rows and columns but also on row and column cluster numbers [50, 36]. Unlike the model selection approaches, such nonparametric Bayesian methods can estimate the distribution of the block numbers. The best-fitted number of blocks can be determined based on the posterior distribution (e.g., we can choose a MAP estimator [36]). However, in this case, the computational cost of MCMC is higher than that of the information-criterion-based methods because it requires a large number of iterations to approximate the posterior distribution on both the block assignments and the number of blocks.

4 Test statistic for determining the set of cluster numbers

To derive the test statistic for the proposed goodness-of-fit test, we first normalize each entry $A_{ij}$ of an observed matrix $A$ by subtracting $P_{ij}$ and dividing it by $\sigma_{ij}$ , where $P$ and $\sigma$ , respectively, are the block-wise mean and standard deviation in (2):

$$Z=(Z_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad Z_{ij}=\frac{A_{ij}-P_{ij}}{\sigma_{ij}}.\tag{4}$$

By definition, each entry $Z_{ij}$ of matrix $Z$ in (4) independently follows a distribution with zero mean and standard deviation of one. Therefore, according to the result in [39], if $n=n(p)$ and $n/p\to\gamma\neq 0,\infty$ in the limit of $p\to\infty$ , the scaled maximum eigenvalue $T^{*}$ of matrix $Z^{\top}Z$ converges in law to the Tracy-Widom distribution with index $1$ ( $TW_{1}$ ) in the limit of $p\to\infty$ :

$$T^{*}=\frac{\lambda_{1}-a}{b},\quad T^{*}\rightsquigarrow TW_{1}\ \text{(Convergence in law)},\tag{5}$$

where $\lambda_{1}$ is the maximum eigenvalue of matrix $Z^{\top}Z$ and

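$$a=(\sqrt{n}+\sqrt{p})^{2},\quad b=(\sqrt{n}+\sqrt{p})\left(\frac{1}{\sqrt{n}}+\frac{1}{\sqrt{p}}\right)^{\frac{1}{3}}.\tag{6}$$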
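The scaled eigenvalue $T^{*}$ can be sketched as follows, assuming the centering and scaling constants $a=(\sqrt{n}+\sqrt{p})^{2}$ and $b=(\sqrt{n}+\sqrt{p})(1/\sqrt{n}+1/\sqrt{p})^{1/3}$. The function names are hypothetical, and plain power iteration is used here in place of a library eigensolver only to keep the example dependency-free; it is not the authors' implementation.

```python
import math
import random

def tw_centering_scaling(n, p):
    """Centering a and scaling b for the largest eigenvalue of Z^T Z (assumed form)."""
    root_sum = math.sqrt(n) + math.sqrt(p)
    a = root_sum ** 2
    b = root_sum * (1.0 / math.sqrt(n) + 1.0 / math.sqrt(p)) ** (1.0 / 3.0)
    return a, b

def largest_eigenvalue_gram(Z, iterations=500, seed=0):
    """Approximate the largest eigenvalue of Z^T Z by power iteration."""
    Zt = list(zip(*Z))  # columns of Z
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in Zt]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    eigenvalue = 0.0
    for _ in range(iterations):
        Zv = [sum(z * x for z, x in zip(row, v)) for row in Z]
        w = [sum(z * y for z, y in zip(col, Zv)) for col in Zt]  # w = Z^T Z v
        eigenvalue = sum(x * y for x, y in zip(v, w))  # Rayleigh quotient, |v| = 1
        scale = math.sqrt(sum(x * x for x in w))
        if scale == 0.0:
            break
        v = [x / scale for x in w]
    return eigenvalue

def scaled_largest_eigenvalue(Z):
    """T* = (lambda_1 - a) / b for a standardized n x p matrix Z."""
    a, b = tw_centering_scaling(len(Z), len(Z[0]))
    return (largest_eigenvalue_gram(Z) - a) / b

rng = random.Random(1)
Z = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(30)]  # n = 30, p = 20
print(scaled_largest_eigenvalue(Z))
```

Power iteration converges slowly when the top eigenvalues are close, so a dedicated eigenvalue or singular value routine would be the more practical choice for large matrices.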

In most cases, the null cluster numbers $(K,H)$ and the null cluster assignments $g^{(1)}$ and $g^{(2)}$ are unknown in advance. Therefore, we can only estimate the block structure based on the observed matrix $A$ and the given cluster numbers. Let $(K_{0},H_{0})$ be the given set of row and column cluster numbers, and $\hat{g}^{(1)}$ and $\hat{g}^{(2)}$ , respectively, be the estimated cluster assignments for rows and columns. Based on such an estimated block structure $(\hat{g}^{(1)},\hat{g}^{(2)})$ , we estimate the block-wise mean and standard deviation by

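$$\begin{aligned}
\hat{B}&=(\hat{B}_{kh})_{1\leq k\leq K_{0},1\leq h\leq H_{0}},&\hat{B}_{kh}&=\frac{1}{|I_{k}||J_{h}|}\sum_{i\in I_{k},j\in J_{h}}A_{ij},\\
\hat{P}&=(\hat{P}_{ij})_{1\leq i\leq n,1\leq j\leq p},&\hat{P}_{ij}&=\hat{B}_{\hat{g}^{(1)}_{i}\hat{g}^{(2)}_{j}},\\
\hat{S}&=(\hat{S}_{kh})_{1\leq k\leq K_{0},1\leq h\leq H_{0}},&\hat{S}_{kh}&=\sqrt{\frac{1}{|I_{k}||J_{h}|}\sum_{i\in I_{k},j\in J_{h}}(A_{ij}-\hat{P}_{ij})^{2}},\\
\hat{\sigma}&=(\hat{\sigma}_{ij})_{1\leq i\leq n,1\leq j\leq p},&\hat{\sigma}_{ij}&=\hat{S}_{\hat{g}^{(1)}_{i}\hat{g}^{(2)}_{j}},
\end{aligned}\tag{7}$$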

where $I_{k}$ is the set of row indices of matrix $A$ that are assigned to the $k$ th cluster, and $J_{h}$ is the set of column indices of matrix $A$ that are assigned to the $h$ th cluster:

$$I_{k}=\{i:\hat{g}^{(1)}_{i}=k\},\quad J_{h}=\{j:\hat{g}^{(2)}_{j}=h\}.\tag{8}$$

The consistency assumption (v) guarantees that if $(K_{0},H_{0})=(K,H)$ , the probability that the cluster assignments $(I_{k})_{1\leq k\leq K_{0}}$ and $(J_{h})_{1\leq h\leq H_{0}}$ are correct converges to $1$ in the limit of $m\to\infty$ .

We define an estimator of normalized matrix $Z$ in (4) based on the estimated block-wise mean $\hat{P}$ and standard deviation $\hat{\sigma}$ in (7):

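$$\hat{Z}=(\hat{Z}_{ij})_{1\leq i\leq n,1\leq j\leq p},\quad\hat{Z}_{ij}=\frac{A_{ij}-\hat{P}_{ij}}{\hat{\sigma}_{ij}}.\tag{9}$$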

The test statistic $T$ for the proposed goodness-of-fit test is given by the scaled maximum eigenvalue of matrix $\hat{Z}^{\top}\hat{Z}$ :

$$T=\frac{\hat{\lambda}_{1}-a}{b},\tag{10}$$

where $\hat{\lambda}_{1}$ is the maximum eigenvalue of matrix $\hat{Z}^{\top}\hat{Z}$ , and $a$ and $b$ are given by (6).
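The computation of $T$ from an observed matrix and a pair of estimated cluster assignments can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the block standard deviation is taken as the root mean squared deviation within each estimated block, the constants are $a=(\sqrt{n}+\sqrt{p})^{2}$ and $b=(\sqrt{n}+\sqrt{p})(1/\sqrt{n}+1/\sqrt{p})^{1/3}$, power iteration stands in for a library eigensolver, and all function names are hypothetical.

```python
import math
import random

def standardize_by_blocks(A, row_labels, col_labels, K0, H0):
    """Return Z_hat: each entry centered and scaled by its estimated block mean and std.

    Assumes every estimated block is non-empty and not constant (so std > 0).
    """
    n, p = len(A), len(A[0])
    count = [[0] * H0 for _ in range(K0)]
    total = [[0.0] * H0 for _ in range(K0)]
    for i in range(n):
        for j in range(p):
            count[row_labels[i]][col_labels[j]] += 1
            total[row_labels[i]][col_labels[j]] += A[i][j]
    mean = [[total[k][h] / count[k][h] for h in range(H0)] for k in range(K0)]
    sq_dev = [[0.0] * H0 for _ in range(K0)]
    for i in range(n):
        for j in range(p):
            k, h = row_labels[i], col_labels[j]
            sq_dev[k][h] += (A[i][j] - mean[k][h]) ** 2
    std = [[math.sqrt(sq_dev[k][h] / count[k][h]) for h in range(H0)] for k in range(K0)]
    return [
        [(A[i][j] - mean[row_labels[i]][col_labels[j]]) / std[row_labels[i]][col_labels[j]]
         for j in range(p)]
        for i in range(n)
    ]

def largest_eigenvalue_gram(Z, iterations=300, seed=0):
    """Approximate the largest eigenvalue of Z^T Z by power iteration."""
    Zt = list(zip(*Z))
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in Zt]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    eigenvalue = 0.0
    for _ in range(iterations):
        Zv = [sum(z * x for z, x in zip(row, v)) for row in Z]
        w = [sum(z * y for z, y in zip(col, Zv)) for col in Zt]  # w = Z^T Z v
        eigenvalue = sum(x * y for x, y in zip(v, w))  # Rayleigh quotient, |v| = 1
        scale = math.sqrt(sum(x * x for x in w))
        if scale == 0.0:
            break
        v = [x / scale for x in w]
    return eigenvalue

def lbm_test_statistic(A, row_labels, col_labels, K0, H0):
    """T = (largest eigenvalue of Z_hat^T Z_hat - a) / b."""
    n, p = len(A), len(A[0])
    root_sum = math.sqrt(n) + math.sqrt(p)
    a = root_sum ** 2
    b = root_sum * (1.0 / math.sqrt(n) + 1.0 / math.sqrt(p)) ** (1.0 / 3.0)
    Z_hat = standardize_by_blocks(A, row_labels, col_labels, K0, H0)
    return (largest_eigenvalue_gram(Z_hat) - a) / b

# Toy data: two row clusters (means 0 and 5, unit std), one column cluster.
rng = random.Random(0)
true_rows = [0] * 20 + [1] * 20
cols = [0] * 40
A = [[rng.gauss(0.0 if k == 0 else 5.0, 1.0) for _ in cols] for k in true_rows]
T_correct = lbm_test_statistic(A, true_rows, cols, 2, 1)  # correct structure
T_under = lbm_test_statistic(A, [0] * 40, cols, 1, 1)     # too few row clusters
print(T_correct, T_under)
```

In this toy setting the statistic stays moderate when the correct structure is supplied and becomes very large when the two row clusters are merged, which is the qualitative behavior the test relies on. The true assignments are supplied directly here; in practice they would come from a clustering algorithm.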

Based on the following results in Theorems 4.1, 4.2 and 4.3, we propose a one-sided goodness-of-fit test for a given set of cluster numbers $(K_{0},H_{0})$ at the significance level of $\alpha$ by using the test statistic $T$ :

$$\text{Reject null hypothesis}\ \left((K,H)=(K_{0},H_{0})\right),\ \text{if}\ T\geq t(\alpha),\tag{11}$$

where $t(\alpha)$ is the $\alpha$ upper quantile of the Tracy-Widom distribution with index $1$ . By applying the sequentially ordered test that we explained in Section 2 based on the above rejection rule (11), we can select a set of row and column cluster numbers $(\hat{K},\hat{H})$ for a given observed matrix $A$ .
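The rejection rule and the sequential selection can be combined in a short sketch. The quantiles below are approximate values of the Tracy-Widom distribution with index $1$ as commonly tabulated, and should be checked against a numerical implementation before being relied on. The callable `statistic_for`, the toy statistic, the cap `max_level`, and the choice to return the first accepted pair within a step are assumptions made for this example.

```python
# Approximate upper quantiles t(alpha) of the Tracy-Widom distribution (index 1),
# as commonly tabulated; verify before relying on them.
TW1_UPPER_QUANTILE = {0.10: 0.4501, 0.05: 0.9793, 0.01: 2.0234}

def reject_null(T, alpha=0.05):
    """One-sided rule: reject (K, H) = (K0, H0) when T >= t(alpha)."""
    return T >= TW1_UPPER_QUANTILE[alpha]

def select_cluster_numbers(statistic_for, alpha=0.05, max_level=10):
    """Run the sequential test and return the first accepted (K0, H0), or None.

    statistic_for(K0, H0) is a hypothetical callable that clusters the data
    with (K0, H0) clusters and returns the test statistic T.
    """
    for level in range(1, max_level + 1):
        for k0 in range(1, level + 1):
            h0 = level + 1 - k0
            if not reject_null(statistic_for(k0, h0), alpha):
                return k0, h0
    return None

def toy_statistic(k0, h0):
    """Stand-in for clustering plus computing T: small only when K0 >= 2 and H0 >= 3."""
    return -1.0 if (k0 >= 2 and h0 >= 3) else 100.0

print(select_cluster_numbers(toy_statistic))  # (2, 3)
```

Because every candidate with too few row or column clusters yields a large statistic in this toy setup, the procedure stops at the first step that contains a pair with enough clusters in both directions.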

In the proof of Theorem 4.1, we also use the following notations. Let $\tilde{B}_{kh}$ and $\tilde{S}_{kh}$ , respectively, be the sample mean and standard deviation of all the entries in the $(k,h)$ th null block in matrix $A$ . Based on such notations, we define the sample mean matrix $\tilde{P}$ and standard deviation matrix $\tilde{\sigma}$ for the correct block structure, and matrix $\tilde{Z}$ by:

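$$\begin{aligned}
\tilde{P}&=(\tilde{P}_{ij})_{1\leq i\leq n,1\leq j\leq p},&\tilde{P}_{ij}&=\tilde{B}_{g^{(1)}_{i}g^{(2)}_{j}},\\
\tilde{\sigma}&=(\tilde{\sigma}_{ij})_{1\leq i\leq n,1\leq j\leq p},&\tilde{\sigma}_{ij}&=\tilde{S}_{g^{(1)}_{i}g^{(2)}_{j}},\\
\tilde{Z}&=(\tilde{Z}_{ij})_{1\leq i\leq n,1\leq j\leq p},&\tilde{Z}_{ij}&=\frac{A_{ij}-\tilde{P}_{ij}}{\tilde{\sigma}_{ij}}.
\end{aligned}\tag{12}$$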

Theorem 4.1 (Realizable case).

We assume that the following condition holds: $n=n(p)$ , $n/p\to\gamma\neq 0,\infty$ in the limit of $p\to\infty$ . Under the consistency assumption (v) for the clustering algorithm, if $(K_{0},H_{0})=(K,H)$ ,

$$T\rightsquigarrow TW_{1}\ \text{(Convergence in law)},\tag{13}$$

in the limit of $p\to\infty$ , where $T$ is defined as in (10).

Proof.

We denote the operator norm by $\|\cdot\|_{\mathrm{op}}$ ,

$$\|A\|_{\mathrm{op}}=\sup_{u\in\mathbb{R}^{p}}\frac{\|Au\|}{\|u\|}.\tag{14}$$

First of all, we derive the difference between $B_{kh}$ ($S_{kh}$) and $\tilde{B}_{kh}$ ($\tilde{S}_{kh}$), which have been defined in (2) and (12). Since the number of entries in the block is proportional to $m^{2}$ by the assumption (iii), $\sqrt{m^{2}}\left(B_{kh}-\tilde{B}_{kh}\right)$ converges to $\mathcal{N}(0,S_{kh}^{2})$ from the central limit theorem. Therefore, from Prokhorov’s theorem [48], we have

$$\tilde{B}_{kh}=B_{kh}+O_{p}\left(\frac{1}{m}\right).\tag{15}$$

Also, the following equation holds (The proof is given in Appendix A):

$$\tilde{S}_{kh}=S_{kh}+O_{p}\left(\frac{1}{m}\right).\tag{16}$$

From here, we derive the difference between the maximum eigenvalue $\tilde{\lambda}_{1}$ of matrix $\tilde{Z}^{\top}\tilde{Z}$ and the maximum eigenvalue $\lambda_{1}$ of matrix $Z^{\top}Z$, where the definitions of matrices $Z$ and $\tilde{Z}$ have been given in (4) and (12), respectively. From (5), we have $\lambda_{1}=O_{p}(m)$. Therefore, the largest singular value of matrix $Z$, which is equal to $\|Z\|_{\mathrm{op}}$, is of order $O_{p}(\sqrt{m})$.
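The order of $\lambda_{1}$ can be made explicit with a short calculation. It assumes the centering and scaling constants $a=(\sqrt{n}+\sqrt{p})^{2}$ and $b=(\sqrt{n}+\sqrt{p})(1/\sqrt{n}+1/\sqrt{p})^{1/3}$, and writes $n=c_{1}m$ and $p=c_{2}m$ with positive constants $c_{1}$ and $c_{2}$ (notation introduced only for this sketch, following assumption (ii)). Since $T^{*}$ converges in law, $T^{*}=O_{p}(1)$, and therefore:

```latex
\begin{aligned}
a &= (\sqrt{c_{1}}+\sqrt{c_{2}})^{2}\, m = O(m),\\
b &= (\sqrt{c_{1}}+\sqrt{c_{2}})\left(c_{1}^{-1/2}+c_{2}^{-1/2}\right)^{1/3} m^{1/2}\, m^{-1/6} = O(m^{1/3}),\\
\lambda_{1} &= a + b\,T^{*} = O(m) + O(m^{1/3})\,O_{p}(1) = O_{p}(m),\\
\|Z\|_{\mathrm{op}} &= \sqrt{\lambda_{1}} = O_{p}(\sqrt{m}).
\end{aligned}
```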

By the subadditivity of the operator norm, we have

$$\left|\|Z\|_{\mathrm{op}}-\|\tilde{Z}\|_{\mathrm{op}}\right|\leq\|Z-\tilde{Z}\|_{\mathrm{op}}.\tag{17}$$

Let $A^{(k,h)}$, $P^{(k,h)}$, $\tilde{P}^{(k,h)}$, $Z^{(k,h)}$ and $\tilde{Z}^{(k,h)}$, respectively, be the $(k,h)$th null blocks of matrices $A$, $P$, $\tilde{P}$, $Z$ and $\tilde{Z}$. We also denote the row and column sizes of the $(k,h)$th null block as $n_{k}$ and $p_{h}$, respectively. From the definitions in (4) and (12), we have

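$$Z^{(k,h)}=\frac{A^{(k,h)}-P^{(k,h)}}{S_{kh}},\quad\tilde{Z}^{(k,h)}=\frac{A^{(k,h)}-\tilde{P}^{(k,h)}}{\tilde{S}_{kh}}.\tag{18}$$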

Combining this with (15), (16) and the fact that the Frobenius norm upper bounds the operator norm, we have

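$$\begin{aligned}
\|Z^{(k,h)}-\tilde{Z}^{(k,h)}\|_{\mathrm{op}}&=\left\|\frac{A^{(k,h)}-P^{(k,h)}}{S_{kh}}-\frac{A^{(k,h)}-\tilde{P}^{(k,h)}}{\tilde{S}_{kh}}\right\|_{\mathrm{op}}\\
&=\left\|\frac{A^{(k,h)}-P^{(k,h)}}{S_{kh}}-\frac{A^{(k,h)}-P^{(k,h)}}{\tilde{S}_{kh}}+\frac{A^{(k,h)}-P^{(k,h)}}{\tilde{S}_{kh}}-\frac{A^{(k,h)}-\tilde{P}^{(k,h)}}{\tilde{S}_{kh}}\right\|_{\mathrm{op}}\\
&\leq\left\|\frac{A^{(k,h)}-P^{(k,h)}}{S_{kh}}-\frac{A^{(k,h)}-P^{(k,h)}}{\tilde{S}_{kh}}\right\|_{\mathrm{op}}+\left\|\frac{A^{(k,h)}-P^{(k,h)}}{\tilde{S}_{kh}}-\frac{A^{(k,h)}-\tilde{P}^{(k,h)}}{\tilde{S}_{kh}}\right\|_{\mathrm{op}}\\
&=\frac{|\tilde{S}_{kh}-S_{kh}|}{S_{kh}\tilde{S}_{kh}}\|A^{(k,h)}-P^{(k,h)}\|_{\mathrm{op}}+\frac{1}{\tilde{S}_{kh}}\|P^{(k,h)}-\tilde{P}^{(k,h)}\|_{\mathrm{op}}\\
&\leq\frac{|\tilde{S}_{kh}-S_{kh}|}{S_{kh}\tilde{S}_{kh}}\|A^{(k,h)}-P^{(k,h)}\|_{\mathrm{op}}+\frac{1}{\tilde{S}_{kh}}\|P^{(k,h)}-\tilde{P}^{(k,h)}\|_{\mathrm{F}}\\
&=\frac{|\tilde{S}_{kh}-S_{kh}|}{\tilde{S}_{kh}}\|Z^{(k,h)}\|_{\mathrm{op}}+\frac{1}{\tilde{S}_{kh}}\sqrt{n_{k}p_{h}}\left|B_{kh}-\tilde{B}_{kh}\right|\\
&=\frac{O_{p}(1/m)}{S_{kh}+O_{p}(1/m)}\|Z^{(k,h)}\|_{\mathrm{op}}+\frac{O_{p}(1/m)}{S_{kh}+O_{p}(1/m)}\sqrt{n_{k}p_{h}}\quad(\because(15),(16))\\
&=\frac{O_{p}(1/m)}{S_{kh}+O_{p}(1/m)}O_{p}(\sqrt{m})+\frac{O_{p}(1/m)}{S_{kh}+O_{p}(1/m)}\sqrt{n_{k}p_{h}}\quad(\because(5))\\
&=O_{p}\left(\frac{1}{\sqrt{m}}\right)+O_{p}(1)=O_{p}(1).
\end{aligned}\tag{19}$$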

Therefore, since the operator norm of a matrix is not larger than the sum of the operator norms of all of its blocks and the number of blocks is a finite constant, we have

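$$\|Z-\tilde{Z}\|_{\mathrm{op}}\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\|Z^{(k,h)}-\tilde{Z}^{(k,h)}\|_{\mathrm{op}}=O_{p}(1).\tag{20}$$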

By combining this with (17), we obtain

$$\left|\|Z\|_{\mathrm{op}}-\|\tilde{Z}\|_{\mathrm{op}}\right|=O_{p}(1).\tag{21}$$

Next, we consider the joint probability of the event $\mathcal{F}_{m}$ that the clustering algorithm outputs the correct block structure (i.e., $\tilde{Z}=\hat{Z}$ ) and the event $\mathcal{G}_{m,C}$ that $\left|\|Z\|_{\mathrm{op}}-\|\tilde{Z}\|_{\mathrm{op}}\right|\leq C$ holds. Such a joint probability satisfies the following inequality:

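$$\mathrm{Pr}\left(\mathcal{F}_{m}\cap\mathcal{G}_{m,C}\right)\geq 1-\mathrm{Pr}\left(\mathcal{F}^{\mathrm{C}}_{m}\right)-\mathrm{Pr}\left(\mathcal{G}^{\mathrm{C}}_{m,C}\right),\tag{22}$$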

where $\mathcal{A}^{\mathrm{C}}$ is the complement of event $\mathcal{A}$. The consistency assumption (v) guarantees that if $(K_{0},H_{0})=(K,H)$, $\mathrm{Pr}\left(\mathcal{F}^{\mathrm{C}}_{m}\right)$ converges to $0$ in the limit of $m\to\infty$. By combining this fact with (21), we obtain

[TABLE]

which results in

[TABLE]

By using the above results, we can prove that the following equation holds for all $\epsilon\in\left(0,\frac{2}{7}\right)$ (The proof is given in Appendix B):

[TABLE]

From (5), (25), and Slutsky’s theorem, by setting $\epsilon<\frac{1}{21}$ ,

[TABLE]

This is equivalent to the statement of Theorem 4.1. ∎

Theorem 4.2 (Unrealizable case, lower bound).

Suppose $K_{0}<K$ or $H_{0}<H$ .

[TABLE]

where $T$ is defined as in (10).

Proof.

Let $\bar{P}$ be a matrix that consists of the estimated block structure and whose entries are the population block-wise means, which can be calculated using $P$ (see also Figure 2).

To derive the difference between matrices $P$ and $\hat{P}$ , we first focus on the relationship between matrices $P$ and $\bar{P}$ . In the unrealizable case (i.e., $K_{0}<K$ or $H_{0}<H$ ), we can assume $K_{0}<K$ without loss of generality.

Let $n_{k}$ be the number of rows in the $k$th null row cluster. For all the null row cluster indices $k\in\{1,\dots,K\}$, at least one estimated row cluster contains $n_{k}/K_{0}$ or more rows that are assigned to the $k$th row cluster in the null block structure (otherwise, the total number of rows in the $k$th null row cluster is smaller than $n_{k}$). Since $K_{0}<K$, at least one estimated block contains two or more sets of rows whose null row clusters are mutually different, and both of which have the row sizes of at least $n_{\mathrm{min}}/K_{0}$, where $n_{\mathrm{min}}$ is the minimum row size of a block in the null block structure. By the same reasoning, for all the null column cluster indices $h\in\{1,\dots,H\}$, at least one estimated column cluster contains $p_{h}/H_{0}$ or more columns that are assigned to the $h$th column cluster in the null block structure, where $p_{h}$ is the number of columns in the $h$th null column cluster. By combining these facts, there exists at least one estimated block that contains two or more submatrices, both of which have the sizes of at least $(n_{\mathrm{min}}/K_{0})\times(p_{\mathrm{min}}/H_{0})$ and whose null blocks are mutually different.

Let $X_{1}$ and $X_{2}$ be such submatrices, whose null block-wise means are $q_{1}$ and $q_{2}$ , respectively. We can assume $q_{1}>q_{2}$ without loss of generality. In matrix $\bar{P}$ , which has the estimated block structure, all the entries of both $X_{1}$ and $X_{2}$ take the same value $\bar{q}$ . Here, $|q_{2}-\bar{q}|\geq\frac{|q_{1}-q_{2}|}{2}$ holds if $\bar{q}\geq\frac{q_{1}+q_{2}}{2}$ , and $|q_{1}-\bar{q}|\geq\frac{|q_{1}-q_{2}|}{2}$ holds otherwise. Therefore, for any $\bar{q}$ , there exists at least one submatrix $\bar{X}$ (which is either $X_{1}$ or $X_{2}$ ) with a size of at least $(n_{\mathrm{min}}/K_{0})\times(p_{\mathrm{min}}/H_{0})$ , where all the entries are $q$ (which is either $q_{1}$ or $q_{2}$ ) in matrix $P$ and

[TABLE]

Let $(k_{1},h_{1})$ be the row and column cluster indices of the estimated block which contains the above submatrix $\bar{X}$ . We denote the row and column sizes of the $(k_{1},h_{1})$ th estimated block as $\underline{n}_{1}$ and $\underline{p}_{1}$ , respectively. Let $\underline{A}^{(k_{1},h_{1})}$ , $\underline{P}^{(k_{1},h_{1})}$ , $\underline{\bar{P}}^{(k_{1},h_{1})}$ , and $\underline{\hat{P}}^{(k_{1},h_{1})}$ , respectively, be the $(k_{1},h_{1})$ th estimated block of $A$ , $P$ , $\bar{P}$ , and $\hat{P}$ . We define $\hat{q}\equiv\hat{B}_{k_{1}h_{1}}$ . In regard to the difference between matrices $\bar{P}$ and $\hat{P}$ (both of which have the estimated block structure), we have

[TABLE]

where $A^{(k,h)}$ , $P^{(k,h)}$ , and $Z^{(k,h)}$ , respectively, are the $(k,h)$ th null blocks of matrices $A$ , $P$ , and $Z$ , and $\bm{u}_{1}=[1,1,\dots,1]^{\top}\in\mathbb{R}^{\underline{n}_{1}}$ and $\bm{u}_{2}=[1,1,\dots,1]^{\top}\in\mathbb{R}^{\underline{p}_{1}}$ . To derive the final equation in (4), we used the assumption that $n_{\mathrm{min}},p_{\mathrm{min}}=\Omega_{p}(m)$ and the fact that $\|Z\|_{\mathrm{op}}$ is equal to the largest singular value of $Z$ , which is $O_{p}(\sqrt{m})$ from (5).

Let $\mathcal{E}_{m,C}$ be the event that $|q-\bar{q}|-CKH/\sqrt{m}\leq|q-\hat{q}|$ holds. For all $q$ , $\bar{q}$ , and $\hat{q}$ , the following inequality holds:

[TABLE]

By combining (4) and (30), we obtain

[TABLE]

From now on, we denote the row and column sizes of submatrix $\bar{X}$ , respectively, by $\bar{n}_{1}$ and $\bar{p}_{1}$ . Let $A^{*}$ , $P^{*}$ , $\bar{P}^{*}$ , $\hat{P}^{*}$ , $Z^{*}$ , and $\hat{Z}^{*}$ , respectively, be the submatrices of matrices $A$ , $P$ , $\bar{P}$ , $\hat{P}$ , $Z$ , and $\hat{Z}$ with the same row and column indices as submatrix $\bar{X}$ . We also denote the constant entries of the submatrices of $\sigma$ and $\hat{\sigma}$ with the same row and column indices as submatrix $\bar{X}$ , respectively, as $\sigma^{*}$ and $\hat{\sigma}^{*}$ . From the definition (9) and since the operator norm of a submatrix is not larger than that of the original matrix, we have

[TABLE]

First, the order of the estimated standard deviation $\hat{\sigma}^{*}$ is given by

[TABLE]

The proof of (33) is in Appendix C.

The only non-zero (and thus, the largest) singular value of matrix $\left(P^{*}-\hat{P}^{*}\right)$ is $\sqrt{\bar{n}_{1}\bar{p}_{1}}\left|q-\hat{q}\right|$ . Since the largest singular value of a matrix is equal to its operator norm, we have

[TABLE]

Therefore, by combining this fact with (4), if the statement of event $\mathcal{E}_{m,C}$ holds, the following inequality also holds:

[TABLE]

which implies that $\|P^{*}-\hat{P}^{*}\|_{\mathrm{op}}=\Omega_{p}\left(\delta m\right)$ , where $\delta\equiv\min\{\Delta B^{(1)},\Delta B^{(2)}\}$ .
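The singular value fact used in this step can be verified numerically: a matrix whose entries are all equal to $q-\hat{q}$ has rank one, and its only non-zero singular value is $\sqrt{\bar{n}_{1}\bar{p}_{1}}\left|q-\hat{q}\right|$ . The following sketch uses illustrative values for the sizes and for $q$ and $\hat{q}$ (they are assumptions, not values from the paper).

```python
import numpy as np

n_bar, p_bar = 30, 20          # illustrative submatrix sizes
q, q_hat = 0.7, 0.4            # illustrative population and estimated block means

# P* - P_hat* restricted to the submatrix: every entry equals q - q_hat.
D = np.full((n_bar, p_bar), q - q_hat)

singular_values = np.linalg.svd(D, compute_uv=False)   # sorted in descending order
op_norm = singular_values[0]
expected = np.sqrt(n_bar * p_bar) * abs(q - q_hat)

print(op_norm, expected)               # the two values agree up to rounding
print(singular_values[1:].max())       # remaining singular values are numerically zero
```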

Also, from (5), we have $\|Z^{*}\|_{\mathrm{op}}\leq\|Z\|_{\mathrm{op}}=O_{p}(\sqrt{m})$ . By substituting this fact, (33), and (35) into (4), we finally obtain

[TABLE]

Here, $\|\hat{Z}\|_{\mathrm{op}}^{2}$ is equal to the maximum eigenvalue $\hat{\lambda}_{1}$ of matrix $\hat{Z}^{\top}\hat{Z}$ , and the test statistic is $T=\frac{\hat{\lambda}_{1}-a}{b}$ . Using the definition (6), we obtain $a=O_{p}(m)$ and

[TABLE]

where we used the definitions $\beta_{1}\equiv n/m$ and $\beta_{2}\equiv p/m$ .

By combining these results and (36), we obtain

[TABLE]

which concludes the proof. ∎

Theorem 4.3 (Unrealizable case, upper bound).

Suppose $K_{0}<K$ or $H_{0}<H$ . Then,

[TABLE]

where $T$ is defined as in (10).

Proof.

We define $P$ , $\bar{P}$ , and $\hat{P}$ as in Theorem 4.2. Let $\underline{\hat{Z}}^{(k,h)}$ , $\underline{A}^{(k,h)}$ , and $\underline{\hat{P}}^{(k,h)}$ , respectively, be the $(k,h)$ th estimated blocks of matrices $\hat{Z}$ , $A$ , and $\hat{P}$ . We denote the row and column sizes of the $(k,h)$ th estimated block as $\underline{n}_{k}$ and $\underline{p}_{h}$ , respectively. Since the operator norm of a matrix is not larger than the sum of the operator norms of all its blocks, we have

[TABLE]

The test statistic is $T=\frac{\hat{\lambda}_{1}-a}{b}$ , where $\hat{\lambda}_{1}=\|\hat{Z}\|_{\mathrm{op}}^{2}=O_{p}(m^{2})$ . Based on the same discussion as in Theorem 4.2, $a=O_{p}(m)$ and (4) hold. Consequently, we obtain $T=O_{p}(m^{2}/m^{\frac{1}{3}})=O_{p}(m^{\frac{5}{3}})$ , which concludes the proof. ∎
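The block-wise bound on the operator norm used in this proof follows from the triangle inequality, since a matrix is the sum of its zero-padded blocks and padding with zeros does not change the operator norm. A minimal numerical sketch, with a hypothetical helper `blockwise_norm_bound` and random labels chosen only for illustration:

```python
import numpy as np

def blockwise_norm_bound(M, row_labels, col_labels):
    """Sum of the operator norms of all blocks induced by the given labels."""
    total = 0.0
    for k in np.unique(row_labels):
        for h in np.unique(col_labels):
            block = M[np.ix_(row_labels == k, col_labels == h)]
            total += np.linalg.norm(block, ord=2)   # largest singular value of the block
    return total

rng = np.random.default_rng(0)
n, p, K0, H0 = 60, 45, 3, 2                  # illustrative sizes
M = rng.standard_normal((n, p))
row_labels = rng.integers(0, K0, size=n)
col_labels = rng.integers(0, H0, size=p)

full_norm = np.linalg.norm(M, ord=2)
bound = blockwise_norm_bound(M, row_labels, col_labels)
print(full_norm, bound)                      # full_norm never exceeds bound
```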

5 Experiments

5.1 Realizable case: convergence of test statistic $T$ in law to Tracy-Widom distribution

First of all, we checked the convergence of the proposed test statistic $T$ in law to the Tracy-Widom distribution with index $1$ , under the realizable setting, which has been stated in Theorem 4.1, by using synthetic data that were generated based on three types of distributions:

• Gaussian Latent Block Model: The entries of the observed matrices in the $(k,h)$ th block were generated from the normal distribution $\mathcal{N}(B_{kh},S_{kh})$ . In the Gaussian LBM setting, we used the following null model and parameters:

[TABLE]

• Bernoulli Latent Block Model: The entries of the observed matrices in the $(k,h)$ th block were generated from the Bernoulli distribution $\mathrm{Bernoulli}(B_{kh})$ . In the Bernoulli LBM setting, we used the following null model and parameters:

[TABLE]

• Poisson Latent Block Model: The entries of the observed matrices in the $(k,h)$ th block were generated from the Poisson distribution $\mathrm{Pois}(B_{kh})$ . In the Poisson LBM setting, we used the following null model and parameters:

[TABLE]

Based on each of the above latent block models, we randomly generated $1000$ observed matrices, estimated their block structures using Ward’s hierarchical clustering algorithm [49], and computed the test statistic $T$ . With respect to the matrix size, we tried the following $10$ settings: $(n,p)=(300\times i,225\times i)$ , $i=1,\dots,10$ . When generating an observed matrix, the null cluster of each row was randomly chosen from the discrete uniform distribution on $\{1,2,3,4\}$ . Similarly, the null cluster of each column was randomly chosen from the discrete uniform distribution on $\{1,2,3\}$ .
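The data generation described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function `sample_lbm` is hypothetical, the block-wise means in `B_example` are placeholder values (not the parameters used in the experiments), and `S` is treated as the block-wise standard deviation.

```python
import numpy as np

def sample_lbm(n, p, B, S=None, dist="gaussian", rng=None):
    """Draw one n x p observed matrix from a latent block model.

    B (K x H) holds block-wise means; S (K x H) holds block-wise standard
    deviations and is used only in the Gaussian case.
    """
    rng = np.random.default_rng() if rng is None else rng
    K, H = B.shape
    g1 = rng.integers(0, K, size=n)      # null row clusters, uniform on {0, ..., K-1}
    g2 = rng.integers(0, H, size=p)      # null column clusters, uniform on {0, ..., H-1}
    P = B[np.ix_(g1, g2)]                # P_ij = B[g1_i, g2_j]
    if dist == "gaussian":
        A = rng.normal(loc=P, scale=S[np.ix_(g1, g2)])
    elif dist == "bernoulli":
        A = rng.binomial(1, P)
    elif dist == "poisson":
        A = rng.poisson(P)
    else:
        raise ValueError(f"unknown distribution: {dist}")
    return A, g1, g2

# Placeholder parameters for (K, H) = (4, 3); NOT the values used in the paper.
B_example = np.linspace(0.1, 0.9, 12).reshape(4, 3)
A, g1, g2 = sample_lbm(300, 225, B_example, dist="bernoulli",
                       rng=np.random.default_rng(0))
print(A.shape, A.mean())
```

Cluster indices are zero-based here, whereas the text uses $\{1,2,3,4\}$ and $\{1,2,3\}$ ; the two conventions are equivalent.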

Figures 5, 5, and 5, respectively, show the Q-Q plots of the test statistic $T$ against the $TW_{1}$ distribution in the Gaussian, Bernoulli, and Poisson settings. Each plotted point corresponds to a sample of the test statistic $T$ , and the horizontal and vertical axes, respectively, show its theoretical and sample quantiles. These figures show that the test statistic converged in law to the $TW_{1}$ distribution.

Figure 7 shows the ratios of the trials where $T\geq t(0.01)$ , $T\geq t(0.05)$ , and $T\geq t(0.1)$ for the above three distributional settings, where $t(\alpha)$ is the $\alpha$ upper quantile of the $TW_{1}$ distribution. We used the approximated values $t(0.01)\approx 2.02345$ , $t(0.05)\approx 0.97931$ , and $t(0.1)\approx 0.45014$ , according to Table $2$ in [47]. From Figure 7, we see that the tail probabilities of the test statistic $T$ also converged to those of the $TW_{1}$ distribution for all three distributional settings.
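The ratios plotted in this experiment are simple empirical tail frequencies. A minimal sketch using the approximate quantiles quoted above; the function name `rejection_ratios` and the toy sample are illustrative assumptions.

```python
import numpy as np

# Approximate upper quantiles of the TW_1 distribution quoted in the text.
TW1_UPPER_QUANTILES = {0.01: 2.02345, 0.05: 0.97931, 0.1: 0.45014}

def rejection_ratios(T_samples):
    """Fraction of trials with T >= t(alpha) for each significance level alpha."""
    T = np.asarray(T_samples, dtype=float)
    return {alpha: float(np.mean(T >= t)) for alpha, t in TW1_UPPER_QUANTILES.items()}

# Toy sample of test statistics (illustrative values only).
ratios = rejection_ratios([-3.0, -1.2, 0.5, 1.0, 2.5])
print(ratios)
```

Under the realizable setting, each ratio is expected to approach its nominal level $\alpha$ as the matrix size grows.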

We also plotted the results of the Kolmogorov-Smirnov test [9] for the test statistic $T$ in Figure 7. We tested whether the distribution of $T$ is the $TW_{1}$ distribution or not based on the test statistic $D\sqrt{r}$ , where $D$ is the maximum absolute difference between the empirical distribution function of $T$ and the cumulative distribution function of the $TW_{1}$ distribution, and $r$ is the sample size, which is set at $1000$ in this experiment. Figure 7 shows the convergence of the proposed test statistic $T$ in law to the $TW_{1}$ distribution under the realizable setting.
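The statistic $D\sqrt{r}$ can be computed from a sorted sample and any reference cumulative distribution function. The sketch below is generic: `ks_statistic` is a hypothetical helper, and the caller must supply the reference CDF (for this experiment, a numerical approximation of the $TW_{1}$ CDF, which is not provided here). The uniform CDF is used only to make the example self-contained.

```python
import numpy as np

def ks_statistic(samples, cdf):
    """Return (D, D * sqrt(r)) for a sample against a reference CDF."""
    x = np.sort(np.asarray(samples, dtype=float))
    r = x.size
    F = cdf(x)
    # The empirical distribution function jumps at each sample point,
    # so both sides of every jump are compared with the reference CDF.
    d_plus = np.max(np.arange(1, r + 1) / r - F)
    d_minus = np.max(F - np.arange(0, r) / r)
    D = max(d_plus, d_minus)
    return D, D * np.sqrt(r)

# Self-contained example with the Uniform(0, 1) CDF as the reference.
uniform_cdf = lambda x: np.clip(x, 0.0, 1.0)
D, scaled = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(D, scaled)
```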

5.2 Unrealizable case: asymptotic behavior of test statistic $T$

Next, we checked the asymptotic behavior of the proposed test statistic $T$ under the unrealizable setting, which has been stated in Theorems 4.2 and 4.3, by using synthetic data that were generated based on the same three types of distributions as in Section 5.1. By combining Theorems 4.2 and 4.3, we obtain the following theorem:

Theorem 5.1 (Unrealizable case, two-sided bound).

Suppose $K_{0}<K$ or $H_{0}<H$ . Then,

[TABLE]

In other words, with high probability, the proposed test statistic $T$ increases in proportion to $p^{\frac{5}{3}}$ in the limit of $p\to\infty$ , since we have assumed that $p\propto m$ .

With respect to the null models and parameters, we used the same settings as in Section 5.1 for all three distributional settings (i.e., Gaussian, Bernoulli, and Poisson LBMs). Based on these settings, we randomly generated $100$ observed matrices, estimated their block structures using Ward’s hierarchical clustering algorithm [49], and computed the test statistic $T$ . With respect to the matrix size, we tried the following $10$ settings: $(n,p)=(200\times i,150\times i)$ , $i=1,\dots,10$ . When generating an observed matrix, the null cluster of each row was randomly chosen from the discrete uniform distribution on $\{1,2,3,4\}$ . Similarly, the null cluster of each column was randomly chosen from the discrete uniform distribution on $\{1,2,3\}$ .

Figures 9 and 9 show the asymptotic behavior of the proposed test statistic $T$ under the unrealizable setting. As shown in Theorem 5.1, we see that $T$ increases in proportion to $m^{\frac{5}{3}}$ , where $n,p\propto m$ .
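One way to check such a growth rate empirically is to fit a line to $\log T$ against $\log m$ and compare the slope with $\frac{5}{3}$ . The sketch below assumes positive values of $T$ and uses synthetic values that follow the rate exactly; `growth_exponent` is a hypothetical helper, and no experimental data from the paper is reproduced.

```python
import numpy as np

def growth_exponent(m_values, T_values):
    """Least-squares slope of log(T) against log(m)."""
    slope, _intercept = np.polyfit(np.log(m_values), np.log(T_values), deg=1)
    return float(slope)

# Synthetic illustration: T grows exactly like m^(5/3) with an arbitrary constant.
m_values = np.array([150 * i for i in range(1, 11)], dtype=float)
T_values = 2.0 * m_values ** (5.0 / 3.0)
slope = growth_exponent(m_values, T_values)
print(slope)
```

With real simulation output, the fitted slope would only approximate $\frac{5}{3}$ , and averaging $T$ over trials at each size before fitting reduces the noise.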

5.3 Accuracy of the proposed goodness-of-fit test

Finally, we evaluated the proposed goodness-of-fit test in terms of its accuracy. By using synthetic data that were generated based on the same three types of distributions as in Section 5.1, we checked the ratio of trials where the selected set of cluster numbers $(K_{0},H_{0})$ is equal to the null one $(K,H)$ . Here, we set the null set of cluster numbers at $(K,H)=(4,3)$ . For each distributional setting (i.e., Gaussian, Bernoulli, and Poisson LBMs), we tried $10$ settings with respect to the block-wise mean $B$ . The concrete settings were as follows:

• Gaussian Latent Block Model: We used the following parameters:

[TABLE]

• Bernoulli Latent Block Model: We used the following parameters:

[TABLE]

• Poisson Latent Block Model: We used the following parameters:

[TABLE]

With respect to the matrix size, we tried the following $10$ settings for each distributional setting and for each setting of $B$ : $(n,p)=(40\times i,30\times i)$ , $i=1,\dots,10$ . When generating an observed matrix, the null cluster of each row was randomly chosen from the discrete uniform distribution on $\{1,2,3,4\}$ . Similarly, the null cluster of each column was randomly chosen from the discrete uniform distribution on $\{1,2,3\}$ . In each of the $3$ (Gaussian, Bernoulli, or Poisson LBM) $\times 10$ (for the setting of $B$ ) $\times 10$ (for the setting of matrix size) settings, we generated $1000$ observed matrices and applied the proposed sequential goodness-of-fit test until the null hypothesis $(K,H)=(K_{0},H_{0})$ was accepted. For each observed matrix and each hypothetical set of cluster numbers $(K_{0},H_{0})$ , we estimated the block structure using Ward’s hierarchical clustering algorithm [49], computed the test statistic $T$ , and performed the proposed test for the given cluster numbers $(K_{0},H_{0})$ at a significance level of $\alpha=0.01$ . Figures 12, 12, and 12, respectively, show examples of the generated observed matrices of the Gaussian, Bernoulli, and Poisson LBMs.
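The control flow of such a sequential test can be sketched as follows. Two points are assumptions made only for illustration: the order in which the hypotheses $(K_{0},H_{0})$ are visited (here, increasing $K_{0}+H_{0}$ ), which should be taken from the definition of the sequential test, and the callable `statistic`, a hypothetical stand-in for clustering the matrix with $(K_{0},H_{0})$ clusters and computing $T$ . The threshold is the approximate value $t(0.01)\approx 2.02345$ quoted in Section 5.1.

```python
def hypothesis_order(max_total):
    """Yield (K0, H0) pairs by increasing K0 + H0 (an assumed ordering)."""
    for total in range(2, max_total + 1):
        for K0 in range(1, total):
            yield K0, total - K0

def sequential_test(statistic, threshold=2.02345, max_total=20):
    """Return the first (K0, H0) whose statistic falls below the threshold.

    The null hypothesis is rejected when T >= threshold and accepted otherwise.
    Returns None if no hypothesis is accepted within the search range.
    """
    for K0, H0 in hypothesis_order(max_total):
        if statistic(K0, H0) < threshold:
            return K0, H0
    return None

# Toy stand-in for the real statistic: large unless K0 >= 4 and H0 >= 3.
def toy_statistic(K0, H0):
    return -1.0 if (K0 >= 4 and H0 >= 3) else 100.0

selected = sequential_test(toy_statistic)
print(selected)
```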

Figure 13 shows the accuracy of the proposed test under $10$ different settings of block-wise mean $B$ . From Figure 13, we see that the test accuracy increases with matrix size $n$ for a fixed block-wise mean $B$ , and that it decreases as the differences between the block-wise means become smaller for a fixed matrix size $n$ .

Comparison to the integrated completed likelihood (ICL)

We also checked the difference in the behavior of the proposed test and the ICL. For the Bernoulli LBM, we can compute the asymptotic ICL [26] by assuming the following model:

[TABLE]

where $p(\cdot)$ represents a probability density, and $a_{\mathrm{D}}$ and $b_{\mathrm{B}}$ are the hyperparameters.

From Lemma 4.2 in [26], for an estimated block structure $(\hat{g}^{(1)},\hat{g}^{(2)})$ , the resulting asymptotic ICL is given by

[TABLE]

The proof of (5.3) is given in Appendix D.

To check the accuracy of the proposed test and the ICL, we generated synthetic binary data matrices based on the Bernoulli distribution as in Section 5.3, and checked the ratio of trials where the selected set of cluster numbers $(K_{0},H_{0})$ is equal to the null one $(K,H)$ . We set the null set of cluster numbers at $(K,H)=(4,3)$ , and tried the following five settings with respect to the block-wise mean $B$ .

[TABLE]

With respect to the matrix size, we tried the following five settings for each setting of $B$ : $(n,p)=(40\times i,30\times i)$ , $i=1,\dots,5$ . The null block of each element was chosen in the same way as in Section 5.3. In each of the $5$ (for the setting of $B$ ) $\times 5$ (for the setting of matrix size) settings, we generated $100$ observed matrices, and applied both the proposed test at a significance level of $\alpha=0.01$ and the model selection based on the ICL. Unlike the proposed sequential test, which stopped once the null hypothesis was accepted, the ICL was computed for all the sets of cluster numbers from $(1,1)$ to $(n,p)$ , and the setting that achieved the maximum ICL was selected. For each setting, we estimated the block structure of an observed matrix using Ward’s hierarchical clustering algorithm [49].

Figure 14 shows the accuracy of the proposed test and the model selection based on the ICL. Although the purpose of the proposed test is not to achieve high accuracy in model selection, in some cases with small differences between the block-wise means $\{B_{kh}\}$ , it achieved better performance than the ICL. With larger differences between $\{B_{kh}\}$ , the ICL performed better than the proposed test in terms of model selection. Figures 15 and 16, respectively, show the ratios of the trials where each set of cluster numbers was selected by the proposed test and the ICL. From Figures 15 and 16, we see that in most cases (e.g., $B_{kh}\in[0.26,0.74]$ for all $(k,h)$ ), the ICL tended to select smaller sets of cluster numbers than the proposed test.

5.4 Real data analysis: Congressional Voting Records Data Set

We also applied the proposed test to the 1984 United States Congressional Voting Records Database from the UCI Machine Learning Repository [14]. The original data set contains three types of votes (“yea,” “nay,” and unknown) for each pair of a congressman and an attribute. We treated unknown as “nay,” as in [51]. The numbers of instances (congressmen) and attributes are $435$ and $16$ , respectively. Based on this data set, we defined a binary matrix $A\in\mathbb{R}^{435\times 16}$ , where the elements of one and zero, respectively, correspond to “yea” and “nay.”
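The binarization step can be sketched as follows. The raw encoding is an assumption made for illustration: votes are taken to be stored as the strings `"y"`, `"n"`, and `"?"` (unknown), and the three sample rows are made up rather than taken from the data set. The helper `votes_to_binary` is hypothetical.

```python
import numpy as np

def votes_to_binary(rows):
    """Map "yea" to 1 and both "nay" and unknown to 0.

    Assumes votes are encoded as "y", "n", and "?" (an assumption about the raw file).
    """
    return np.array([[1.0 if vote == "y" else 0.0 for vote in row] for row in rows])

# Made-up sample rows with four attributes each (the real data has 16 attributes).
sample_rows = [
    ["y", "n", "?", "y"],
    ["n", "n", "y", "?"],
    ["?", "y", "y", "y"],
]
A_votes = votes_to_binary(sample_rows)
print(A_votes)
```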

As in Section 5.3, we applied the proposed sequential tests at a significance level of $\alpha=0.01$ until the null hypothesis was accepted. We also computed the ICL for each hypothetical set of cluster numbers $(K_{0},H_{0})$ , and selected the one with the largest ICL. For each hypothetical set of cluster numbers $(K_{0},H_{0})$ , we estimated the block structure using Ward’s hierarchical clustering algorithm [49].

As a result, the sets of cluster numbers $(9,14)$ and $(3,13)$ were selected by the proposed test and the ICL, respectively. Figure 17 shows the observed data matrix and its estimated block structures with the selected sets of cluster numbers. From Figure 17, we see that a finer block structure was accepted by the proposed test than by the ICL, particularly for the row (i.e., congressman) cluster assignments. As for the column (i.e., attribute) cluster assignments, “anti-satellite-test-ban,” “aid-to-nicaraguan-contras,” and “mx-missile” were assigned to the same cluster in the block structure selected by the ICL, whereas the proposed test distinguished the first two attributes from the last one. Figure 18 shows the $p$ -value of the proposed test and the ICL for each hypothetical set of cluster numbers $(K_{0},H_{0})$ until the null hypothesis was accepted.

6 Discussion

In this section, we discuss the proposed test method in terms of the test statistic and the conditions for the generative model.

With respect to the asymptotic behavior, the proposed test has a favorable property in terms of the power. From Theorem 5.1, under the alternative hypothesis, the test statistic $T$ increases in proportion to $m^{\frac{5}{3}}$ with high probability, where $n,p\propto m$ . In other words, the probability that the test makes a type II error (i.e., $T<t(\alpha)$ ) converges to zero in the limit of $p\to\infty$ . Based on this fact, in the asymptotic sense, we do not need to consider a correction for multiple comparisons when applying the proposed sequential testing. However, it has not been shown what occurs in the non-asymptotic setting. In general, practical data matrices have finite sizes, for which no theoretical guarantee like Theorems 4.1, 4.2, and 4.3 has been shown. On the other hand, for a Gaussian case (i.e., each entry of a matrix independently follows $\mathcal{N}(0,1)$ ), the following statement holds [32]: Suppose $n=n(p)>p$ and $n/p\to\gamma\in[1,\infty)$ in the limit of $p\to\infty$ . Then, for any $s_{0}$ , there exists $N_{0}\in\mathbb{N}$ such that when $\max(n,p)\geq N_{0}$ and $\max(n,p)$ is even, for all $s\geq s_{0}$ ,

[TABLE]

where $T^{*}$ is defined as in (5) and $C(\cdot)$ is a continuous and non-increasing function. From the above inequality (51), if the clustering algorithm outputs the correct block assignments, the convergence rate of the normalized maximum eigenvalue $T^{*}$ of matrix $\tilde{Z}^{\top}\tilde{Z}$ (where $\tilde{Z}$ is defined as in (4)) to the Tracy-Widom distribution with index $1$ is $O(m^{-2/3})$ . However, since the distribution of $T$ is unknown in the case where the correct block assignment is not obtained, the convergence rate of $T$ is also unknown. Deriving the convergence rate of $T$ by considering the above discussion is a future research topic.

In regard to the conditions for using the proposed test method, our proposed test is applicable to a wide range of practical distributional settings (e.g., Bernoulli distribution for binary data matrices and Poisson distribution for sparse ones). Nevertheless, it still requires some assumptions on the latent block structure of an observed matrix. For instance, the row and column cluster numbers $(K,H)$ should be constants that do not increase with the matrix sizes $n$ and $p$ . Also, no block should be too small (i.e., $n_{\mathrm{min}}=\Omega_{p}(m)$ and $p_{\mathrm{min}}=\Omega_{p}(m)$ ). In some practical cases, where it is more appropriate to assume that the number of blocks increases with the matrix size, it will be useful to construct a test which does not require the above conditions. As for the sub-exponential condition, Ding and Yang [13] have shown a more relaxed sufficient condition for the scaled maximum eigenvalue $T^{*}$ to converge in law to the $TW_{1}$ distribution. However, the delocalization property of an eigenvector of matrix $Z^{\top}Z$ [6], which we used in Appendix B to prove our main result, has not been derived under this relaxed condition in the form available for the sub-exponential case [6]. If (74) in Appendix B is shown in the above more general case, it would also be possible to extend our proposed test to such a case. Furthermore, variants of latent block models have been proposed that assume block structures different from a regular grid [43, 35]. Constructing test methods for these settings is an important topic for future research.

7 Conclusion

Latent block models are effective tools for biclustering, where rows and columns of an observed matrix are simultaneously decomposed into clusters. Such a bicluster structure appears in various types of relational data, such as customer-product transaction data and document-word relationship data. One open problem in using latent block models is that there has been no statistical test method for determining the number of blocks. In this study, we developed a goodness-of-fit test for latent block models based on a result from the random matrix theory. By defining the test statistic $T$ based on the estimators of the block-wise means and standard deviations, we have derived its asymptotic behavior in both realizable (i.e., $(K,H)=(K_{0},H_{0})$ ) and unrealizable (i.e., $K>K_{0}$ or $H>H_{0}$ ) cases. In particular, it has been shown that the test statistic $T$ converges in law to the Tracy-Widom distribution with index $1$ in the realizable case. Based on these results, it became possible to test whether a given observed matrix has $K_{0}\times H_{0}$ latent blocks or more. In the experiments, we showed the validity of the proposed test method in terms of both the asymptotic behavior of the test statistic and the test accuracy by using synthetic data matrices with ground truth block structures.

Acknowledgments

TS was partially supported by JSPS KAKENHI (18K19793, 18H03201, and 20H00576), Japan Digital Design, and JST CREST.

Appendix A Proof of $\left|\tilde{S}_{kh}-S_{kh}\right|=O_{p}\left(\frac{1}{m}\right)$ .

Let $n_{k}$ and $p_{h}$ , respectively, be the row and column sizes of the $(k,h)$ th null block, and $A^{(k,h)}$ , $P^{(k,h)}$ and $\tilde{P}^{(k,h)}$ , respectively, be the $(k,h)$ th null blocks of matrices $A$ , $P$ and $\tilde{P}$ . Here, we prove the following lemma:

Lemma A1.

Under the assumption that the fourth moment of the noise $Z_{ij}$ is bounded ( $\mathbb{E}[Z_{ij}^{4}]<\infty$ ),

[TABLE]

where $\tilde{S}_{kh}=\sqrt{\frac{1}{n_{k}p_{h}}\sum_{i=1}^{n_{k}}\sum_{j=1}^{p_{h}}\left(A^{(k,h)}_{ij}-\tilde{B}_{kh}\right)^{2}}$ .

Proof.

From the above definition of $\tilde{S}_{kh}$ , we have

[TABLE]

To derive the second equation, we used the fact that $\tilde{B}_{kh}=\frac{1}{n_{k}p_{h}}\sum_{i=1}^{n_{k}}\sum_{j=1}^{p_{h}}A^{(k,h)}_{ij}$ . Therefore, the following inequality holds:

[TABLE]

The first term in (54) is given by

[TABLE]

where we define $Y^{(k,h)}_{ij}\equiv\left(A^{(k,h)}_{ij}-B_{kh}\right)^{2}-S_{kh}^{2}$ . Note that the variables $(Y^{(k,h)}_{ij})_{1\leq i\leq n_{k},1\leq j\leq p_{h}}$ are mutually independent. The expectation and the variance of $Y^{(k,h)}_{ij}$ are given by

[TABLE]

From (A), we have

[TABLE]

From (A), (A), and Chebyshev’s inequality, for all $t>0$ ,

[TABLE]

Therefore, we have

[TABLE]

On the other hand, the second term in (54) is given by

[TABLE]

From (A), we have

[TABLE]

where $X_{kh}\equiv\sum_{i=1}^{n_{k}}\sum_{j=1}^{p_{h}}Z^{(k,h)}_{ij}$ . Here, $\mathbb{E}[X_{kh}]=0$ , and $\mathbb{E}\left[X_{kh}^{2}\right]=\mathbb{V}[X_{kh}]=n_{k}p_{h}$ . By substituting this into (61), we obtain

[TABLE]

From Markov’s inequality and (62),

[TABLE]

Therefore, we have

[TABLE]

By combining (54), (59), and (64),

[TABLE]

The difference between $\tilde{S}_{kh}$ and $S_{kh}$ is given by

[TABLE]

Here, from (65), $m\left|\tilde{S}_{kh}^{2}-S_{kh}^{2}\right|$ is bounded in probability. Therefore, $\tilde{S}_{kh}$ converges in probability to $S_{kh}$ . By combining this fact with (65) and (66), we finally obtain

[TABLE]

which concludes the proof. ∎
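The rate in Lemma A1 can be illustrated by simulation. The sketch below is an illustration under assumptions, not part of the proof: it uses a single Gaussian block with $n_{k}=p_{h}=m$ and placeholder values for $B_{kh}$ and $S_{kh}$ , and reports $m\left|\tilde{S}_{kh}-S_{kh}\right|$ averaged over repeated draws, which should stay bounded as $m$ grows. The helper `block_std_error` is hypothetical.

```python
import numpy as np

def block_std_error(m, B_kh=1.0, S_kh=2.0, rng=None):
    """|S_tilde - S| for one m x m Gaussian block with mean B_kh and std S_kh."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.normal(loc=B_kh, scale=S_kh, size=(m, m))
    B_tilde = A.mean()                                # block-wise sample mean
    S_tilde = np.sqrt(np.mean((A - B_tilde) ** 2))    # block-wise sample std
    return abs(S_tilde - S_kh)

rng = np.random.default_rng(0)
scaled_errors = {}
for m in (50, 100, 200, 400):
    errors = [block_std_error(m, rng=rng) for _ in range(20)]
    scaled_errors[m] = m * float(np.mean(errors))     # bounded if the error is O_p(1/m)
    print(m, scaled_errors[m])
```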

Appendix B Proof of $\|\hat{Z}\|_{\mathrm{op}}^{2}=\|Z\|_{\mathrm{op}}^{2}+O_{p}\left(m^{\frac{2}{7}}\right)$ in realizable case

We first derive the relationship between the maximum eigenvalues of matrices $Z^{\top}Z$ and $\tilde{Z}^{\top}\tilde{Z}$ in Lemmas B1 and B2.

Lemma B1.

Let $\lambda_{1}$ and $\tilde{\lambda}_{1}$ , respectively, be the maximum eigenvalues of matrices $Z^{\top}Z$ and $\tilde{Z}^{\top}\tilde{Z}$ (i.e., $\|Z\|_{\mathrm{op}}^{2}$ and $\|\tilde{Z}\|_{\mathrm{op}}^{2}$ , respectively). Then, for all $\epsilon\in\left(0,\frac{1}{2}\right)$ , the following equation holds:

[TABLE]

Proof.

Let $\bm{v}$ and $\tilde{\bm{v}}$ , respectively, be the normalized eigenvectors of $Z^{\top}Z$ and $\tilde{Z}^{\top}\tilde{Z}$ , corresponding to the maximum eigenvalues $\lambda_{1}$ and $\tilde{\lambda}_{1}$ :

[TABLE]

Since $\sqrt{\tilde{\lambda}_{1}}$ is the largest singular value of matrix $\tilde{Z}$ , we have

[TABLE]

We also define the following matrix $Q^{(k,h)}$ for each $(k,h)$ th block:

[TABLE]

where $n_{k}$ and $p_{h}$ , respectively, are the row and column sizes of the $(k,h)$ th null block.

Let $\underline{Z}^{(k,h)}$ , $\underline{\tilde{Z}}^{(k,h)}$ , and $\underline{Q}^{(k,h)}$ , respectively, be $n\times p$ matrices whose $(k,h)$ th null blocks are $Z^{(k,h)}$ , $\tilde{Z}^{(k,h)}$ and $Q^{(k,h)}$ and all of whose other entries are zero. As shown in Figure 19, we define matrix $Q$ as $Q\equiv\sum_{k=1}^{K}\sum_{h=1}^{H}\underline{Q}^{(k,h)}$ .

From (70), we have

[TABLE]

where $\lambda_{1}^{(k,h)}$ is the maximum eigenvalue of matrix $\left(Z^{(k,h)}\right)^{\top}Z^{(k,h)}$ .

From now on, we prove that $\|Q\bm{v}\|=O_{p}\left(\frac{1}{\sqrt{m}}\right)$ and $\|\underline{Q}^{(k,h)}\bm{v}\|=O_{p}\left(\frac{1}{\sqrt{m}}\right)$ . We use the following notations:

[TABLE]

where $v_{j}$ is the $j$ th entry of vector $\bm{v}$ . Note that the $(k,h)$ th block of matrix $Q$ , the $(h,h^{\prime})$ th block of matrix $Q^{\top}Q$ , and the $h$ th block of vector $Q^{\top}Q\bm{v}$ are given by $\nu_{kh}\begin{bmatrix}1&\cdots&1\\ \vdots&&\vdots\\ 1&\cdots&1\end{bmatrix}$ , $\omega_{hh^{\prime}}\begin{bmatrix}1&\cdots&1\\ \vdots&&\vdots\\ 1&\cdots&1\end{bmatrix}$ , and $\zeta_{h}\begin{bmatrix}1\\ \vdots\\ 1\end{bmatrix}$ , respectively. Let $\bm{u}^{(h)}\in\mathbb{R}^{p}$ be a vector whose entries in the $h$ th column cluster are $\frac{1}{\sqrt{p_{h}}}$ and all of whose other entries are zero. Here, from Theorem 2.17 in [6], the $j$ th eigenvector $\bm{v}_{j}$ of matrix $Z^{\top}Z$ has a delocalization property for each $j$ , that is, for any constant vector $\bm{w}$ that satisfies $\|\bm{w}\|=1$ ,

[TABLE]

From Theorem 2.20 in [6], (74) holds uniformly in $j$ and $\bm{w}$ .

Note that this theorem holds in our case, where we assume that $n,p\propto m$ and that each entry of $Z$ is independently generated from a distribution with zero mean and unit variance that satisfies the sub-exponential condition. Therefore, we have $|\bm{v}^{\top}\bm{u}^{(h)}|=O_{p}\left(m^{-\frac{1}{2}+\epsilon}\right)$ for all $\epsilon>0$ . Since $Q^{\top}Q\bm{v}=\sum_{h=1}^{H}\zeta_{h}\sqrt{p_{h}}\bm{u}^{(h)}$ and $\nu_{kh}=O_{p}\left(\frac{1}{m}\right)$ , $\omega_{hh^{\prime}}=O_{p}\left(\frac{1}{m}\right)$ , $\zeta_{h}=\sum_{h^{\prime}=1}^{H}\omega_{hh^{\prime}}\sqrt{p_{h^{\prime}}}\bm{v}^{\top}\bm{u}^{(h^{\prime})}=O_{p}\left(m^{-1+\epsilon}\right)$ for all $\epsilon>0$ by definition, the following equation holds:

[TABLE]

Similarly, the $(h,h)$ th block of matrix $(\underline{Q}^{(k,h)})^{\top}\underline{Q}^{(k,h)}$ is $n_{k}\nu_{kh}^{2}\begin{bmatrix}1&\cdots&1\\ \vdots&&\vdots\\ 1&\cdots&1\end{bmatrix}$ , and all of its other entries are zero, which implies that $(\underline{Q}^{(k,h)})^{\top}\underline{Q}^{(k,h)}\bm{v}=n_{k}\nu_{kh}^{2}\left(\sum_{j\in J_{h}}v_{j}\right)\sqrt{p_{h}}\bm{u}^{(h)}$ . Therefore, we have

[TABLE]

Moreover, from Lemma A1, we have

[TABLE]

By substituting (B), (B), and (77) into (B) and by setting $\epsilon<\frac{1}{2}$ , we have

[TABLE]

which concludes the proof. ∎
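The delocalization property used in this proof can be illustrated numerically: for a noise matrix with independent standard entries, the inner product between the leading eigenvector $\bm{v}$ of $Z^{\top}Z$ and a fixed unit vector such as $\bm{u}^{(h)}$ is of order $m^{-\frac{1}{2}}$ up to a small power of $m$ . The sketch below assumes Gaussian noise and illustrative sizes; `top_eigvec_overlap` is a hypothetical helper, and the column cluster is taken to be the first `p_h` columns.

```python
import numpy as np

def top_eigvec_overlap(n, p, p_h, rng):
    """|v^T u| for the leading eigenvector v of Z^T Z and a block indicator u."""
    Z = rng.standard_normal((n, p))
    _eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)   # eigenvalues in ascending order
    v = eigvecs[:, -1]                            # unit eigenvector of the largest eigenvalue
    u = np.zeros(p)
    u[:p_h] = 1.0 / np.sqrt(p_h)                  # unit vector supported on one column cluster
    return abs(float(v @ u))

rng = np.random.default_rng(0)
n, p, p_h = 400, 300, 100                         # illustrative sizes
overlap = top_eigvec_overlap(n, p, p_h, rng)
print(overlap, np.sqrt(p) * overlap)              # the rescaled overlap stays of moderate size
```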

Lemma B2.

Let $\lambda_{1}$ and $\tilde{\lambda}_{1}$ , respectively, be the maximum eigenvalues of matrices $Z^{\top}Z$ and $\tilde{Z}^{\top}\tilde{Z}$ . Then, the following equation holds:

[TABLE]

Proof.

We use the same notations as in Lemma B1. Let $\{\lambda_{j}\}$ and $\{\bm{v}_{j}\}$ , respectively, be the sets of the eigenvalues and the corresponding normalized eigenvectors (i.e., $\|\bm{v}_{j}\|=1$ for all $j$ ) of matrix $Z^{\top}Z$ , where $j=1,\dots,p$ and $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{p}$ . We also define $\tau_{kh}\equiv\frac{S_{kh}}{\tilde{S}_{kh}}$ . Note that $|\tau_{kh}-1|=O_{p}\left(\frac{1}{m}\right)$ from (77). Let $\tilde{\bm{v}}^{(h)}\in\mathbb{R}^{p_{h}}$ be the subvector of $\tilde{\bm{v}}$ in the $h$ th column cluster.

Since $Z^{\top}Z$ is a symmetric matrix, its eigenvectors $\{\bm{v}_{j}\}$ form an orthonormal system, and thus there exists a unique set of coefficients $\{c_{j}\}$ that satisfies

[TABLE]

where

[TABLE]

Therefore, the following equation holds:

[TABLE]

As for the last term in (B), the following equation holds:

[TABLE]

where $\bm{u}^{(h)}\in\mathbb{R}^{p}$ is a vector whose elements in the $h$ th column cluster are $\frac{1}{\sqrt{p_{h}}}$ and all of whose other elements are zero. In the last equation in (B), we used the delocalization property of $\{\bm{v}_{j}\}$ , which are eigenvectors of matrix $Z^{\top}Z$ . By substituting (84) into (B) and using the fact that $\tau_{kh}=1+O_{p}\left(\frac{1}{m}\right)$ and the assumption that $K$ and $H$ are fixed constants, we have

[TABLE]

Here, by definition in (B), the following equation holds:

[TABLE]

The third term in (B) can be upper bounded as follows:

[TABLE]

where $\bm{\tilde{w}}^{(k)}\in\mathbb{R}^{n}$ is a vector whose elements in the $k$ th row cluster are $\frac{1}{\sqrt{n_{k}}}$ and all of whose other elements are zero. Here we used the delocalization property of $\{\bm{v}_{j}\}$ , which are eigenvectors of matrix $Z^{\top}Z$ .

The fourth term in (B) can also be upper bounded as follows:

[TABLE]

By substituting (B) and (B) into (B), we have

[TABLE]

In the last equation of (B), we used the assumption that $K$ and $H$ are fixed constants.

Let $\nu_{j}\equiv\frac{1}{n}\lambda_{j}$ be a normalized eigenvalue of matrix $Z^{\top}Z$ . Note that $t$ in (B) is the number of normalized eigenvalues $\{\nu_{j}\}$ that satisfy $\nu_{j}\geq\nu_{1}-n^{d-1}$ . We also define the following variables:

[TABLE]

From (4.1) of [39], $|\epsilon_{1}|=O_{p}\left(\phi^{C}m^{-\frac{2}{3}}\right)$ holds for some constant $C>0$ , where $\phi\equiv(\log p)^{\log\log p}$ . Since $\phi=o(m^{\tilde{\epsilon}_{0}})$ holds for any $\tilde{\epsilon}_{0}>0$ , we have $|\epsilon_{1}|=O_{p}\left(m^{-\frac{2}{3}+\epsilon_{0}}\right)$ for any $\epsilon_{0}>0$ .

Since $\nu_{j}$ follows the Marcenko–Pastur distribution, whose probability density function is given by

[TABLE]

by setting $\epsilon_{0}<d-\frac{1}{3}$ , we have

[TABLE]

From (B) and the fact that $|\epsilon_{1}|=O_{p}\left(m^{-\frac{2}{3}+\epsilon_{0}}\right)$ for any $\epsilon_{0}>0$ , by setting $\epsilon_{0}<d-\frac{1}{3}$ , the following equation holds:

[TABLE]

From (3.7) of [39], the difference between $\bar{n}$ and $\frac{t}{p}$ is given by

[TABLE]

Therefore, by setting $\epsilon_{2}<\frac{3}{2}d-\frac{1}{2}$ , we have

[TABLE]

By assumption in (B) that $d=\frac{5}{7}$ , the following equation holds for all $\epsilon>0$ :

[TABLE]

Here, we consider the following two patterns: (a) If $n^{\frac{1}{2}}\varpi-n^{d}\|\tilde{\bm{v}}_{2}\|\leq 0$ , from (B), we have

[TABLE]

(b) If $n^{\frac{1}{2}}\varpi-n^{d}\|\tilde{\bm{v}}_{2}\|>0$ , we have $\|\tilde{\bm{v}}_{2}\|<n^{\frac{1}{2}-d}\varpi$ and thus

[TABLE]

By the assumption in (B) that $d=\frac{5}{7}$, we have $n^{1-d}\varpi^{2}=O_{p}\left(m^{\frac{2}{7}}\right)$, and thus (97) holds.

In summary, (97) always holds. By combining this fact and (85), we have

[TABLE]

By setting $\epsilon<\frac{2}{7}$ , we finally obtain

[TABLE]

which concludes the proof. ∎

Lemma B3.

Let $\lambda_{1}$ and $\hat{\lambda}_{1}$ , respectively, be the maximum eigenvalues of matrices $Z^{\top}Z$ and $\hat{Z}^{\top}\hat{Z}$ (i.e., $\|Z\|_{\mathrm{op}}^{2}$ and $\|\hat{Z}\|_{\mathrm{op}}^{2}$ , respectively). Then, for all $\epsilon\in\left(0,\frac{2}{7}\right)$ ,

[TABLE]

where $b$ is defined as in (6).

Proof.

From Lemmas B1 and B2, we have already shown that the following equation holds for all $\epsilon\in\left(0,\frac{2}{7}\right)$:

[TABLE]

We consider the joint probability of the event $\mathcal{F}_{m}$ that the clustering algorithm outputs the correct block structure (i.e., $\tilde{Z}=\hat{Z}$ ) and the event $\mathcal{G}_{m,C}$ that $\frac{|\lambda_{1}-\tilde{\lambda}_{1}|}{b}\leq Cm^{-\frac{1}{21}+\epsilon}$ holds. Such a joint probability satisfies the following inequality:

[TABLE]

where $\mathcal{A}^{\mathrm{C}}$ is the complement of event $\mathcal{A}$. The consistency assumption (v) guarantees that if $(K_{0},H_{0})=(K,H)$, $\mathrm{Pr}\left(\mathcal{F}^{\mathrm{C}}_{m}\right)$ converges to $0$ in the limit of $m\to\infty$. By combining this fact with (102), we obtain

[TABLE]

which results in (101). ∎
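The statement of Lemma B3 can be illustrated numerically. The following is a minimal sketch under several assumptions that are not taken from the paper: Gaussian noise, hypothetical block means `B` and standard deviations `S`, known (correct) cluster labels `g1` and `g2`, and plug-in estimates given by the block-wise sample mean and sample standard deviation. It compares $\|Z\|_{\mathrm{op}}^{2}$ with $\|\hat{Z}\|_{\mathrm{op}}^{2}$ for one simulated matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K, H = 300, 200, 2, 2
g1 = np.repeat(np.arange(K), n // K)          # row cluster labels (hypothetical)
g2 = np.repeat(np.arange(H), p // H)          # column cluster labels (hypothetical)
B = np.array([[0.0, 2.0], [1.0, -1.0]])       # block means (illustrative values)
S = np.array([[1.0, 0.5], [2.0, 1.5]])        # block standard deviations (illustrative)
P, sigma = B[np.ix_(g1, g2)], S[np.ix_(g1, g2)]
A = P + sigma * rng.standard_normal((n, p))   # observed matrix

Z = (A - P) / sigma                            # standardized with the true parameters

# Plug-in standardization using block-wise sample statistics (an assumption).
P_hat, sigma_hat = np.empty_like(A), np.empty_like(A)
for k in range(K):
    for h in range(H):
        rows, cols = np.where(g1 == k)[0], np.where(g2 == h)[0]
        block = A[np.ix_(rows, cols)]
        P_hat[np.ix_(rows, cols)] = block.mean()
        sigma_hat[np.ix_(rows, cols)] = block.std()
Z_hat = (A - P_hat) / sigma_hat

lam = np.linalg.norm(Z, 2) ** 2                # largest eigenvalue of Z^T Z
lam_hat = np.linalg.norm(Z_hat, 2) ** 2        # largest eigenvalue of Z_hat^T Z_hat
rel_gap = abs(lam_hat - lam) / lam
print(lam, lam_hat, rel_gap)
```

In this setting both eigenvalues are of order $(\sqrt{n}+\sqrt{p})^{2}$, and the gap between them is small relative to that scale, in line with the lemma.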

Appendix C Proof of $\hat{\sigma}^{*}=O_{p}(KH)$ in unrealizable case

Proof.

Throughout the proof, we use the following notations:

•

$A^{(k,h)}$ , $P^{(k,h)}$ , and $Z^{(k,h)}$ , respectively, are the $(k,h)$ th null blocks of matrices $A$ , $P$ , and $Z$ .

•

$\underline{A}^{(k,h)}$ , $\underline{P}^{(k,h)}$ , and $\underline{\hat{P}}^{(k,h)}$ , respectively, are the $(k,h)$ th estimated blocks of matrices $A$ , $P$ , and $\hat{P}$ .

•

We denote the row and column sizes of the $(k,h)$ th estimated block as $\underline{n}_{k}$ and $\underline{p}_{h}$ , respectively.

•

$(k_{1},h_{1})$ is the set of row and column cluster indices of submatrix $\bar{X}$ in the estimated block structure.

As for the order of the estimated standard deviation $\hat{\sigma}^{*}$ , we have $\hat{\sigma}^{*}=\hat{S}_{k_{1}h_{1}}$ . Note that the block size $(\bar{n}_{1},\bar{p}_{1})$ of submatrix $\bar{X}$ is at least $(n_{\mathrm{min}}/K_{0})\times(p_{\mathrm{min}}/H_{0})$ . Therefore, we have

[TABLE]

Here, for all $(i,j)$, the variables $\left(Z^{(k,h)}_{ij}\right)^{2}$ are independent and identically distributed, with $\mathbb{E}\left[\left(Z^{(k,h)}_{ij}\right)^{2}\right]=1$. We also have $\mathbb{V}\left[\left(Z^{(k,h)}_{ij}\right)^{2}\right]=\mathbb{E}\left[\left(Z^{(k,h)}_{ij}\right)^{4}\right]-1<\infty$, since $\mathbb{E}\left[\left(Z^{(k,h)}_{ij}\right)^{4}\right]<\infty$ follows from the sub-exponential assumption. Therefore, from the central limit theorem and Prokhorov’s theorem [48], we have $\frac{1}{\sqrt{n_{k}p_{h}}}\sum_{i=1}^{n_{k}}\sum_{j=1}^{p_{h}}\left[\left(Z^{(k,h)}_{ij}\right)^{2}-1\right]=O_{p}(1)$. In other words, the following equation holds: $\sum_{i=1}^{n_{k}}\sum_{j=1}^{p_{h}}\left(Z^{(k,h)}_{ij}\right)^{2}=n_{k}p_{h}+O_{p}(m)=O_{p}(m^{2})$. Based on this result, we obtain

[TABLE]
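The central limit step used above, namely that the centered sum of squares is $O_{p}(1)$ after division by $\sqrt{n_{k}p_{h}}$, can be checked on simulated data. This is a minimal sketch assuming standard Gaussian entries as a stand-in for the block $Z^{(k,h)}$; the block sizes and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_k, p_h = 300, 200                             # illustrative block sizes
Zb = rng.standard_normal((n_k, p_h))            # stand-in for the block Z^{(k,h)}

total = np.sum(Zb ** 2)                         # sum of squared entries
normalized = (total - n_k * p_h) / np.sqrt(n_k * p_h)
print(total / (n_k * p_h), normalized)
```

For Gaussian entries the variance of each squared entry is $2$, so `normalized` is approximately distributed as a centered normal variable with variance $2$, while `total` itself is of order $n_{k}p_{h}$.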

Furthermore, we have

[TABLE]

Here, to derive the last inequality in (C), we used the assumption that (4) holds for the block with the maximum difference between $\bar{P}$ and $\hat{P}$ . In the final equation, we used the fact that $\max_{\begin{subarray}{c}k=1,\dots,K,h=1,\dots,H,\\ k^{\prime}=1,\dots,K,h^{\prime}=1,\dots,H\end{subarray}}|B_{kh}-B_{k^{\prime}h^{\prime}}|$ is bounded by a finite constant.

By combining (C), (C), and (C), we obtain $\hat{\sigma}^{*}=O_{p}(KH)$ . ∎

Appendix D Proof of the asymptotic ICL in the Bernoulli case

Proof.

From Lemma 4.2 in [26], the resulting asymptotic ICL is given by

[TABLE]

In regard to the first term in (D), we consider the following optimization problem:

[TABLE]

The above problem can be solved by the method of Lagrange multipliers, using the Lagrangian

[TABLE]

By substituting (D) into (111), we have

[TABLE]

for all $(k,h)$ . In regard to $\{\pi_{k}\}$ and $\{\rho_{h}\}$ , from the conditions in (D), $\sum_{k}|I_{k}|=\xi_{1}$ and $\sum_{h}|J_{h}|=\xi_{2}$ hold and thus we finally have

[TABLE]

We can easily check that the solutions of (D) and (113) satisfy all the conditions in (D).

Finally, by substituting the above results into (D), we have

[TABLE]

Note that we have defined $\hat{B}_{kh}$ as in (4). ∎
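The Lagrange multiplier argument for the mixing proportions can be checked numerically. The sketch below assumes that the part of the objective depending on $\{\pi_{k}\}$ has the usual form $\sum_{k}|I_{k}|\log\pi_{k}$ with the constraint $\sum_{k}\pi_{k}=1$, in which case the stationary point is $\pi_{k}=|I_{k}|/\sum_{k^{\prime}}|I_{k^{\prime}}|$. The cluster sizes `counts` and the function name `loglik_term` are hypothetical.

```python
import numpy as np

def loglik_term(counts, pi):
    """Sum_k |I_k| log(pi_k): the assumed mixing-proportion part of the objective."""
    return float(np.sum(counts * np.log(pi)))

counts = np.array([50.0, 30.0, 20.0])           # hypothetical cluster sizes |I_k|
pi_star = counts / counts.sum()                 # closed-form stationary point

# Compare against random feasible points on the probability simplex.
rng = np.random.default_rng(3)
candidates = rng.dirichlet(np.ones(len(counts)), size=1000)
best_random = max(loglik_term(counts, c) for c in candidates)
print(loglik_term(counts, pi_star), best_random)
```

No random feasible point exceeds the closed-form value, as expected from Gibbs' inequality. The same argument applies to $\{\rho_{h}\}$ with $|J_{h}|$ in place of $|I_{k}|$.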

Bibliography

[1] B. P. W. Ames. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, 147(1):429–465, 2014.
[2] P. Arabie, S. A. Boorman, and P. R. Levitt. Constructing blockmodels: How and why. Journal of Mathematical Psychology, 17(1):21–63, 1978.
[3] Z. D. Bai and Y. Q. Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3):1275–1294, 1993.
[4] Z. Bao, G. Pan, and W. Zhou. Universality for the largest eigenvalue of sample covariance matrices with general population. The Annals of Statistics, 43(1):382–421, 2015.
[5] P. J. Bickel and P. Sarkar. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):253–273, 2016.
[6] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin. On the principal components of sample covariance matrices. Probability Theory and Related Fields, 164:459–552, 2016.
[7] V. Brault and A. Channarond. Fast and consistent algorithm for the latent block model. arXiv:1610.09005, 2016.
[8] K. Chen and J. Lei. Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association, 113(521):241–251, 2018.
