A Note on Exploratory Item Factor Analysis by Singular Value Decomposition

Haoran Zhang, Yunxiao Chen, and Xiaoou Li

arXiv:1907.08713·stat.ME·January 8, 2025

TL;DR

This paper revisits a singular value decomposition algorithm for exploratory item factor analysis, establishing its statistical consistency, computational advantages, and practical utility in determining the number of factors.

Contribution

It provides the statistical foundation and asymptotic theory for an SVD-based exploratory IFA algorithm, highlighting its advantages over existing methods.

Findings

1. Algorithm guarantees a unique solution.

2. Demonstrates statistical consistency under double asymptotics.

3. Shows good finite sample performance in simulations.

Equations

$$\Pr(Y_{ij}=1\mid\bm{\theta}_{i})=f(d_{j}+\mathbf{a}_{j}^{\top}\bm{\theta}_{i}),$$

$$\hat{x}_{ij}=\begin{cases}\epsilon_{N,J},&\text{if }x_{ij}<\epsilon_{N,J},\\ x_{ij},&\text{if }\epsilon_{N,J}\leq x_{ij}\leq 1-\epsilon_{N,J},\\ 1-\epsilon_{N,J},&\text{if }x_{ij}>1-\epsilon_{N,J}.\end{cases}$$

$$L_{N,J}(A^{*},\hat{A})=\min_{O\in\mathbb{R}^{K\times K}}\left\{\frac{\|A^{*}-\hat{A}O\|_{F}^{2}}{JK}\right\},$$

$$\lim_{x\to-\infty}f(x)=0,\quad\text{and}\quad\lim_{x\to\infty}f(x)=1.$$

$$\Pr(\|\bm{\theta}_{1}^{*}\|\geq h(2\epsilon_{N,J})/C)=o(N^{-1}),$$

$$\frac{(h(2\epsilon_{N,J}))^{\frac{K+1}{K+3}}}{(\epsilon_{N,J}\,g(\epsilon_{N,J}))^{2}}=o(J^{\frac{1}{K+3}}),$$

$h(y)$

$g(y)$

$$\Pr(\|\bm{\theta}_{1}^{*}\|\geq C_{0})=0,$$

$$0<\epsilon\leq\frac{1}{2}\min\left\{1-f\left(C\sqrt{C_{0}^{2}+1}\right),\,f\left(-C\sqrt{C_{0}^{2}+1}\right),\,\frac{2}{5}\right\},$$

$$J\leq N\leq J^{\beta}.$$

$$\epsilon_{N,J}=\gamma_{0}J^{-\gamma_{1}},$$

$$\lim_{N,J\to\infty}\Pr\left(\frac{\hat{\sigma}_{K}}{\sqrt{NJ}}>\delta\right)=1,\quad\text{and}\quad\frac{\hat{\sigma}_{K+1}}{\sqrt{NJ}}\overset{pr}{\to}0,$$

$$\frac{1}{NJ}\|X^{*}-\hat{X}\|_{F}^{2}=O_{p}(J^{-\frac{1}{K+2}}).$$

$$X=(x_{ij})_{N\times J}=\frac{1}{\hat{p}}\sum_{k=1}^{\tilde{K}}\sigma_{k}\mathbf{u}_{k}\mathbf{v}_{k}^{\top},$$

$$\Pr(Y_{ij}\geq t\mid\bm{\theta}_{i})=f(d_{jt}+\mathbf{a}_{j}^{\top}\bm{\theta}_{i}),$$

$$Q_{4}=\Big\{(q_{1},...,q_{4})^{\top}:q_{k}\in\{0,1\},\ \sum_{k=1}^{4}q_{k}\geq 1,\ \text{and}\ \sum_{k=1}^{4}q_{k}\leq 3\Big\},$$

$$Q_{8}=\Big\{(q_{1},...,q_{8})^{\top}:q_{k}\in\{0,1\},\ \sum_{k=1}^{8}q_{k}\geq 1,\ \text{and}\ \sum_{k=1}^{8}q_{k}\leq 3\Big\}.$$

$$X^{*}:=(x_{ij}^{*})_{N\times J}=f(\Theta^{*}(A^{*})^{\top}+\mathbf{1}_{N}(\mathbf{d}^{*})^{\top})$$

$$\tilde{x}_{ij}=\begin{cases}0,&\text{if }x_{ij}<0,\\ x_{ij},&\text{if }0\leq x_{ij}\leq 1,\\ 1,&\text{if }x_{ij}>1,\end{cases}$$

$$|\hat{\sigma}_{k}-\sigma_{k}^{*}|\leq\|\hat{M}-\Theta^{*}(A^{*})^{\top}\|_{2}\leq\|\hat{M}-\Theta^{*}(A^{*})^{\top}\|_{F}.$$

$$\frac{1}{\sqrt{NJ}}\|\hat{M}-\Theta^{*}(A^{*})^{\top}\|_{F}\overset{pr}{\to}0.$$

$$\frac{|\hat{\sigma}_{k}-\sigma_{k}^{*}|}{\sqrt{NJ}}\overset{pr}{\to}0.$$

$$\frac{\hat{\sigma}_{K+1}}{\sqrt{NJ}}\overset{pr}{\to}0.$$

$$\Pr\left(\frac{|\hat{\sigma}_{K}-\sigma_{K}^{*}|}{\sqrt{NJ}}\leq\tilde{\epsilon}\right)\to 1$$

$$\Pr\left(\frac{\hat{\sigma}_{K}}{\sqrt{NJ}}\geq\frac{1}{\sqrt{NJ}}\sigma_{K}^{*}-\tilde{\epsilon}\right)\to 1.$$

$$\frac{1}{\sqrt{NJ}}\sigma_{K}^{*}\geq\frac{1}{\sqrt{N}}\sigma_{K}(\Theta^{*})\cdot\frac{1}{\sqrt{J}}\sigma_{K}(A^{*})\geq C_{1}\frac{1}{\sqrt{N}}\sigma_{K}(\Theta^{*}).$$

$$\frac{1}{\sqrt{N}}\sigma_{K}(\Theta^{*})=\sqrt{\lambda_{K}(\hat{\Sigma})}.$$

$$\|\hat{\Sigma}-\Sigma^{*}\|_{2}\overset{pr}{\to}0$$


Full text

Abstract

We revisit a singular value decomposition (SVD) algorithm given in Chen et al. (2019b) for exploratory Item Factor Analysis (IFA). This algorithm estimates a multidimensional IFA model by SVD and was used to obtain a starting point for joint maximum likelihood estimation in Chen et al. (2019b). Thanks to the analytic and computational properties of SVD, this algorithm guarantees a unique solution and has computational advantage over other exploratory IFA methods. Its computational advantage becomes significant when the numbers of respondents, items, and factors are all large. This algorithm can be viewed as a generalization of principal component analysis (PCA) to binary data. In this note, we provide the statistical underpinning of the algorithm. In particular, we show its statistical consistency under the same double asymptotic setting as in Chen et al. (2019b). We also demonstrate how this algorithm provides a scree plot for investigating the number of factors and provide its asymptotic theory. Further extensions of the algorithm are discussed. Finally, simulation studies suggest that the algorithm has good finite sample performance.

KEY WORDS: Exploratory item factor analysis, IFA, singular value decomposition, double asymptotics, generalized PCA for binary data

1 Background

Exploratory IFA (Bock et al., 1988) has been widely used for analyzing item-level data in social and behavioral sciences (Bartholomew et al., 2008). We consider a standard exploratory IFA setting for binary item response data. Let $Y_{ij}\in\{0,1\}$ be a random variable denoting individual $i$'s response to item $j$, where $i=1,...,N$ and $j=1,...,J$. Moreover, IFA assumes that individual $i$'s responses are driven by $K$ latent factors, denoted by $\bm{\theta}_{i}=(\theta_{i1},...,\theta_{iK})^{\top}$. We consider a general family of multidimensional IFA models (Reckase, 2009), which assumes that

$$\Pr(Y_{ij}=1\mid\bm{\theta}_{i})=f(d_{j}+\mathbf{a}_{j}^{\top}\bm{\theta}_{i}),\tag{1}$$

where $\mathbf{a}_{j}=(a_{j1},...,a_{jK})^{\top}$ is the vector of loading parameters, $d_{j}$ is an intercept parameter, and $f:\mathbb{R}\mapsto(0,1)$ is a pre-specified monotone increasing function which guarantees (1) to be a valid probability. Using the terminology from generalized linear models, $f$ is called the inverse link function. Note that (1) includes the widely used multidimensional two-parameter logistic (M2PL) model and the multidimensional normal ogive model as special cases, for which $f(x)=\exp(x)/(1+\exp(x))$ and $f(x)=\int_{-\infty}^{x}\exp(-t^{2}/2)/\sqrt{2\pi}\,dt$, respectively. Moreover, we assume local independence; that is, $Y_{i1},...,Y_{iJ}$ are conditionally independent given $\bm{\theta}_{i}$. Finally, $\bm{\theta}_{i}$, $i=1,...,N$, are independent and identically distributed, following an unknown distribution $F$.
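As an illustration of model (1), the following minimal sketch draws binary responses given person parameters, loadings, and intercepts. The function names (`logistic`, `normal_ogive`, `simulate_responses`) are hypothetical and chosen here for illustration; they do not come from the paper.

```python
import numpy as np
from math import erf, sqrt


def logistic(x):
    """Inverse link of the M2PL model: exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


def normal_ogive(x):
    """Inverse link of the normal ogive model: the standard normal CDF."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(x, dtype=float) / sqrt(2.0)))


def simulate_responses(theta, A, d, f=logistic, rng=None):
    """Draw Y_ij ~ Bernoulli(f(d_j + a_j^T theta_i)) independently (local independence).

    theta: (N, K) person parameters; A: (J, K) loadings; d: (J,) intercepts.
    Returns the (N, J) binary response matrix and the (N, J) probability matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    prob = f(theta @ A.T + d)  # entry (i, j) equals f(d_j + a_j^T theta_i)
    Y = (rng.random(prob.shape) < prob).astype(int)
    return Y, prob
```

Passing `f=normal_ogive` switches the sketch from the M2PL model to the normal ogive model without changing anything else.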

A major focus of exploratory IFA is to estimate the loading matrix $A=(a_{jk})_{J\times K}$, which helps to understand the latent structure underlying the set of items. It is worth noting that the loading matrix can only be recovered up to an oblique rotation (Browne, 2001; see footnote 1). That is, model (1) will remain unchanged, with a rotated loading vector $\tilde{\mathbf{a}}_{j}=O^{\top}\mathbf{a}_{j}$ and $\tilde{\bm{\theta}}_{i}=O^{-1}\bm{\theta}_{i}$, where $O$ is a $K\times K$ invertible matrix that is also known as an oblique rotation. Recognizing the rotational indeterminacy issue, exploratory IFA typically proceeds in two steps. In the first step, an estimate $\hat{A}$ is obtained, using an arbitrary way to fix the rotation. Then in the second step, analytic rotational methods are applied to $\hat{A}$ to obtain a sparser loading matrix for better interpretability. An analytic rotation finds a rotation matrix $O$ such that $\hat{A}O$ minimizes a certain "complexity function", where a lower value of the complexity function indicates more sparsity in the loading matrix (see Browne, 2001, for a review of analytic rotations). It implicitly assumes that the true loading matrix has a sparse pattern; i.e., each item is only directly associated with a small number of factors.

Footnote 1. We only discuss oblique rotation here, as our general exploratory IFA model does not require the factors to be uncorrelated. If the factors are further required to be uncorrelated, then the loading matrix can be recovered up to an orthogonal rotation, for which the rotation matrix $O$ is an orthogonal matrix (e.g., Kaiser, 1958).
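The rotational indeterminacy can be checked numerically. The sketch below uses arbitrary illustrative values (the matrix `O` and the dimensions are assumptions, not taken from the paper) and verifies that rotating the loadings by $O^{\top}$ and the factors by $O^{-1}$ leaves every linear predictor $\mathbf{a}_{j}^{\top}\bm{\theta}_{i}$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
J, N, K = 6, 5, 2
A = rng.normal(size=(J, K))       # rows are a_j^T
Theta = rng.normal(size=(N, K))   # rows are theta_i^T

# An invertible but non-orthogonal matrix, i.e. an oblique rotation.
O = np.array([[1.0, 0.5],
              [0.0, 2.0]])

A_rot = A @ O                            # rows are (O^T a_j)^T
Theta_rot = Theta @ np.linalg.inv(O).T   # rows are (O^{-1} theta_i)^T

M = Theta @ A.T              # entries a_j^T theta_i
M_rot = Theta_rot @ A_rot.T  # same entries after rotation
print(np.allclose(M, M_rot))
```

Because the intercepts are untouched, the response probabilities in (1) are identical for the original and rotated parameters, which is why the data alone cannot distinguish $A$ from $AO$.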

In this note, we focus on the first step of exploratory IFA. In particular, we study an estimator given in Chen et al. (2019b) that is based on SVD. Compared with other estimators, this estimator is computationally much faster and does not suffer from convergence issues. It was used to obtain a starting point for a constrained joint maximum likelihood estimator (CJMLE). Simulation studies showed that the convergence of the CJMLE can be improved by using the SVD-based estimator as a starting point. Moreover, this SVD-based estimator itself is reasonably accurate when both $N$ and $J$ are large. Thus, it can be used not only as a starting point for the CJMLE, but also as a quick and high-quality solution to large-scale exploratory IFA problems. In what follows, we investigate the statistical properties of this estimator.

2 Main results

SVD-based estimator.

We restate this SVD-based algorithm below (see footnote 2).

Footnote 2. The original algorithm was described in the supplementary material of Chen et al. (2019b). The algorithm here is a slightly modified version. The major modification is in step 3 of the algorithm, which requires at least $K+1$ singular values to be retained. This modification can improve the finite-sample performance of the algorithm; see Remark 4 for further discussion. The other modifications are mainly to simplify the exposition of the algorithm.

Algorithm 1 (SVD-based estimator for exploratory IFA).

1. Input the response matrix $Y=(y_{ij})_{N\times J}$, the number of factors $K$, the inverse link function $f$, and the truncation parameter $\epsilon_{N,J}>0$.

2. Apply the singular value decomposition to $Y$ and obtain $Y=\sum_{j=1}^{J}\sigma_{j}\mathbf{u}_{j}\mathbf{v}_{j}^{\top}$, where $\sigma_{1}\geq...\geq\sigma_{J}\geq 0$ are the singular values, and the $\mathbf{u}_{j}$ and $\mathbf{v}_{j}$ are the left and right singular vectors, respectively.

3. Let $X=(x_{ij})_{N\times J}=\sum_{k=1}^{\tilde{K}}\sigma_{k}\mathbf{u}_{k}\mathbf{v}_{k}^{\top}$, where $\tilde{K}=\max\{K+1,\arg\max_{k}\{\sigma_{k}\geq 1.01\sqrt{N}\}\}$.

4. Let $\hat{X}=(\hat{x}_{ij})_{N\times J}$ be defined as

$$\hat{x}_{ij}=\begin{cases}\epsilon_{N,J},&\text{if }x_{ij}<\epsilon_{N,J},\\ x_{ij},&\text{if }\epsilon_{N,J}\leq x_{ij}\leq 1-\epsilon_{N,J},\\ 1-\epsilon_{N,J},&\text{if }x_{ij}>1-\epsilon_{N,J}.\end{cases}$$

5. Let $\tilde{M}=(\tilde{m}_{ij})_{N\times J}$, where $\tilde{m}_{ij}=f^{-1}(\hat{x}_{ij})$.

6. Let $\hat{\mathbf{d}}=(\hat{d}_{1},...,\hat{d}_{J})$, where $\hat{d}_{j}=(\sum_{i=1}^{N}\tilde{m}_{ij})/N$.

7. Apply the singular value decomposition to $\hat{M}=(\tilde{m}_{ij}-\hat{d}_{j})_{N\times J}$ to obtain $\hat{M}=\sum_{j=1}^{J}\hat{\sigma}_{j}\hat{\mathbf{u}}_{j}\hat{\mathbf{v}}_{j}^{\top}$, where $\hat{\sigma}_{1}\geq...\geq\hat{\sigma}_{J}\geq 0$ are the singular values, and the $\hat{\mathbf{u}}_{j}$ and $\hat{\mathbf{v}}_{j}$ are the left and right singular vectors, respectively.

8. Output $\hat{A}=\frac{1}{\sqrt{N}}(\hat{\sigma}_{1}\hat{\mathbf{v}}_{1},...,\hat{\sigma}_{K}\hat{\mathbf{v}}_{K})$ and $\hat{\Theta}=\sqrt{N}(\hat{\mathbf{u}}_{1},...,\hat{\mathbf{u}}_{K})$.
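The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `svd_ifa` and its signature are hypothetical, the logistic link is assumed by default, the expression $\arg\max_{k}\{\sigma_{k}\geq 1.01\sqrt{N}\}$ is read as the number of singular values at or above the threshold, and the cap of $\tilde{K}$ at the number of available singular values is a guard added here.

```python
import numpy as np


def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))


def logit(p):
    """Inverse of the logistic function, f^{-1}(p) = log(p / (1 - p))."""
    return np.log(p) - np.log1p(-p)


def svd_ifa(Y, K, f_inv=logit, eps=1e-4):
    """Sketch of Algorithm 1 for an (N, J) binary response matrix Y.

    Returns (A_hat, Theta_hat, d_hat, sigma_hat), where sigma_hat holds all
    singular values from step 7.
    """
    Y = np.asarray(Y, dtype=float)
    N, J = Y.shape

    # Step 2: SVD of the response matrix (singular values in descending order).
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)

    # Step 3: keep singular values >= 1.01 * sqrt(N), but at least K + 1 of them.
    K_tilde = max(K + 1, int(np.sum(s >= 1.01 * np.sqrt(N))))
    K_tilde = min(K_tilde, s.size)  # guard added in this sketch
    X = (U[:, :K_tilde] * s[:K_tilde]) @ Vt[:K_tilde]

    # Step 4: truncate to [eps, 1 - eps] so that f^{-1} is well defined.
    X_hat = np.clip(X, eps, 1.0 - eps)

    # Step 5: apply the inverse of the inverse link entrywise.
    M_tilde = f_inv(X_hat)

    # Step 6: intercept estimates are the column means.
    d_hat = M_tilde.mean(axis=0)

    # Step 7: SVD of the column-centered matrix.
    U2, s2, Vt2 = np.linalg.svd(M_tilde - d_hat, full_matrices=False)

    # Step 8: scaled leading singular vectors.
    A_hat = (Vt2[:K].T * s2[:K]) / np.sqrt(N)
    Theta_hat = np.sqrt(N) * U2[:, :K]
    return A_hat, Theta_hat, d_hat, s2
```

By construction the columns of `Theta_hat` are orthogonal with squared norm $N$, which is the arbitrary rotation fixed by the algorithm; an analytic rotation can then be applied to `A_hat` in the second step of exploratory IFA.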

Remark 1.

SVD is a powerful tool for the factorization of rectangular matrices that has been widely used in multivariate statistics for the dimension reduction of data (Wall et al., 2003). Thanks to the mathematical properties of SVD, the estimator given by Algorithm 1 is analytic and does not suffer from convergence issues. On the other hand, as the objective functions of the CJMLE and the marginal maximum likelihood estimator (MMLE; Bock and Aitkin, 1981) are nonconvex, there is no guarantee of finding their global optima. In addition, this SVD approach is also much faster than the other estimators, including the CJMLE and MMLE. In particular, the computation of the MMLE based on the vanilla expectation maximization algorithm is not affordable when the latent dimension $K$ is of a moderate size (e.g., $K\geq 5$). Even the stochastic algorithms for the MMLE (Cai, 2010a, b; Zhang et al., 2020) and the alternating minimization algorithm for the CJMLE (Chen et al., 2019b, c) are much slower than the SVD algorithm, as these algorithms typically need a large number of iterations to converge. A speed comparison between the SVD method and the CJMLE is provided in the simulation study.

Remark 2.

Algorithm 1 can be viewed as a generalization of PCA to binary data. PCA is an SVD-based algorithm (e.g., Chapter 14, Friedman et al., 2001) that is fast and commonly used for exploratory linear factor analysis. Unfortunately, PCA cannot be applied to exploratory IFA, due to the nonlinear link function in IFA models. Unlike PCA, which applies SVD only once, Algorithm 1 applies SVD twice. The first application of SVD and the inverse transformation (steps 2-5) denoise and linearize the data. Then, the second application of SVD (steps 6-7) essentially performs PCA on the linearized data.

Remark 3.

Similar to the CJMLE (Chen et al., 2019b, c), this SVD-based estimator does not require the latent distribution $F$ to be known or to take a parametric form, as is required in the MMLE approach. Moreover, exploratory IFA based on tetrachoric/polychoric correlations (Muthén, 1984; Lee et al., 1990, 1992; Jöreskog, 1994) or on a composite-likelihood-based estimator (Katsikatsou et al., 2012) requires $F$ to be multivariate normal, with the former approach further requiring the inverse link $f$ to be probit. In this sense, the SVD-based estimator and the CJMLE require fewer model assumptions than the other estimators. As a price, their consistency requires stronger conditions, specifically, a double asymptotic regime where both $N$ and $J$ diverge.

Remark 4.

Steps 2-4 of the algorithm essentially follow the same procedure as Chatterjee (2015) for matrix estimation. We thus refer the readers to Chatterjee (2015) for the details. A small difference is that we require $\tilde{K}\geq K+1$ in step 3 of the algorithm. This modification does not affect the asymptotic behavior of the estimator. However, it can improve the finite-sample performance when $N$ and $J$ are not large enough. Intuitively, we need $\tilde{K}$ to be at least $K+1$ in order to recover the matrix $(d_{j}+\mathbf{a}_{j}^{\top}\bm{\theta}_{i})_{N\times J}$, which is of rank $K+1$. The constant 1.01 in step 3 of the algorithm follows Theorem 1.1 of Chatterjee (2015), which makes use of the fact that $Var(Y_{ij})\leq 1/4$. This constant can be replaced by any fixed constant in the open interval $(1,1.5)$ without affecting the consistency given in Theorem 1 below. We set it to 1.01 because, according to Theorem 1.1 of Chatterjee (2015), this constant should be chosen close to 1 for better accuracy.

Remark 5.

The truncation step (step 4) is necessary, as it guarantees the existence of a solution. This is because, even though $x_{ij}$ in step 3 is approximating the true probability $\Pr(Y_{ij}=1)$ , it is not guaranteed to be in the interval $(0,1)$ . As a consequence, $f^{-1}(x_{ij})$ may not be well-defined. The pre-specified truncation parameter $\epsilon_{N,J}>0$ determines the truncation level. As shown in the sequel, the choice of $\epsilon_{N,J}$ affects the statistical consistency of the proposed algorithm. Under certain circumstances, we will need the truncation parameter $\epsilon_{N,J}$ to decay to zero as $N$ and $J$ grow to infinity, which is why we attach subscripts $N$ and $J$ to the truncation parameter. In practice, the performance of the proposed method tends to be insensitive to the choice of $\epsilon_{N,J}$ when it is chosen sufficiently small, which is justified theoretically by Propositions 1 and 2 below, under two specific settings. In the numerical analysis of this paper, we use $\epsilon_{N,J}=10^{-4}$ as a default value.

Statistical consistency.

In what follows, we establish the theoretical consistency of this method. In particular, we show that this SVD-based algorithm is consistent under similar asymptotic setting and notion of consistency as in Chen et al. (2019b) and Chen et al. (2019c). The proofs of our theoretical results are given in the supplementary material. More precisely, we consider a loss function on the recovery of the true loading matrix $A^{*}=(a_{jk}^{*})_{J\times K}$ up to an oblique rotation

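$$L_{N,J}(A^{*},\hat{A})=\min_{O\in\mathbb{R}^{K\times K}}\left\{\frac{\|A^{*}-\hat{A}O\|_{F}^{2}}{JK}\right\},\tag{2}$$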

where the subscripts $N$ and $J$ are used to emphasize that the loss function depends on the sample size $N$ and the number of items $J$ , and $\|X\|_{F}=\sqrt{\sum_{i}\sum_{j}x_{ij}^{2}}$ denotes the Frobenius norm of a matrix $X=(x_{ij})$ . Under mild technical conditions and a double asymptotic setting where both $N$ and $J$ grow to infinity, we show the loss function $L_{N,J}(A^{*},\hat{A})$ converges to zero in probability. The regularity conditions and the consistency result are formally described in Theorem 1, with two special cases discussed in the sequel. Similar double asymptotic settings have been considered in psychometric research, including the analyses of unidimensional IRT models (Haberman, 1977, 2004) and diagnostic classification models (Chiu et al., 2016). The following regularity conditions are needed for our main result in Theorem 1. As will be discussed in the sequel, these conditions are mild.
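Because the minimization in the loss function is over all $K\times K$ matrices $O$, it is an ordinary least-squares problem and can be evaluated directly. The following sketch (the function name `loading_loss` is hypothetical) computes the loss and the minimizing $O$ with `numpy.linalg.lstsq`.

```python
import numpy as np


def loading_loss(A_true, A_hat):
    """Evaluate min over O of ||A_true - A_hat O||_F^2 / (J K).

    A_true, A_hat: (J, K) arrays. Returns (loss, O_min), where O_min is the
    least-squares minimizer.
    """
    A_true = np.asarray(A_true, dtype=float)
    A_hat = np.asarray(A_hat, dtype=float)
    J, K = A_true.shape
    # Each column of O solves a separate least-squares regression on A_hat.
    O_min, *_ = np.linalg.lstsq(A_hat, A_true, rcond=None)
    resid = A_true - A_hat @ O_min
    return float(np.sum(resid ** 2) / (J * K)), O_min
```

If `A_hat` equals the true loading matrix multiplied by any invertible matrix, the loss is zero up to floating-point error, which reflects that recovery is only assessed up to an oblique rotation.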

A1.

There exists a constant $C$ such that $\sqrt{(d_{j}^{*})^{2}+\|\mathbf{a}^{*}_{j}\|^{2}}\leq C$ , for $j=1,...,J$ , where $d_{j}^{*}$ and $\mathbf{a}_{j}^{*}$ are the true item parameters.

A2.

The true person parameters $\bm{\theta}_{1}^{*},...,\bm{\theta}_{N}^{*}$ are independent and identically distributed (i.i.d.) following a distribution $F$ which has mean $\mathbf{0}$ and positive definite covariance matrix $\Sigma.$

A3.

The inverse link function $f$ is strictly monotone increasing, continuously differentiable, and Lipschitz continuous with Lipschitz constant $L$ . We further assume that

$$\lim_{x\to-\infty}f(x)=0,\quad\text{and}\quad\lim_{x\to\infty}f(x)=1.$$

A4.

There exists a constant $C_{1},$ such that the $K$ th singular value of $A^{*}$ , denoted by $\sigma_{K}(A^{*})$ , satisfies $\sigma_{K}(A^{*})\geq C_{1}\sqrt{J}$ for all $J$ .

A5.

The sample size $N$ is no less than the number of items $J$ , i.e., $N\geq J$ .

Theorem 1.

Suppose that conditions A1-A5 are satisfied. Further suppose that $\epsilon_{N,J}\leq\frac{1}{5}$ and satisfies

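$$\Pr(\|\bm{\theta}_{1}^{*}\|\geq h(2\epsilon_{N,J})/C)=o(N^{-1}),\tag{3}$$

$$\frac{(h(2\epsilon_{N,J}))^{\frac{K+1}{K+3}}}{(\epsilon_{N,J}\,g(\epsilon_{N,J}))^{2}}=o(J^{\frac{1}{K+3}}),\tag{4}$$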

where

[TABLE]

Then the estimate $\hat{A}$ given by Algorithm 1 satisfies $L_{N,J}(A^{*},\hat{A})\overset{pr}{\to}0$ , as $N,J\rightarrow\infty.$

Remark 6.

We remark that the notion of consistency for the estimation of the loading matrix is weaker than that in the traditional sense, since the loss function (2) is an average of the entry-wise losses when $J$ grows. Let $\tilde{O}$ minimize the right-hand side of (2) and let $\tilde{A}:=(\tilde{a}_{jk})_{J\times K}=\hat{A}\tilde{O}$. Then the convergence of (2) to 0 implies that, for any $\epsilon>0$, $({\sum_{j=1}^{J}\sum_{k=1}^{K}1_{\{|a_{jk}^{*}-\tilde{a}_{jk}|>\epsilon\}}})/{JK}$ also converges to 0. That is, the proportion of inaccurately estimated loading parameters converges to zero in probability under the optimal rotation. Due to the double asymptotic setting, our theoretical result only supports the use of the SVD-based algorithm when the sample size $N$ and the number of items $J$ are both large.
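The implication in this remark follows from a Markov-type bound. As a short worked step that is not spelled out in the text, note that an indicator $1_{\{|x|>\epsilon\}}$ is at most $x^{2}/\epsilon^{2}$, so for any fixed $\epsilon>0$,

$$\frac{1}{JK}\sum_{j=1}^{J}\sum_{k=1}^{K}1_{\{|a_{jk}^{*}-\tilde{a}_{jk}|>\epsilon\}}\leq\frac{1}{JK\epsilon^{2}}\sum_{j=1}^{J}\sum_{k=1}^{K}(a_{jk}^{*}-\tilde{a}_{jk})^{2}=\frac{\|A^{*}-\hat{A}\tilde{O}\|_{F}^{2}}{JK\epsilon^{2}}=\frac{L_{N,J}(A^{*},\hat{A})}{\epsilon^{2}}.$$

Hence the proportion of loadings estimated with error larger than $\epsilon$ converges to zero in probability whenever $L_{N,J}(A^{*},\hat{A})$ does.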

Remark 7.

It has been well-understood that PCA can consistently estimate a linear factor model under a similar double asymptotic setting (Stock and Watson, 2002), which provides the theoretical justification for the use of PCA in exploratory linear factor analysis. Theorem 1 can be viewed as a similar result for exploratory item factor analysis.

Remark 8.

We briefly discuss the regularity conditions required in Theorem 1. Assumption A1 requires that the parameters of each item, including the intercept and slope parameters, should not be too large. That is, the presence of an extreme item is likely to distort the analysis. Assumption A2 is a very standard assumption in exploratory IFA. It is more flexible than many exploratory IFA settings, as it does not require the distribution $F$ to be multivariate normal. Assumption A3 is satisfied by the logistic and probit link functions, the two most commonly used link functions in exploratory IFA, but it excludes, for example, the multidimensional version of the three-parameter logistic model, whose inverse link has a positive lower asymptote. Assumption A4 requires that there is sufficient variability in the items. The same assumption is also required in Chen et al. (2019b) and Chen et al. (2019c). In fact, this assumption is satisfied with probability tending to one when the true loadings $\mathbf{a}_{j}^{*}$ are i.i.d. samples from a $K$-variate distribution whose covariance matrix is non-degenerate. Finally, assumption A5 is practically reasonable, as in large-scale measurement the sample size is usually larger than the number of items. Since people and items are almost mathematically symmetric in the IFA model, similar asymptotic results can be derived when $J\geq N$.

Remark 9.

We further provide some intuitions on the reason why the algorithm works. Steps 2-4 essentially follow the same procedure of Chatterjee (2015) for matrix estimation. The procedure guarantees the loss ${\sum_{i,j}(f(d_{j}^{*}+(\mathbf{a}_{j}^{*})^{\top}\bm{\theta}_{i}^{*})-\hat{x}_{ij})^{2}}/{(NJ)}$ to be small with high probability, where $d_{j}^{*}$ and $\mathbf{a}_{j}^{*}$ denote the true item specific parameters and $\bm{\theta}_{i}^{*}$ denotes the true person parameters sampled from distribution $F$ . Further with conditions A1 and A3, steps 5 and 6 guarantee the average loss $\sum_{i=1}^{N}\sum_{j=1}^{J}((\mathbf{a}_{j}^{*})^{\top}\bm{\theta}_{i}^{*}-\hat{\mathbf{a}}_{j}^{\top}\hat{\bm{\theta}}_{i})^{2}/(NJ)$ to be small with high probability. Finally, under conditions A2 and A4, the famous Davis-Kahan-Wedin theorem from matrix perturbation theory (see e.g., Stewart and Sun, 1990; O’Rourke et al., 2018) guarantees that $L_{N,J}(A^{*},\hat{A})$ is small with high probability.

Remark 10.

Equations (3) and (4) are requirements on the truncation parameter $\epsilon_{N,J}$, which depend on both the tail of the distribution $F$ and the properties of the inverse link function. Roughly speaking, Equation (3) says that $\epsilon_{N,J}$ cannot be too large. This is because, given $F$ and $f$, the probability in (3) is increasing in $\epsilon_{N,J}$. Requiring the probability to be $o(N^{-1})$ implies that $\epsilon_{N,J}$ cannot be large. This requirement is intuitive, because $\tilde{M}$ can be a poor approximation to $M^{*}=(m_{ij}^{*})_{N\times J}:=(d_{j}^{*}+(\mathbf{a}_{j}^{*})^{\top}\bm{\theta}_{i}^{*})_{N\times J}$ when many entries of $M^{*}$ are larger in absolute value than $h(\epsilon_{N,J})$. The function $h(\cdot)$ transforms the truncation on $x_{ij}$ to a truncation on $\tilde{m}_{ij}$. Using $h(2\epsilon_{N,J})$ instead of $h(\epsilon_{N,J})$ is for technical reasons.

Equation (4) requires that $\epsilon_{N,J}$ cannot be too small, as the left hand side of (4) is decreasing in $\epsilon_{N,J}$ . This requirement is also intuitive. Note that $|\tilde{m}_{ij}|\leq h(\epsilon_{N,J})$ , where $h(\epsilon_{N,J})$ is decreasing in $\epsilon_{N,J}$ . Therefore, a sufficiently large choice of $\epsilon_{N,J}$ avoids the approximation error $\|\tilde{M}-M^{*}\|_{F}$ being too large when there exist some extreme estimates $\tilde{m}_{ij}$ . Function $g(\cdot)$ measures the local flatness of the inverse link $f$ . The true matrix $M^{*}$ is more difficult to estimate when $g(\epsilon_{N,J})$ is smaller. This is because $|\tilde{m}_{ij}-m_{ij}^{*}|$ can be large, even when $|\hat{x}_{ij}-f(m_{ij}^{*})|$ is small, due to the local flatness of the inverse link function.

Remark 11.

We take a stochastic design for the true person parameters and a fixed design for the true item parameters, following the convention of item factor analysis (e.g., Bartholomew et al., 2008). It is worth pointing out that the choice between a stochastic and a fixed design is not essential under our double asymptotic regime. For example, the consistency result of Theorem 1 still holds if we replace condition A2 by a corresponding fixed design as in Chen et al. (2019b).

Following the discussion on $\epsilon_{N,J}$ in Remark 10, we consider two concrete settings under which the requirement on $\epsilon_{N,J}$ becomes more specific. These results are given in Propositions 1 and 2.

Proposition 1.

Suppose that $F$ has a compact support. More precisely, there exists a constant $C_{0}$ , satisfying

$$\Pr(\|\bm{\theta}_{1}^{*}\|\geq C_{0})=0,$$

under the law of $F$ . If we fix $\epsilon_{N,J}$ to be a constant $\epsilon$ independent of $N$ and $J$ , satisfying

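$$0<\epsilon\leq\frac{1}{2}\min\left\{1-f\left(C\sqrt{C_{0}^{2}+1}\right),\,f\left(-C\sqrt{C_{0}^{2}+1}\right),\,\frac{2}{5}\right\},$$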

then (3) and (4) are satisfied. This choice of $\epsilon_{N,J}$ , together with the regularity conditions in Theorem 1, guarantees $L_{N,J}(A^{*},\hat{A})$ to converge to zero in probability.
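To get a feel for how large the constant $\epsilon$ in Proposition 1 may be, the bound can be evaluated numerically. The sketch below assumes the bound reads $\frac{1}{2}\min\{1-f(C\sqrt{C_{0}^{2}+1}),\,f(-C\sqrt{C_{0}^{2}+1}),\,2/5\}$, where $C\sqrt{C_{0}^{2}+1}$ bounds $|d_{j}^{*}+(\mathbf{a}_{j}^{*})^{\top}\bm{\theta}_{i}^{*}|$ by the Cauchy-Schwarz inequality under A1 and the support condition. The function name `max_truncation_constant` is hypothetical, and the logistic link is used as the default.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def max_truncation_constant(C, C0, f=logistic):
    """Largest epsilon allowed by the Proposition 1 bound (as read above).

    C: bound on sqrt(d_j^2 + ||a_j||^2) from condition A1.
    C0: bound on ||theta|| from the compact-support condition.
    f: inverse link function.
    """
    b = C * math.sqrt(C0 ** 2 + 1.0)  # bound on |d_j + a_j^T theta_i|
    return 0.5 * min(1.0 - f(b), f(-b), 0.4)


for C, C0 in [(1.0, 1.0), (2.0, 2.0), (3.0, 2.5)]:
    print(C, C0, max_truncation_constant(C, C0))
```

Under these assumptions and the logistic link, a value such as $\epsilon=10^{-4}$ satisfies the bound as long as $C\sqrt{C_{0}^{2}+1}$ is below roughly 8.5.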

Proposition 2.

Consider exploratory IFA based on the M2PL model, where $F$ is a multivariate sub-Gaussian distribution and $f$ is the logistic link. Here, we say the distribution of a $K$-variate random vector $\bm{\theta}$ is sub-Gaussian if there exist constants $b_{1},b_{2}>0$ such that for any $\mathbf{u}\in\mathbb{R}^{K}$ with $\|\mathbf{u}\|=1$ and any $t>0$, $\Pr(|\mathbf{u}^{\top}\bm{\theta}|>t)\leq b_{1}e^{-b_{2}t^{2}}$; in particular, the multivariate normal distribution is sub-Gaussian. Suppose that there exists a constant $\beta\geq 1$ such that

$$J\leq N\leq J^{\beta}.$$

Then (3) and (4) hold, for any $\epsilon_{N,J}$ taking the form

$$\epsilon_{N,J}=\gamma_{0}J^{-\gamma_{1}},\tag{9}$$

where $\gamma_{0}$ and $\gamma_{1}$ are any constants satisfying $\gamma_{0}>0$ and $\gamma_{1}\in(0,(4(K+3))^{-1})$ . The choice of $\epsilon_{N,J}$ following (9), together with the regularity conditions in Theorem 1, guarantees $L_{N,J}(A^{*},\hat{A})$ to converge to zero in probability.

According to the result of Proposition 1, it suffices to choose $\epsilon_{N,J}$ as a sufficiently small positive constant, when $F$ has a bounded support. Under the setting of Proposition 2, to ensure consistency, one has to let $\epsilon_{N,J}$ decay to zero at an appropriate rate. Note that even in the second setting where the support of $F$ is unbounded, $\epsilon_{N,J}$ is almost like a constant, as it decays to zero very slowly when $J$ grows. These results suggest that we may choose $\epsilon_{N,J}$ to be a sufficiently small constant in practice.
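The slow decay of $\epsilon_{N,J}=\gamma_{0}J^{-\gamma_{1}}$ under Proposition 2 can be seen with a few lines of arithmetic. In the sketch below, the function name `truncation_level` and the default choices of $\gamma_{0}$ and $\gamma_{1}$ are illustrative assumptions; the only constraint taken from the text is $\gamma_{0}>0$ and $\gamma_{1}\in(0,(4(K+3))^{-1})$.

```python
def truncation_level(J, K, gamma0=0.1, gamma1=None):
    """epsilon_{N,J} = gamma0 * J ** (-gamma1) with gamma1 in (0, 1 / (4 (K + 3)))."""
    upper = 1.0 / (4.0 * (K + 3))
    if gamma1 is None:
        gamma1 = 0.5 * upper  # illustrative choice in the middle of the range
    if gamma0 <= 0.0 or not (0.0 < gamma1 < upper):
        raise ValueError("need gamma0 > 0 and 0 < gamma1 < 1 / (4 (K + 3))")
    return gamma0 * J ** (-gamma1)


K = 5
for J in (100, 1000, 10000):
    print(J, truncation_level(J, K))
```

With $K=5$ the exponent is below $1/32$, so increasing $J$ a hundredfold shrinks $\epsilon_{N,J}$ by less than 15 percent, which is the sense in which it behaves almost like a constant.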

On the choice of $K$ .

In the previous discussion, the number of factors $K$ is assumed to be known. In practice, however, this information is often unknown, and an important task in exploratory IFA is to determine the number of factors based on data. When conducting exploratory linear factor analysis, one typically gets a first impression by examining the scree plot from principal component analysis. Thanks to the connection between Algorithm 1 and PCA as discussed in Remark 2, a similar scree plot is available from the current method.

The scree plot is produced as follows. We first run Algorithm 1, but replace the unknown $K$ in step 1 of the algorithm by a reasonably large number $K^{\dagger}$. Then, a scree plot can be obtained by plotting the singular values $\hat{\sigma}_{k}$ produced by step 7 of Algorithm 1 in descending order. Figure 1 shows such a scree plot, for which the data are generated from a five-factor model ($K=5$) with $J=200$ and $N=4000$, and the input number of factors is set to $K^{\dagger}=10$ in step 1 of the algorithm. Unsurprisingly, an obvious gap is observed between $\hat{\sigma}_{5}$ and $\hat{\sigma}_{6}$. In fact, when data follow an IFA model, such a gap in the singular values is guaranteed to exist asymptotically, no matter what the input dimension is. In practice, the latent dimension $K$ can be chosen by identifying the singular value gap from the scree plot.
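The scree values described above can be computed with a compact version of steps 2-7. The sketch below assumes the logistic link; the function names are hypothetical, and `largest_gap` is a crude automatic stand-in for the visual inspection of the scree plot described in the text, not a rule proposed in the paper.

```python
import numpy as np


def scree_singular_values(Y, K_dagger, eps=1e-4):
    """Singular values from step 7 of Algorithm 1, run with input dimension K_dagger."""
    Y = np.asarray(Y, dtype=float)
    N, J = Y.shape
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    k = max(K_dagger + 1, int(np.sum(s >= 1.01 * np.sqrt(N))))
    k = min(k, s.size)                                   # guard added in this sketch
    X_hat = np.clip((U[:, :k] * s[:k]) @ Vt[:k], eps, 1.0 - eps)
    M = np.log(X_hat) - np.log1p(-X_hat)                 # logit, i.e. logistic link assumed
    M = M - M.mean(axis=0)                               # subtract intercept estimates
    return np.linalg.svd(M, compute_uv=False)            # descending order


def largest_gap(sigma_hat, max_k):
    """Return k in 1..max_k maximizing sigma_k - sigma_{k+1} (heuristic)."""
    s = np.asarray(sigma_hat, dtype=float)[: max_k + 1]
    gaps = s[:-1] - s[1:]
    return int(np.argmax(gaps)) + 1
```

Plotting the returned values against their index gives the scree plot; dividing them by $\sqrt{NJ}$ puts them on the scale used in the asymptotic statement that follows.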

Theorem 2.

Under the same conditions as Theorem 1 and when the input dimension $K^{\dagger}$ in Algorithm 1 is set fixed (i.e., independent of $N$ and $J$ ) but not necessarily equal to the true number of factors, there exists a constant $\delta>0$ such that for the true number of factors $K$ ,

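$$\lim_{N,J\to\infty}\Pr\left(\frac{\hat{\sigma}_{K}}{\sqrt{NJ}}>\delta\right)=1,\quad\text{and}\quad\frac{\hat{\sigma}_{K+1}}{\sqrt{NJ}}\overset{pr}{\to}0,$$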

as $N$ and $J$ grow to infinity simultaneously.

Remark 12.

As shown in the proof, the input dimension $K^{\dagger}$ does not affect the asymptotics, as long as it does not grow with $N$ and $J$. However, for relatively small $N$ and $J$, the matrix $X$ obtained in step 3 of the algorithm may not retain enough information when the input dimension is smaller than $K+1$, which may lead to an underestimation of the number of factors. Thus, in practical applications, we recommend choosing the input dimension to be slightly larger than the maximum number of factors one suspects to exist in the data.

Statistical efficiency.

We further point out that a price is paid for the computational advantage of the SVD-based estimator. To elaborate on this point, we compare it with the CJMLE (Chen et al., 2019b, c). The CJMLE treats both item parameters and latent factors as fixed parameters and maximizes a joint likelihood function with respect to all the fixed parameters. The SVD-based estimator is statistically less efficient than the CJMLE, in the sense that the SVD-based estimator converges to the true parameters at a much slower rate. To make this comparison, we consider the same setting as in Proposition 1. The following proposition establishes the convergence rate for $\|X^{*}-\hat{X}\|_{F}^{2}/(NJ)$, which determines the convergence of $\hat{A}$. Here, $X^{*}=(f(d_{j}^{*}+(\mathbf{a}_{j}^{*})^{\top}\bm{\theta}_{i}^{*}))_{N\times J}$ is the true item response probability matrix.

Proposition 3.

Suppose that the same assumptions as in Proposition 1 hold and choose $\epsilon_{N,J}$ as in Proposition 1. Then we have

$$\frac{1}{NJ}\|X^{*}-\hat{X}\|_{F}^{2}=O_{p}\big(J^{-\frac{1}{K+2}}\big).$$

On the other hand, as shown in Chen et al. (2019c), the CJMLE achieves the optimal rate (in the minimax sense) for estimating $X^{*}$, that is, $\|X^{*}-\hat{X}_{JML}\|_{F}^{2}/(NJ)=O_{p}(J^{-1}),$ where $\hat{X}_{JML}$ denotes the CJMLE. This result suggests that the SVD-based estimator converges at a much slower rate than the CJMLE.
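To give a sense of the size of this difference, the two rates can be evaluated at a concrete point. The values $J=1000$ and $K=4$ below are chosen purely for illustration, and the unknown constants hidden in the $O_{p}(\cdot)$ notation are ignored, so the numbers indicate orders of magnitude only.

```latex
\[
J = 1000,\quad K = 4:\qquad
J^{-\frac{1}{K+2}} = 1000^{-1/6} = 10^{-1/2} \approx 0.316,
\qquad
J^{-1} = 10^{-3}.
\]
```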

3 Extensions

Dealing with missing data.

With slight modification, Algorithm 1 can handle item response data with missing values. We use matrix $W=(w_{ij})_{N\times J}$ to indicate the data nonmissingness, where $w_{ij}=1$ indicates the response $Y_{ij}$ is not missing and $w_{ij}=0$ otherwise. The modified algorithm is described as follows.

Algorithm 2 (SVD-based estimator for exploratory IFA with missing data).

1. Input nonmissing indicator $W=(w_{ij})_{N\times J}$, nonmissing responses $\{y_{ij}:w_{ij}=1,i=1,...,N,j=1,...,J\}$, the number of factors $K$, inverse link function $f$, and truncation parameter $\epsilon_{N,J}>0$.

2. Compute $\hat{p}=(\sum_{i=1}^{N}\sum_{j=1}^{J}w_{ij})/(NJ)$ as the proportion of observed responses.

3. For each $i$ and $j$, let $z_{ij}=y_{ij}$ if $w_{ij}=1$, and $z_{ij}=0$ if $w_{ij}=0$. Let $Z=(z_{ij})_{N\times J}$.

4. Apply the singular value decomposition to $Z$ to obtain $Z=\sum_{j=1}^{J}\sigma_{j}\mathbf{u}_{j}\mathbf{v}_{j}^{\top}$, where $\sigma_{1}\geq...\geq\sigma_{J}\geq 0$ are the singular values and the $\mathbf{u}_{j}$s and $\mathbf{v}_{j}$s are the left and right singular vectors, respectively.

5. Let

$$X=(x_{ij})_{N\times J}=\frac{1}{\hat{p}}\sum_{k=1}^{\tilde{K}}\sigma_{k}\mathbf{u}_{k}\mathbf{v}_{k}^{\top},$$

where $\tilde{K}=\max\big\{K+1,\operatornamewithlimits{arg\,max}_{k}\{\sigma_{k}\geq 1.01\sqrt{N(\hat{p}+3\hat{p}(1-\hat{p}))}\}\big\}$.

6. Let $\hat{X}=(\hat{x}_{ij})_{N\times J}$ be defined as

$$\hat{x}_{ij}=\begin{cases}\epsilon_{N,J}, & \text{if } x_{ij}<\epsilon_{N,J},\\ x_{ij}, & \text{if } \epsilon_{N,J}\leq x_{ij}\leq 1-\epsilon_{N,J},\\ 1-\epsilon_{N,J}, & \text{if } x_{ij}>1-\epsilon_{N,J}.\end{cases}$$

7. Let $\tilde{M}=(\tilde{m}_{ij})_{N\times J},$ where $\tilde{m}_{ij}=f^{-1}(\hat{x}_{ij}).$

8. Let $\hat{\mathbf{d}}=(\hat{d}_{1},...,\hat{d}_{J})$, where $\hat{d}_{j}=(\sum_{i=1}^{N}\tilde{m}_{ij})/N$.

9. Apply the singular value decomposition to $\hat{M}=(\tilde{m}_{ij}-\hat{d}_{j})_{N\times J}$ to obtain $\hat{M}=\sum_{j=1}^{J}\hat{\sigma}_{j}\hat{\mathbf{u}}_{j}\hat{\mathbf{v}}_{j}^{\top}$, where $\hat{\sigma}_{1}\geq...\geq\hat{\sigma}_{J}\geq 0$ are the singular values and the $\hat{\mathbf{u}}_{j}$s and $\hat{\mathbf{v}}_{j}$s are the left and right singular vectors, respectively.

10. Output $\hat{A}=\frac{1}{\sqrt{N}}(\hat{\sigma}_{1}\hat{\mathbf{v}}_{1},...,\hat{\sigma}_{K}\hat{\mathbf{v}}_{K}),\hat{\Theta}=\sqrt{N}(\hat{\mathbf{u}}_{1},...,\hat{\mathbf{u}}_{K}).$
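The steps of Algorithm 2 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `svd_ifa_missing` is hypothetical, the inverse link $f$ is fixed to the logistic function (so that $f^{-1}$ is the logit), and the rank-$\tilde{K}$ truncation rescaled by $1/\hat{p}$ in step 5 follows the description of the bias correction in the accompanying remark and the procedure of Chatterjee (2015).

```python
import numpy as np

def svd_ifa_missing(Y, W, K, eps=1e-4):
    """Sketch of Algorithm 2, assuming a logistic inverse link f."""
    N, J = Y.shape
    p_hat = W.sum() / (N * J)                          # step 2: observed proportion
    Z = np.where(W == 1, Y, 0).astype(float)           # step 3: zero imputation
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # step 4
    # Step 5: keep at least K + 1 singular values, rescale by 1 / p_hat.
    thresh = 1.01 * np.sqrt(N * (p_hat + 3 * p_hat * (1 - p_hat)))
    K_tilde = max(K + 1, int(np.sum(s >= thresh)))
    X = (U[:, :K_tilde] * s[:K_tilde]) @ Vt[:K_tilde] / p_hat
    X_hat = np.clip(X, eps, 1 - eps)                   # step 6: truncation
    M_tilde = np.log(X_hat / (1 - X_hat))              # step 7: logit (assumed f^{-1})
    d_hat = M_tilde.mean(axis=0)                       # step 8: intercepts
    M_hat = M_tilde - d_hat                            # step 9: column centering
    U2, s2, V2t = np.linalg.svd(M_hat, full_matrices=False)
    A_hat = (V2t[:K].T * s2[:K]) / np.sqrt(N)          # step 10: loadings
    Theta_hat = np.sqrt(N) * U2[:, :K]                 # step 10: factor scores
    return A_hat, Theta_hat, d_hat

# Small synthetic example (all settings here are illustrative only).
rng = np.random.default_rng(0)
N, J, K = 300, 40, 2
theta = rng.standard_normal((N, K))
A = rng.uniform(1, 2, (J, K))
d = rng.uniform(-1, 1, J)
P = 1 / (1 + np.exp(-(theta @ A.T + d)))
Y = (rng.uniform(size=(N, J)) < P).astype(float)
W = (rng.uniform(size=(N, J)) < 0.8).astype(int)       # about 20% missing
Y[W == 0] = np.nan                                     # missing entries are never read
A_hat, Theta_hat, d_hat = svd_ifa_missing(Y, W, K)
```

Because the output is built from orthonormal singular vectors, $\hat{\Theta}^{\top}\hat{\Theta}/N$ equals the identity matrix; the loadings are therefore identified only up to a rotation, as usual in exploratory analysis.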

Remark 13.

It is easy to see that $\hat{p}=1$ when there are no missing data. In that case, Algorithm 2 becomes exactly the same as Algorithm 1. Steps 2-5 essentially follow the procedure of Chatterjee (2015) for matrix completion, and the remaining steps are the same as those in Algorithm 1. Specifically, missing data are first imputed by zero in step 3 of the algorithm. The bias introduced by this simple imputation is corrected in step 5 by multiplying by the factor $1/\hat{p}$. As in Algorithm 1, the choice of $\tilde{K}$ in step 5 follows the procedure of Chatterjee (2015), with a small modification that guarantees $\tilde{K}\geq K+1$.

In fact, when the entries of the item response matrix are missing completely at random, using a similar proof, one can show that $\hat{A}$ given by Algorithm 2 is still consistent, under some mild condition on the missing data mechanism and the same conditions as in Theorem 1. Specifically, the following condition is needed, in addition to conditions A1-A5.

A6.

The $w_{ij}$ s are independent and identically distributed from a Bernoulli distribution with $\Pr(w_{ij}=1)=p,$ where $0<p\leq 1$ is a constant which does not depend on $N$ and $J$ .
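The role of the factor $1/\hat{p}$ can be seen from a one-line calculation. The derivation below is a sketch that assumes, in line with the missing-completely-at-random setting, that $w_{ij}$ is independent of $Y_{ij}$ given $\bm{\theta}_{i}^{*}$; it uses $z_{ij}=w_{ij}Y_{ij}$ from step 3 of Algorithm 2 and writes $x^{*}_{ij}$ for the true response probability.

```latex
\[
\mathbb{E}\big[z_{ij}\mid\bm{\theta}_{i}^{*}\big]
= \mathbb{E}\big[w_{ij}Y_{ij}\mid\bm{\theta}_{i}^{*}\big]
= \Pr(w_{ij}=1)\,\Pr\big(Y_{ij}=1\mid\bm{\theta}_{i}^{*}\big)
= p\,x^{*}_{ij},
\qquad\text{so}\qquad
\mathbb{E}\big[z_{ij}/p\mid\bm{\theta}_{i}^{*}\big] = x^{*}_{ij}.
\]
```

Zero imputation thus shrinks the expected entries by the factor $p$, and dividing by the estimate $\hat{p}$ approximately undoes this shrinkage.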

Under conditions A1-A6, the following proposition guarantees the consistency of the proposed SVD-based estimator.

Proposition 4.

Under the same conditions as Theorem 1 plus condition A6, the estimate $\hat{A}$ given by Algorithm 2 satisfies $L_{N,J}(A^{*},\hat{A})\overset{pr}{\to}0$ , as $N,J\rightarrow\infty.$

Dealing with ordinal data.

In exploratory IFA, ordinal data are also commonly encountered, due to the wide use of Likert-scale items. With slight modification, the SVD method can also be used to analyze ordinal data. This is achieved by applying Algorithm 1 to multiple dichotomized versions of data.

More precisely, consider data $Y=(Y_{ij})_{N\times J}$ , where $Y_{ij}\in\{0,1,...,T\}$ . We consider a general family of graded response type models,

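$$\Pr(Y_{ij}\geq t\mid\bm{\theta}_{i})=f(d_{jt}+\mathbf{a}_{j}^{\top}\bm{\theta}_{i}),\quad t=1,...,T,\qquad(11)$$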

where $d_{jt}$ is an item- and category-specific intercept parameter, and the rest of the notation is the same as in model (1). Note that the linear combination of the factors $\mathbf{a}_{j}^{\top}\bm{\theta}_{i}$ does not depend on the response category and appears in all the submodels $\Pr(Y_{ij}\geq t|\bm{\theta}_{i})$ for $t=1,...,T$. When $f(x)=\exp(x)/(1+\exp(x))$ takes the logistic form, model (11) becomes the multidimensional graded response model (Muraki and Carlson, 1995).

Model (11) is closely related to the general model (1) for binary data. In fact, if we dichotomize data at response category $t$, i.e., $Y_{ij}^{(t)}=1_{\{Y_{ij}\geq t\}}$, then the binary data $Y_{ij}^{(t)}$ follow model (1) with the same loading parameters. Therefore, the loading matrix $A$ can be estimated by applying Algorithm 1 to dichotomized data $Y^{(t)}=(1_{\{y_{ij}\geq t\}})_{N\times J}$, for some $t=1,...,T$. The estimation accuracy may be further improved by aggregating the results from multiple dichotomized versions of the data. This aggregation method is summarized by Algorithm 3 below.

Algorithm 3 (SVD-based estimator for exploratory IFA with ordinal data).

1. Input response $Y=(y_{ij})_{N\times J}$, the number of categories $T$, the number of factors $K$, inverse link function $f$, and truncation parameter $\epsilon_{N,J}>0$.

2. For $t=1,...,T$, apply Algorithm 1 to dichotomized data $Y^{(t)}=(1_{\{y_{ij}\geq t\}})_{N\times J}$ and obtain $\hat{M}^{(t)}$ from step 7 of Algorithm 1.

3. Let $\hat{M}=(\sum_{t=1}^{T}\hat{M}^{(t)})/{T}$. Apply the singular value decomposition to $\hat{M}$ and obtain $\hat{M}=\sum_{j=1}^{J}\hat{\sigma}_{j}\hat{\mathbf{u}}_{j}\hat{\mathbf{v}}_{j}^{\top}$, where $\hat{\sigma}_{1}\geq...\geq\hat{\sigma}_{J}\geq 0$ are the singular values and the $\hat{\mathbf{u}}_{j}$s and $\hat{\mathbf{v}}_{j}$s are the left and right singular vectors, respectively.

4. Output $\hat{A}=\frac{1}{\sqrt{N}}(\hat{\sigma}_{1}\hat{\mathbf{v}}_{1},...,\hat{\sigma}_{K}\hat{\mathbf{v}}_{K}),\hat{\Theta}=\sqrt{N}(\hat{\mathbf{u}}_{1},...,\hat{\mathbf{u}}_{K}).$
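The aggregation in Algorithm 3 can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the link is assumed to be logistic, and the helper `centered_logit_matrix` reproduces the binary-data steps as the complete-data case ($\hat{p}=1$) of Algorithm 2, which the note states coincides with Algorithm 1.

```python
import numpy as np

def centered_logit_matrix(Yb, K, eps=1e-4):
    """Binary-data steps up to the centered matrix M_hat (logistic link assumed)."""
    N, J = Yb.shape
    U, s, Vt = np.linalg.svd(Yb.astype(float), full_matrices=False)
    # With complete data p_hat = 1, so the threshold reduces to 1.01 * sqrt(N).
    K_tilde = max(K + 1, int(np.sum(s >= 1.01 * np.sqrt(N))))
    X = (U[:, :K_tilde] * s[:K_tilde]) @ Vt[:K_tilde]
    X_hat = np.clip(X, eps, 1 - eps)                 # truncate to [eps, 1 - eps]
    M_tilde = np.log(X_hat / (1 - X_hat))            # logit transform
    return M_tilde - M_tilde.mean(axis=0)            # remove column means

def svd_ifa_ordinal(Y, T, K, eps=1e-4):
    """Sketch of Algorithm 3: average M_hat over the T dichotomizations."""
    N = Y.shape[0]
    M_hat = sum(centered_logit_matrix(Y >= t, K, eps) for t in range(1, T + 1)) / T
    U, s, Vt = np.linalg.svd(M_hat, full_matrices=False)
    A_hat = (Vt[:K].T * s[:K]) / np.sqrt(N)
    Theta_hat = np.sqrt(N) * U[:, :K]
    return A_hat, Theta_hat

# Illustrative call on arbitrary ordinal data with categories 0, 1, 2, 3.
rng = np.random.default_rng(0)
Y = rng.integers(0, 4, size=(200, 30))
A_hat, Theta_hat = svd_ifa_ordinal(Y, T=3, K=2)
```

Averaging the centered matrices before the final decomposition means that only one loading matrix is produced, consistent with the loadings being shared across response categories in model (11).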

4 Simulation

Simulation setting.

We consider $K=4$ and $8$ , $J=200,400,600,800$ , $1000$ , and $1200$ , and $N=20J$ . For each combination of $N$ , $J$ , and $K$ , two different latent distributions $F$ are considered, one is a $K$ -variate standard normal distribution, and the other is a $K$ -variate normal distribution $N(\mathbf{0},(\sigma_{ij})_{K\times K})$ , where $\sigma_{ij}=1$ if $i=j$ and $\sigma_{ij}=0.3$ if $i\neq j$ . The inverse link $f$ is chosen to be logistic, i.e. $f(x)=\exp(x)/(1+\exp(x))$ . This leads to 24 different simulation settings, for all possible combinations of $N$ , $J$ , $K$ , and $F$ .

For each simulation setting, 100 independent replications are generated, with the item parameters held fixed across replications. When $J=200$ and given $K$, the item parameters are generated as follows.

1. $d_{1}^{*}$, …, $d_{200}^{*}$ are i.i.d. from a uniform distribution over the interval $[-1,1]$.

2. $\mathbf{a}_{1}^{*}$, …, $\mathbf{a}_{200}^{*}$ are i.i.d., with $\mathbf{a}_{j}^{*}=(a_{j1}^{\dagger}q_{j1},...,a_{jK}^{\dagger}q_{jK})^{\top}$. Here the $a_{jk}^{\dagger}$s are i.i.d. from a uniform distribution over the interval $[1,2]$, and the $\mathbf{q}_{j}=(q_{j1},...,q_{jK})^{\top}$ are i.i.d. from a uniform distribution over $\mathcal{Q}_{K}$. Specifically,

[TABLE]

and

[TABLE]

The $\mathbf{q}_{j}$ s lead to sparse loading vectors.

When $J>200$ , we set the item parameters by repeating multiple times the parameters under $J=200$ and the same $K$ . For example, when $J=400$ , we set parameters for items 1-200 and those for items 201-400 to be the same as the parameters generated under the setting $J=200$ .
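The data-generating process described above can be sketched in NumPy. This is a rough sketch under stated assumptions: the seed and variable names are arbitrary, the correlated-factor setting with $K=4$ and $J=400$ is chosen for illustration, and, since the set $\mathcal{Q}_{K}$ is not reproduced here, the sparsity patterns $\mathbf{q}_{j}$ are replaced by random nonzero 0/1 vectors as a placeholder rather than the patterns used in the study.

```python
import numpy as np

rng = np.random.default_rng(2019)
K, J0, reps = 4, 200, 2            # reps = 2 repeats the 200 base items, so J = 400
J = J0 * reps
N = 20 * J                         # N = 20J as in the simulation design

# Base item parameters for the first 200 items.
d0 = rng.uniform(-1, 1, J0)
a_dagger = rng.uniform(1, 2, (J0, K))
# ASSUMPTION: placeholder sparsity patterns; the actual set Q_K is not shown here.
q = rng.integers(0, 2, (J0, K))
q[q.sum(axis=1) == 0, 0] = 1       # avoid all-zero loading vectors
A0 = a_dagger * q

# For J > 200 the base parameters are repeated.
d_star = np.tile(d0, reps)
A_star = np.tile(A0, (reps, 1))

# Correlated latent factors: unit variances, off-diagonal covariances 0.3.
Sigma = np.full((K, K), 0.3)
np.fill_diagonal(Sigma, 1.0)
theta = rng.multivariate_normal(np.zeros(K), Sigma, size=N)

# Binary responses under the logistic inverse link.
P = 1 / (1 + np.exp(-(theta @ A_star.T + d_star)))
Y = (rng.uniform(size=P.shape) < P).astype(int)
```

Replacing `Sigma` by the identity matrix gives the independent-factor setting.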

Results.

Each simulated dataset is analyzed using the SVD-based estimator, with the truncation parameter $\epsilon_{N,J}$ set to $10^{-4}$. The performance of the SVD-based estimator is compared with that of the CJMLE, which is implemented using the R package mirtjml (Zhang et al., 2018). All the computation is conducted on a single Intel® Gold 6130 core.

The loss for the SVD-based estimator decreases when $N$ and $J$ simultaneously grow, under all settings. Reasonable accuracy can be achieved when $N$ and $J$ are reasonably large, in which case the SVD-based estimator may be directly used for data analysis. For example, under the setting that $K=4$ and $F$ is multivariate standard normal, the loss is already around 0.006 when $J$ is 200. This suggests that the average entrywise error is around $0.08$. In addition, the loss for the SVD-based estimator tends to be smaller when the factors are independent than when they are correlated, for the same $N$, $J$, and $K$. This is because the signal in the data is weaker in the latter case, due to the redundant information in correlated factors.
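The step from the loss value to the entrywise error is a square root. The calculation below assumes that the loss is an average of squared entrywise errors in the loading matrix, which is the reading implied by the statement above.

```latex
\[
\sqrt{0.006} \approx 0.077 \approx 0.08 .
\]
```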

Moreover, we compare the performance of the two estimators. The CJMLE is always more accurate than the SVD-based estimator. This is consistent with the asymptotic theory that the CJMLE is statistically more efficient. However, in terms of computation time, the SVD-based estimator is substantially faster. Under the most time-consuming setting, where $J=1200$, $K=8$ and the factors are correlated, the SVD approach takes only about 60 seconds, while the CJMLE takes about 17 minutes. Note that, as shown in Chen et al. (2019b), the CJMLE is already substantially faster than the marginal maximum likelihood estimator (MMLE). Given its reasonable accuracy and computational advantage, the SVD-based estimator may be a good alternative to the CJMLE and the MMLE in large-scale exploratory IFA problems.

5 Concluding Remarks

As shown in this note, the proposed SVD-based algorithm is statistically consistent and has good finite sample performance in large-scale exploratory IFA problems. Although it is not the most statistically efficient, the algorithm has unique strengths over other exploratory IFA methods. In particular, it is computationally much faster. In addition, it guarantees a unique solution, while most other estimators, including the CJMLE and MMLE, can suffer from convergence issues because they involve nonconvex optimization.

Given its computational advantages and good finite sample performance, the SVD-based estimator can be used not only as a starting point for other estimators to improve their numerical convergence, but also as an alternative estimator for data analysis. Specifically, in large-scale exploratory IFA applications, we suggest starting data exploration with the SVD-based estimator. Using this estimator, we can quickly gain some understanding of the number of factors underlying the data, and of the loading structures of IFA models assuming different numbers of factors. Such initial knowledge helps us focus on a smaller set of candidate latent dimensions $K$. For these latent dimensions, we can then further investigate the loading structures by the CJMLE, using the corresponding SVD solutions as starting points. When the sample size and the number of items are relatively small, traditional methods such as the MMLE and the composite-likelihood-based estimator may be more suitable.

One limitation of the SVD-based estimator is that it is not easy to make statistical inference on the estimated loading matrix, such as constructing a confidence interval for an estimated loading parameter. This type of inference problem is not an issue for estimators based on the marginal likelihood, for which the asymptotic regime lets $N$ diverge and keeps $J$ fixed. However, it is a general challenge for both the SVD-based estimator and the CJMLE, whose consistency relies on a double asymptotic regime and for which the notion of consistency is weaker than that in the traditional sense. In recent years, this type of inference problem has received much attention in statistics (Chen et al., 2019a; Xia and Yuan, 2019). However, to the best of our knowledge, no results have been obtained under an IFA model. We leave this problem for future investigation.

Appendix

Appendix A Notations

Let $\Theta^{*}=(\bm{\theta}_{1}^{*},...,\bm{\theta}_{N}^{*})^{\top}=(\theta_{ik})_{N\times K},A^{*}=(\mathbf{a}_{1}^{*},...,\mathbf{a}_{J}^{*})^{\top}=(a_{jk})_{J\times K},$ and $\mathbf{d}^{*}=(d_{1}^{*},...,d_{J}^{*})$ denote the true person parameters, factor loadings and intercept parameters, respectively. We also denote $\bm{\theta}_{i}^{+}=(1,(\bm{\theta}_{i}^{*})^{\top})^{\top},\mathbf{a}_{j}^{+}=(d_{j}^{*},(\mathbf{a}_{j}^{*})^{\top})^{\top},\quad\text{for }i=1,...,N,\quad j=1,...,J.$ We use $\mathbf{1}_{N},\mathbf{0}_{N}$ to denote $N$-dimensional vectors with all entries being $1$ and $0$, respectively, and $B^{(K)}_{\mathbf{a}}(C)$ to denote the ball in $\mathbb{R}^{K}$ centered at $\mathbf{a}\in\mathbb{R}^{K}$ with radius $C.$ For a matrix $Z=(z_{ij})_{m\times n}$ and a function $f:\mathbb{R}\to\mathbb{R}$, let $f(Z):=(f(z_{ij}))_{m\times n}$. Let $\sigma_{k}(Z)$ denote the $k$-th largest singular value of $Z,$ and let $\|Z\|,\|Z\|_{*}$ denote the spectral norm and nuclear norm of $Z$, which are the largest singular value and the sum of all singular values, respectively. If $Z$ is a square matrix, let $\lambda_{k}(Z)$ denote the $k$-th largest eigenvalue of $Z.$
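For readers who wish to check the matrix notation numerically, the following small NumPy example computes the quantities just defined for a toy matrix. The matrix `Z` and the choice of the logistic function for the entrywise map are illustrative assumptions only.

```python
import numpy as np

# Toy 3 x 2 matrix with singular values 4 and 3.
Z = np.array([[3.0, 0.0],
              [0.0, 4.0],
              [0.0, 0.0]])

sing = np.linalg.svd(Z, compute_uv=False)   # singular values in descending order
sigma_1 = sing[0]                           # sigma_1(Z), the largest singular value
spectral = np.linalg.norm(Z, 2)             # spectral norm ||Z||
nuclear = np.linalg.norm(Z, 'nuc')          # nuclear norm ||Z||_*

# f(Z) applies a scalar function entrywise; here f is the logistic function.
f_Z = 1 / (1 + np.exp(-Z))
```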

We denote

[TABLE]

as the true probability matrix and define $\tilde{X}=(\tilde{x}_{ij})_{N\times J}$ by

[TABLE]

where $x_{ij}$ is defined in step 5 of Algorithm 2.

Throughout the proof, we use $c$ to denote a generic constant, whose value may change from line to line or even within a line. We will drop the subscripts in $\epsilon_{N,J}$ and write $\epsilon$ for notational simplicity.

Appendix B Proof of Theorems

Proof of Theorem 1.

Since Theorem 1 is a special case of Proposition 4 when $p=1$ and $W=\mathbf{1}_{N}\mathbf{1}_{J}^{\top},$ we refer the readers to the proof of Proposition 4. ∎

Proof of Theorem 2.

Let $\sigma^{*}_{k}$ denote the $k$ th largest singular value of $\Theta^{*}(A^{*})^{\top}.$ Then we have

[TABLE]

By (D.12) in the proof of Lemma 1, we can get

[TABLE]

Notice that (B.2) holds as long as the input dimension in the algorithm is fixed. Combine (B.1) and (B.2) to have

[TABLE]

Notice that $\sigma_{K+1}^{*}=0$ and we get

[TABLE]

For $k=K,$ we get

[TABLE]

for any $\tilde{\epsilon}>0$ and thus

[TABLE]

For $\sigma_{K}^{*},$ we have

[TABLE]

The last inequality is due to condition A4. Let $\hat{\Sigma}=\frac{1}{N}\sum_{i=1}^{N}\bm{\theta}_{i}^{*}(\bm{\theta}_{i}^{*})^{\top}$ and it is not hard to verify that

[TABLE]

By the law of large numbers, we know

[TABLE]

which leads to

[TABLE]

and thus

[TABLE]

Combining (B.3), (B.4), (B.5) and choosing $\tilde{\epsilon}=\frac{1}{2}\sqrt{\lambda_{K}(\Sigma^{*})},$ we have

[TABLE]

We complete the proof by choosing $\delta=\frac{1}{4}\sqrt{\lambda_{K}(\Sigma^{*})}.$ ∎

Appendix C Proof of Propositions

Proof of Proposition 1.

According to the choice of $\epsilon$ , we have $h(2\epsilon)\geq C\sqrt{C_{0}^{2}+1}.$ Then,

[TABLE]

We complete the proof by Theorem 1. ∎

Proof of Proposition 2.

For the logistic link function, we have

[TABLE]

Since $\bm{\theta}_{1}^{*}$ is a sub-Gaussian random vector, $\|\bm{\theta}_{1}^{*}\|_{2}^{2}$ is a sub-exponential random variable, which means there exist constants $c_{1},c_{2}>0$ such that for any $t>0,$ we have

[TABLE]

Then,

[TABLE]

Recall we choose $\epsilon=\gamma_{0}J^{-\gamma_{1}}$ in (9). Consequently,

[TABLE]

Therefore,

[TABLE]

where the second inequality is due to the assumption that $J^{\beta}\geq N$ . The above display together with (C.2) verifies (3). We proceed to verify (4). According to (C.1), we have

[TABLE]

Plugging in $\epsilon=\gamma_{0}J^{-\gamma_{1}}$ , the above equation becomes

[TABLE]

Thus, for $\gamma_{1}\in(0,\frac{1}{4(K+3)})$ ,

[TABLE]

This verifies (4) and completes the proof by applying Theorem 1.

∎

Proof of Proposition 3.

The proof of Proposition 3 is similar to the proof of Lemma 1. We only state the main steps and omit the repeated details. According to Lemma 3 in Appendix D, we have

[TABLE]

Recall that we assume $\|\bm{\theta}^{*}_{i}\|\leq C_{0}$ . Following the similar arguments as in the proof of Lemma 1, we have

[TABLE]

There is a difference from the proof of Lemma 1 that the rank of matrix $f(M_{\delta})$ is upper bounded by

[TABLE]

Choose $\delta=\left(\frac{cC_{0}^{K}}{JL^{2}\left(\sqrt{C_{0}^{2}+1}+C\right)^{2}}\right)^{\frac{1}{K+2}}$ , then

[TABLE]

Let $g(N,J):=cJ^{-\frac{1}{K+2}}+c\exp\left(-cN\right)$ . By taking expectation, we have

[TABLE]

For any $\Delta_{N,J}>0$ , by Chebyshev’s inequality, we have

[TABLE]

Thus, for any sequence $\Delta_{N,J}$ satisfying $\Delta_{N,J}=o(1)$ , we have

[TABLE]

In what follows, we restrict our analysis to the event $\left\{\frac{1}{NJ}\|\tilde{X}-X^{*}\|_{F}^{2}\leq\frac{g(N,J)}{\Delta_{N,J}}\right\}$ . By (7), we have $x^{*}_{ij}=f((\bm{\theta}_{i}^{*})^{\top}\mathbf{a}_{j}^{*}+d_{j}^{*})\in[2\epsilon,1-2\epsilon],$ which leads to

[TABLE]

Following the similar procedure as in proof of Lemma 1, we can further bound $\|\hat{X}-X^{*}\|_{F}^{2}$ by

[TABLE]

To summarize, we have

[TABLE]

for any $\Delta_{N,J}=o(1)$. This implies $\frac{1}{NJ}\|\hat{X}-X^{*}\|_{F}^{2}=O_{p}\left(J^{-\frac{1}{K+2}}+\exp(-cN)\right)=O_{p}(J^{-\frac{1}{K+2}})$, where the second equality is due to $N\geq J$. ∎

Proof of Proposition 4.

We have

[TABLE]

where $\tilde{A}=A^{*}\Sigma^{\frac{1}{2}}.$ Let $\tilde{\Theta}=(\tilde{\bm{\theta}}_{1},...,\tilde{\bm{\theta}}_{N})^{\top}=\Theta^{*}\Sigma^{-\frac{1}{2}}$ . Then $\Theta^{*}(A^{*})^{\top}=\tilde{\Theta}\tilde{A}^{\top}$ , and $\tilde{\bm{\theta}}_{i}$ s are independent and identically distributed from a distribution $\tilde{F}$ which has mean $\mathbf{0}$ and covariance matrix $I_{K}$ . Therefore, it suffices to show $L_{N,J}(A^{*},\hat{A})\overset{pr}{\to}0$ when $\Sigma=I_{K}$ . We prove it through the following two lemmas whose proofs are given in Appendix D.

Lemma 1.

Assume conditions A1, A2, A3, A5 and A6 are satisfied and further assume that (3) and (4) are satisfied. Then,

[TABLE]

where $\hat{\Theta}$ and $\hat{A}$ are given in Algorithm 2.

Lemma 2.

Suppose conditions A1, A2 and A4 are satisfied and further suppose that

[TABLE]

Then, $L_{N,J}(A^{*},\hat{A})\overset{pr}{\rightarrow}0.$

We complete the proof. ∎
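The reparameterization used at the start of this proof, $\tilde{\Theta}=\Theta^{*}\Sigma^{-\frac{1}{2}}$ and $\tilde{A}=A^{*}\Sigma^{\frac{1}{2}}$, leaves the product $\Theta^{*}(A^{*})^{\top}$ unchanged. The following numerical check is a small illustration with arbitrary dimensions, an arbitrary seed, and an example covariance matrix; the symmetric square root is computed from an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, K = 50, 8, 3
Theta = rng.standard_normal((N, K))
A = rng.uniform(1, 2, (J, K))

# Example covariance matrix: unit variances, off-diagonal entries 0.3.
Sigma = np.full((K, K), 0.3)
np.fill_diagonal(Sigma, 1.0)

# Symmetric square root and inverse square root via eigendecomposition.
w, V = np.linalg.eigh(Sigma)
Sigma_half = (V * np.sqrt(w)) @ V.T
Sigma_neg_half = (V / np.sqrt(w)) @ V.T

Theta_tilde = Theta @ Sigma_neg_half
A_tilde = A @ Sigma_half

# Largest entrywise discrepancy between the two factorizations.
max_diff = np.max(np.abs(Theta_tilde @ A_tilde.T - Theta @ A.T))
```

The discrepancy `max_diff` is at the level of floating-point rounding, reflecting the identity $\Sigma^{-\frac{1}{2}}\Sigma^{\frac{1}{2}}=I_{K}$.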

Appendix D Proof of Lemmas

Proof of Lemma 1.

We first give a lemma regarding the error bound for recovering the probability matrix $X^{*}.$

Lemma 3.

Given $X^{*}$ , we have

[TABLE]

Let

[TABLE]

where $C_{\epsilon}=h(2\epsilon)/C$ is a quantity depending on $\epsilon$ . Let

[TABLE]

Then, according to the condition (3)

[TABLE]

In what follows, we restrict the analysis to the event $\mathcal{A}_{N,J}$ . Let $\mathcal{G}_{1},\mathcal{G}_{2}$ be two $\delta$ -nets for $B^{(K)}_{0}(C_{\epsilon})$ and $B^{(K+1)}_{0}(C),$ respectively. This means $\mathcal{G}_{1}\subset B_{0}^{(K)}(C_{\epsilon}),\mathcal{G}_{2}\subset B_{0}^{(K+1)}(C)$ and

[TABLE]

For any $\bm{\theta}_{i}^{*},$ let $p(\bm{\theta}_{i}^{*})$ be a point in $\mathcal{G}_{1}$ such that

[TABLE]

which implies

[TABLE]

With a little abuse of notation, we use $p(\bm{\theta}_{i}^{+})$ to denote $(1,p(\bm{\theta}_{i}^{*})^{\top})^{\top}.$ For any $\mathbf{a}_{j}^{+},$ let $p(\mathbf{a}_{j}^{+})$ be a point in $\mathcal{G}_{2}$ such that

[TABLE]

It is not hard to see that we can find such $\mathcal{G}_{1},\mathcal{G}_{2}$ such that

[TABLE]

This is due to definition of $\mathcal{A}_{N,J}$ and condition A1. Let $M_{\delta}=(_{\delta}m_{ij})_{N\times J}$ , where ${}_{\delta}m_{ij}=f\left(p(\bm{\theta}_{i}^{+})^{\top}p(\mathbf{a}_{j}^{+})\right),$ then we have

[TABLE]

Now we provide an upper bound for $\|X^{*}\|_{*}$ on the right-hand side of (D.1). We have

[TABLE]

The second term on the right-hand side of the above display is bounded above by

[TABLE]

Now we consider the first term. We have

[TABLE]

So

[TABLE]

We have used the Lipschitz continuity in condition A3 here. Then the first term in (D.2) is bounded from above as

[TABLE]

Here we used the fact that the rank of the matrix $f(M^{*})-f(M_{\delta})$ cannot exceed $J$ according to condition A5. Combining (D.1), (D.2), (D.3) and (D.4), on the event $\mathcal{A}_{N,J},$

[TABLE]

Choose $\delta=\left(\frac{cC^{K+1}}{JL^{2}(\sqrt{C_{\epsilon}^{2}+1}+C)^{2}}\right)^{\frac{1}{K+3}}$ , then

[TABLE]

which implies

[TABLE]

where we define $g(N,J):=cC_{\epsilon}^{\frac{K+1}{K+3}}J^{\frac{-1}{K+3}}+c\exp(-cN)$ . By Chebyshev’s inequality, for any $\Delta_{N,J}>0$ ,

[TABLE]

Thus,

[TABLE]

Let $\mathcal{B}_{N,J}:=\mathcal{A}_{N,J}\cap\{\frac{1}{NJ}\|\tilde{X}-X^{*}\|_{F}^{2}\leq\frac{g(N,J)}{\Delta_{N,J}}\},$ then according to (D.5) for any sequence $\Delta_{N,J}$ satisfying $\Delta_{N,J}=o(1)$ , we have

[TABLE]

We will restrict our analysis on $\mathcal{B}_{N,J}$ in what follows. Let $h(N,J)=\frac{g(N,J)}{\Delta_{N,J}}$ , then on $\mathcal{B}_{N,J},$ we have $\frac{1}{NJ}\|\tilde{X}-X^{*}\|_{F}^{2}\leq h(N,J).$

Recall $C_{\epsilon}=\frac{h(2\epsilon)}{C}$ . Then, according to the definition of the function $h$ and $C_{\epsilon}$ , we can see that $f(CC_{\epsilon}),f(-CC_{\epsilon})\in[2\epsilon,1-2\epsilon].$ This interval is non-empty because $\epsilon\leq\frac{1}{4}.$ Thus, when the event $\mathcal{B}_{N,J}$ happens, we have $x^{*}_{ij}=f((\bm{\theta}_{i}^{+})^{\top}\mathbf{a}_{j}^{+})\in[2\epsilon,1-2\epsilon],$ which leads to

[TABLE]

Since $\hat{X}$ and $\tilde{X}$ are not far away from each other by definition, we can bound $\|\hat{X}-X^{*}\|_{F}^{2}$ by

[TABLE]

where the last inequality is because $\epsilon\leq\frac{1}{4}$ . According to condition A3 and the above inequality, we have

[TABLE]

The first inequality holds because $x_{ij}^{*},\hat{x}_{ij}\in[\epsilon,1-\epsilon]$ on the event $\mathcal{B}_{N,J}$ .

We proceed to an upper bound of $\hat{M}-\Theta^{*}(A^{*})^{\top}$ . Recall that $M^{*}=\mathbf{1}_{N}(d^{*})^{\top}+\Theta^{*}(A^{*})^{\top},\tilde{M}=\hat{M}+\mathbf{1}_{N}\hat{d}.$ Let $H_{1}=\hat{M}-\Theta^{*}(A^{*})^{\top}$ and $H_{2}=\mathbf{1}_{N}(\hat{d})^{\top}-\mathbf{1}_{N}(d^{*})^{\top}.$ We have

[TABLE]

We first bound the trace term in the above display,

[TABLE]

Through simple algebra, we have $d_{j}^{*}=\frac{1}{N}\sum_{i=1}^{N}\left(m^{*}_{ij}+(\bm{\theta}_{i}^{*})^{\top}\mathbf{a}_{j}^{*}\right).$ By the definition of $\hat{d}_{j},$ we have $\hat{d}_{j}=\frac{1}{N}\sum_{i=1}^{N}\tilde{m}_{ij}.$ Then

[TABLE]

which leads to

[TABLE]

So we can bound $\left|tr\{H_{1}^{\top}H_{2}\}\right|$ by

[TABLE]

According to condition A2 and the law of large numbers, we have

[TABLE]

for any $\xi>0.$ Let

[TABLE]

then we have

[TABLE]

for any $\xi>0.$ On $\mathcal{C}_{N,J,\xi},$ according to (D.7) , (D.10) and (D.11),

[TABLE]

Recalling how $\hat{\Theta}$ and $\hat{A}$ are obtained in Algorithm 2, we have

[TABLE]

So

[TABLE]

which leads to

[TABLE]

where the first inequality is due to $\textrm{rank}\big{(}\hat{\Theta}\hat{A}^{\top}-\Theta^{*}(A^{*})^{\top}\big{)}\leq 2K$ , the second inequality is due to (D.13) and the last inequality is due to (D.12). Thus, on the event $\mathcal{C}_{N,J,\xi}$

[TABLE]

Recall

[TABLE]

where $\Delta_{N,J}$ could be any sequence satisfying $\Delta_{N,J}=o(1)$. By (3), (4) and condition A5, there exists $\Delta_{N,J}=o(1)$ such that $\frac{h(N,J)}{(\epsilon g(\epsilon))^{2}}=o(1)$. Hence, for any fixed $\xi<1$ and $N,J$ large enough, we have $\frac{h(N,J)}{(\epsilon g(\epsilon))^{2}}\leq\xi.$ Then there is a constant $\kappa$ such that for $N,J$ large enough, on $\mathcal{C}_{N,J,\xi}$ with $\xi\in(0,1)$, we have

[TABLE]

This, combined with $\Pr(\mathcal{C}_{N,J,\xi})\to 1$ for any sufficiently small $\xi$, completes the proof. ∎

Proof of Lemma 2.

Let

[TABLE]

and in the following we will show that

[TABLE]

For any $\alpha>0,$ let

[TABLE]

Applying Theorem 5.39 of Vershynin (2010) to the matrix $\Theta^{*}$ , we have $\lim_{N,J\to\infty}\Pr(\mathcal{D}_{N,J,\alpha})=1$ for any $\alpha>0$ . We restrict our analysis on $\mathcal{D}_{N,J,\alpha}$ in what follows and denote

[TABLE]

Then,

[TABLE]

We consider $(b)$ first:

[TABLE]

For (a), notice that

[TABLE]

So

[TABLE]

Combining (D.17), (D.18) and (D.19), we get on $\mathcal{D}_{N,J,\alpha}$

[TABLE]

Since $Q(N,J)=\frac{1}{NJ}\|\hat{\Theta}\hat{A}^{\top}-\Theta^{*}(A^{*})^{\top}\|_{F}^{2}\overset{pr}{\rightarrow}0$, $\alpha$ can be arbitrarily small, and $\Pr(\mathcal{D}_{N,J,\alpha})\to 1,$ the proof is complete. ∎

Proof of Lemma 3.

This lemma is almost the same as Theorem 1.1 of Chatterjee (2015) by setting, in his notation, $\eta=0.02$ and $\sigma^{2}=1/4,$ except for two small differences. The first is that the probability $p$ is allowed to vary with $N$ and $J$ in the setting of Chatterjee (2015), while $p$ is a constant in our setting. Therefore we absorb $p$ into the constants $c$ on the left-hand side of (D.1). The second difference is a modification in step 5 of Algorithm 2, where we require $X$ to include at least $K+1$ singular values of $Z.$ This does not change the result of Theorem 1.1 of Chatterjee (2015), given the following lemma, which is based on Lemma 3.5 of Chatterjee (2015).

Lemma 4.

For fixed $0<m\leq n$ and an $m\times n$ matrix $A$, let $A=\sum_{i=1}^{m}\sigma_{i}x_{i}y_{i}^{\top}$ be the singular value decomposition of $A$. Fix any $\delta>0$ and integer $T>0$, and define

[TABLE]

where $l=\max\{T,\operatornamewithlimits{arg\,max}\{i:\sigma_{i}>(1+\delta)\|A-B\|\}\}.$ Then

[TABLE]

where $K(\delta)=(4+2\delta)\sqrt{2/\delta}+\sqrt{2+\delta}.$

Notice that we have another term $(1+\delta)\sqrt{T}\|A-B\|$ in (D.20) compared with Lemma 3.5 in Chatterjee (2015), which is due to the composition of $\tilde{B}.$ In the proof of Theorem 1.1 in Chatterjee (2015), by replacing Lemma 3.5 in Chatterjee (2015) by the above lemma with $T=K+1$ , we get

[TABLE]

The $1/J$ term in (D.21) results from the first term in (D.20). Notice that if

[TABLE]

then

[TABLE]

which leads to

[TABLE]

Therefore we can remove the $1/J$ term in (D.21) to complete the proof. ∎

Proof of Lemma 4.

Let

[TABLE]

and by Lemma 3.5 of Chatterjee (2015), we have

[TABLE]

Note that

[TABLE]

and we complete the proof by the triangle inequality. ∎

Bibliography


Bartholomew, D. J., Moustaki, I., Galbraith, J., and Steele, F. (2008). Analysis of multivariate social science data. CRC Press, Boca Raton, FL.
Bock, R. D. and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46:443–459.
Bock, R. D., Gibbons, R., and Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12:261–280.
Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research, 36:111–150.
Cai, L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm. Psychometrika, 75:33–57.
Cai, L. (2010b). Metropolis–Hastings Robbins–Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35(3):307–335.
Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43:177–214.
Chen, Y., Fan, J., Ma, C., and Yan, Y. (2019a). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116:22931–22937.
