semopy: A Python package for Structural Equation Modeling

Meshcheryakov Georgy; Igolkina Anna

arXiv:1905.09376·stat.AP·June 2, 2021


TL;DR

semopy is a Python package for Structural Equation Modeling that offers faster execution and higher accuracy than existing tools like lavaan, facilitating integration into modern data analysis pipelines.

Contribution

The paper introduces semopy, a new open-source Python package for SEM, with a unique model generator and improved performance over existing R-based tools.

Findings

01

semopy outperforms lavaan in execution time

02

semopy achieves higher accuracy in SEM estimation

03

The package includes extensive usage examples and a model generator

Abstract

Structural equation modelling (SEM) is a multivariate statistical technique for estimating complex relationships between observed and latent variables. Although numerous SEM packages exist, each of them has limitations. Some packages are not free or open-source; the most popular package without this disadvantage is lavaan, but it is written in R, which is behind current mainstream tendencies and is therefore harder to incorporate into development pipelines (e.g. bioinformatics ones). We therefore developed the Python package semopy to satisfy those criteria. The paper provides detailed examples of package usage and explains its inner workings. Moreover, we developed a unique generator of SEM models to extensively test SEM packages and demonstrated that semopy significantly outperforms lavaan in execution time and accuracy.

Tables

Table 1: Here $n$ is the number of data samples, $F(\hat{\theta})$ is the value that the objective function attains at the optimum, $\chi^{2}_{m}$ is the $\chi^{2}$ statistic for the target model, $\chi^{2}_{b}$ is the $\chi^{2}$ statistic for the baseline model, $df_{m}$ is the $df$ of the target model and $df_{b}$ is the $df$ of the baseline model, $k$ is the number of parameters and $L$ is the value of the likelihood function.

Fit index	Formula	Method
$\chi^{2}$	$n F(\hat{\theta})$	calc_chi2
RMSEA	$\sqrt{\frac{\chi^{2}/df - 1}{n - 1}}$	calc_rmsea
GFI	$1 - \frac{\chi_{m}^{2}}{\chi_{b}^{2}}$	calc_gfi
AGFI	$1 - \frac{k(k + 1)}{2\,df}(1 - \mathrm{GFI})$	calc_agfi
NFI	$\frac{\chi_{b}^{2} - \chi_{m}^{2}}{\chi_{b}^{2}}$	calc_nfi
TLI	$\frac{\frac{\chi_{b}^{2}}{df_{b}} - \frac{\chi_{m}^{2}}{df_{m}}}{\frac{\chi_{b}^{2}}{df_{b}} - 1}$	calc_tli
CFI	$1 - \frac{\chi_{m}^{2} - df_{m}}{\chi_{b}^{2} - df_{b}}$	calc_cfi
AIC	$2(k - L)$	calc_aic
BIC	$\ln(n)\,k - 2L$	calc_bic

Equations

$$\begin{cases}\eta = \mathrm{B}\eta + \varepsilon,\\ y = \Lambda\eta + \delta,\end{cases}$$

$$\eta_{3} = \beta_{1}x_{1} + \beta_{2}x_{2} + \varepsilon,$$

$$\begin{cases}\begin{bmatrix}\eta\\ x\end{bmatrix} = \mathrm{B}\begin{bmatrix}\eta\\ x\end{bmatrix} + \varepsilon,\\ y = \tilde{\Lambda}\eta + \delta,\end{cases}$$

$$\omega = \begin{bmatrix}\eta\\ x\end{bmatrix};\quad z = \begin{bmatrix}y\\ x\end{bmatrix},$$

$$\begin{cases}\omega = \mathrm{B}\omega + \varepsilon,\\ z = \Lambda\omega + \delta,\end{cases}$$

$$\mathrm{cov}(z) = \Sigma(\theta)$$

$$\Lambda = \begin{bmatrix}\tilde{\Lambda}_{(n_{y}\times n_{\eta})} & 0_{(n_{y}\times n_{x})}\\ 0_{(n_{x}\times n_{\eta})} & I_{(n_{x}\times n_{x})}\end{bmatrix},$$

$$\Psi = \begin{bmatrix}\Psi_{\eta^{(1)}} & \Psi_{\eta^{(1)},\eta^{(2)}}^{\intercal} & \Psi_{\eta^{(1)},x^{(1)}}^{\intercal} & \Psi_{\eta^{(1)},x^{(2)}}^{\intercal}\\ \Psi_{\eta^{(1)},\eta^{(2)}} & \Psi_{\eta^{(2)}} & \Psi_{\eta^{(2)},x^{(1)}}^{\intercal} & \Psi_{\eta^{(2)},x^{(2)}}^{\intercal}\\ \Psi_{\eta^{(1)},x^{(1)}} & \Psi_{\eta^{(2)},x^{(1)}} & \Psi_{x^{(1)}} & \Psi_{x^{(1)},x^{(2)}}^{\intercal}\\ \Psi_{\eta^{(1)},x^{(2)}} & \Psi_{\eta^{(2)},x^{(2)}} & \Psi_{x^{(1)},x^{(2)}} & \Psi_{x^{(2)}}\end{bmatrix},$$

$$\Theta = \begin{bmatrix}\tilde{\Theta}_{(n_{y}\times n_{y})} & 0_{(n_{y}\times n_{x})}\\ 0_{(n_{x}\times n_{y})} & 0_{(n_{x}\times n_{x})}\end{bmatrix}.$$

$$F_{ULS}(\theta) = \mathrm{tr}\left[(\Sigma(\theta) - S)(\Sigma(\theta) - S)^{\intercal}\right],$$

$$F_{GLS}(\theta) = \mathrm{tr}\left[(E - \Sigma(\theta)S^{-1})^{2}\right]$$

$$W(S \mid n^{-1}\Sigma, n) \propto \frac{e^{-\mathrm{tr}(nS\Sigma^{-1})/2}\,|nS|^{(n-p-1)/2}}{|\Sigma|^{n/2}} \propto e^{-\mathrm{tr}(nS\Sigma^{-1})/2}\,|\Sigma|^{-n/2},$$

$$F_{MLW}(\theta) = \mathrm{tr}\left[S\Sigma(\theta)^{-1}\right] + \ln|\Sigma(\theta)|$$

$$Z(\hat{\theta}) = \frac{\hat{\theta}}{SE(\hat{\theta})},$$

$$\mathrm{var}(\hat{\theta}) \geq \mathrm{FIM}(\theta)^{-1},$$


Full text

semopy: A Python package for Structural Equation Modeling

Georgy Meshcheryakov, Anna A. Igolkina


1 Introduction

Structural Equation Modelling (SEM) can be defined as a diverse set of tools and approaches for describing and estimating causal relationships between variables, whether observed or latent. The earliest SEM models (path analysis) were established in the first half of the 20th century by the geneticist and statistician Sewall Green Wright and dealt only with observed variables [31]. Over the years, approaches for working with latent variables have been developed (e.g. factor analysis), and in modern SEM models researchers can specify complex hybrids of path analysis models, confirmatory factor analysis (CFA) models [19] and multivariate regression models. Being an umbrella term for several statistical approaches, SEM uses particular statistical methods to estimate relationships between variables by describing the variance-covariance structure of the data via model parameters [2].

The first SEM model is LISREL (linear structural relations), which contains two parts: (i) the structural part links latent variables to each other via a system of linear equations; (ii) the measurement part specifies linear influences of latent variables on observed variables [2]. Let $\eta$ be a vector of latent variables and $y$ be a vector of observed (manifest) variables; then the two parts of the LISREL model are as follows:

$$\begin{cases}\eta = \mathrm{B}\eta + \varepsilon,\\ y = \Lambda\eta + \delta,\end{cases} \tag{1}$$

where $\mathrm{B}$ and $\Lambda$ are matrices of linear parameters, and $\varepsilon$ and $\delta$ are independent error terms. Following the LISREL notation, complex SEM models are traditionally split into two parts: structural and measurement. The semopy package supports a general SEM model that allows observed variables in the structural part and also supports ordinal variables.

In this paper, we discuss the prerequisites for the development of the semopy package, explain the user-friendly syntax used for specifying an SEM model and provide quick-start information on the usage of the package. Then, we provide implementation details such as the underlying mathematical model and heuristics for choosing starting values, and list the objective functions and optimization techniques at the user's disposal. We also briefly explain statistics such as p-values and fit indices and introduce a testing framework that generates random sets of models and data. Finally, we compare semopy to lavaan [12], the state-of-the-art CRAN R package that implements SEM functionality.

1.1 Why do we need semopy?

Nowadays SEM is widely used in economics, psychology, sociology and bioinformatics [18, 28, 15], and a number of software packages work with SEM models. Most of them are either commercially distributed and closed-source or do not cover the whole set of popular programming languages, e.g. Python.

For instance, LISREL [9], Mplus [25] and EQS [1] are proprietary commercial software; hence, they cannot be adjusted to satisfy researchers' needs. OpenMx [26], sem [8] and lavaan are popular free and open-source CRAN packages, but they are all written in R. The only Python package is pypsy [5], but it is limited to basic SEM functionality and lacks documentation. Our reasons for developing the new SEM package, semopy, are as follows:

• the need for an SEM package that can be easily integrated into development and research pipelines in Python (especially bioinformatics ones) [4, 3];

• the wish to outperform the most cited open-source package, lavaan, in execution time and accuracy;

• the lack of a thorough testing technique for new SEM methods and approaches.

1.2 Model syntax

To specify SEM models, semopy uses the lavaan syntax, which is the natural way to describe regression models in R. The syntax supports three operator symbols characterising relationships between variables:

• ~ to specify the structural part,

• =~ to specify the measurement part,

• ~~ to specify common variance between variables.

For example, let a linear equation in the structural part of SEM model take the form:

$$\eta_{3} = \beta_{1}x_{1} + \beta_{2}x_{2} + \varepsilon,$$

where $\eta_{3}$ is a variable dependent on regressors $x_{1}$ and $x_{2}$ (Fig. 1), $\beta_{1}$ and $\beta_{2}$ are parameters, and $\varepsilon$ is an error term. In semopy syntax it can be rewritten as

eta3 ~ x1 + x2

Likewise, to specify the measurement part, which relates manifest variables to latent variables, we use the special operator =~, which can be read as "is measured by". The left side of the operator contains one latent variable, and the right side contains its manifest variables separated by plus signs. For example, to define a latent variable $\eta_{1}$ by three indicators ($y_{1}$, $y_{2}$ and $y_{3}$) (Fig. 1), the following expression should be used:

eta1 =~ y1 + y2 + y3

The third operator, ~~, is used to specify a covariance (common variance) between a pair of variables from one part:

x1 ~~ x2
eta ~~ x3

Another option in the semopy syntax is fixing parameter values; the values should be placed as prefixes before variables:

eta ~ 1*x1 + x2
eta =~ 2*y1 + y2 + y3
x1 ~~ 5*x2

We designed an example (Fig. 1) to demonstrate the diversity of relationships between variables that can be estimated in semopy. The example contains two exogenous latent variables $\eta_{1},\eta_{2}$, endogenous latent variables $\eta_{3},\eta_{4}$ ($\eta_{3}$ also being an output variable), exogenous observed variables $x_{1},x_{2}$, endogenous observed variables $x_{3},x_{4},x_{5}$ (the latter being an output variable) and a set of manifest variables $y_{1},y_{2},y_{3},y_{4},y_{5},y_{6}$ (note that $y_{3}$ and $y_{4}$ are shared between $\eta_{1},\eta_{2}$ and $\eta_{3},\eta_{4}$, respectively). There is also a cycle in the model ($x_{3}\rightarrow\eta_{4}\rightarrow x_{4}\rightarrow x_{3}$). Moreover, we set additional parameters for the covariances between $\eta_{2}$ and $x_{2}$ and between $y_{6}$ and $y_{5}$.

The description of the model in the semopy syntax is as follows:

# Structural part
eta3 ~ x1 + x2
eta4 ~ x3
x3 ~ eta1 + eta2 + x1 + x4
x4 ~ eta4
x5 ~ x4
# Measurement part
eta1 =~ y1 + y2 + y3
eta2 =~ y3
eta3 =~ y4 + y5
eta4 =~ y4 + y6
# Additional covariances
eta2 ~~ x2
y5 ~~ y6
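To make the operator semantics concrete, the following sketch splits a model description into structural, measurement and covariance statements. It is a hypothetical mini-parser written for illustration only: `parse_model` and `OPERATORS` are not part of semopy, and fixed-value prefixes such as `1*x1` are left inside the term strings rather than interpreted.

```python
# Hypothetical mini-parser for the three operators; NOT semopy's own parser.
# "~" must be tested last because it is a substring of "=~" and "~~".
OPERATORS = ("=~", "~~", "~")


def parse_model(description):
    """Group statements by operator as (left-hand side, [right-hand terms])."""
    parsed = {op: [] for op in OPERATORS}
    for raw_line in description.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        for op in OPERATORS:
            if op in line:
                lhs, rhs = line.split(op, 1)
                terms = [term.strip() for term in rhs.split("+")]
                parsed[op].append((lhs.strip(), terms))
                break
    return parsed


example = """
# Structural part
eta3 ~ x1 + x2
eta4 ~ x3
x3 ~ eta1 + eta2 + x1 + x4
x4 ~ eta4
x5 ~ x4
# Measurement part
eta1 =~ y1 + y2 + y3
eta2 =~ y3
eta3 =~ y4 + y5
eta4 =~ y4 + y6
# Additional covariances
eta2 ~~ x2
y5 ~~ y6
"""
parsed = parse_model(example)
```

Testing the two-character operators before the single tilde is the only subtle point; the rest is plain string splitting.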

In semopy, we introduced an additional operator that specifies the statistical type of variables. The word "is" is reserved as the operator symbol for type specification:

y1, y2 is ordinal

This line sets $y_{1},y_{2}$ to the ordinal type.

1.3 Quickstart

The semopy package is available from the PyPI software repository and can be installed by:

pip install semopy

The pipeline for working with SEM models in semopy consists of three steps: (i) specifying a model, (ii) loading a dataset into the model, (iii) estimating the parameters of the model. The two main objects required for specifying and estimating an SEM model are Model and Optimizer.

Model is responsible for setting up a model from the proposed SEM syntax:

# The first step
from semopy import Model
mod = """ x1 ~ x2 + x3
          x3 ~ x2 + eta1
          eta1 =~ y1 + y2 + y3
          eta1 ~ x1
      """
model = Model(mod)

Then a dataset should be provided; at this step the initial values of parameters are calculated:

# The second step
from pandas import read_csv
data = read_csv("my_data_file.csv", index_col=0)
model.load_dataset(data)

To estimate the parameters of the model, an Optimizer object should be initialised and the estimation executed:

# The third step
from semopy import Optimizer
opt = Optimizer(model)
objective_function_value = opt.optimize()

The default objective function for estimating parameters is the likelihood function, and the default optimisation method is SLSQP (Sequential Least-Squares Quadratic Programming). However, semopy supports a wide range of other objective functions and optimisation schemes, which are specified as parameters of the optimize method (see Section 2.4).

When the optimization process is finished, the user can check the parameter estimates and p-values using the inspect method:

from semopy import inspect
inspect(opt)

An important feature of semopy is that one can run multiple optimization sessions in a row while preserving the previous parameter estimates. For instance, to use unweighted least squares estimates as a starting point for a maximum likelihood estimation, one can run:

model = Model(mod)
model.load_dataset(data)
opt = Optimizer(model)
opt.optimize(objective='ULS')
opt.optimize(objective='MLW')

2 Materials and methods

In this section, we explain in detail the underlying calculations in semopy. We denote latent variables by $\eta$, observed variables participating in relationships with latent variables by $x$ and manifest variables by $y$; let the numbers of these variables be $n_{\eta}$, $n_{x}$ and $n_{y}$, respectively. We also assume that all variables are normally distributed with zero means. Each group of variables, $\eta$ and $x$, can be split into two categories: (1) exogenous and (2) endogenous; we denote them by $\eta^{(1)}$ and $\eta^{(2)}$, $x^{(1)}$ and $x^{(2)}$, respectively. Let the numbers of variables in the resulting four groups be $n_{\eta^{(1)}}$ and $n_{\eta^{(2)}}$, $n_{x^{(1)}}$ and $n_{x^{(2)}}$, so that $n_{\eta}=n_{\eta^{(1)}}+n_{\eta^{(2)}}$ and $n_{x}=n_{x^{(1)}}+n_{x^{(2)}}$.

2.1 Model

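As usual for SEM models, we assume that the model consists of two parts: structural and measurement. We consider the following generalisation of Eq. 1:

$$\begin{cases}\begin{bmatrix}\eta\\ x\end{bmatrix} = \mathrm{B}\begin{bmatrix}\eta\\ x\end{bmatrix} + \varepsilon,\\ y = \tilde{\Lambda}\eta + \delta,\end{cases} \tag{2}$$

where $\tilde{\Lambda}$ is a factor loading matrix of linear parameters. We introduce the following change of variables:

$$\omega = \begin{bmatrix}\eta\\ x\end{bmatrix};\quad z = \begin{bmatrix}y\\ x\end{bmatrix},$$

and Eq. 2 can be rewritten as:

$$\begin{cases}\omega = \mathrm{B}\omega + \varepsilon,\\ z = \Lambda\omega + \delta,\end{cases} \tag{3}$$

where $\mathrm{B}$ and $\Lambda$ are matrices of parameters, and $\varepsilon$ and $\delta$ are vectors of error terms, which are assumed to be independent and normally distributed with zero means and covariances $\Psi$ and $\Theta$, respectively.

Then we can express the covariance matrix of $z$ as a function of the model parameters:

$$\mathrm{cov}(z) = \Sigma(\theta)$$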
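Eq. 3 determines $\Sigma(\theta)$ in closed form. Whenever $I-\mathrm{B}$ is invertible, solving the structural part for $\omega$ gives $\omega=(I-\mathrm{B})^{-1}\varepsilon$, and since $\varepsilon$ and $\delta$ are independent, $\mathrm{cov}(z)=\Lambda(I-\mathrm{B})^{-1}\Psi(I-\mathrm{B})^{-\intercal}\Lambda^{\intercal}+\Theta$. The numpy sketch below evaluates this expression; the function name `model_covariance` is illustrative and not part of semopy, and the toy matrices are made up for the example.

```python
import numpy as np


def model_covariance(B, Lambda, Psi, Theta):
    """Sigma(theta) = Lambda C Psi C^T Lambda^T + Theta, with C = (I - B)^{-1}."""
    n_omega = B.shape[0]
    C = np.linalg.inv(np.eye(n_omega) - B)  # requires I - B to be invertible
    return Lambda @ C @ Psi @ C.T @ Lambda.T + Theta


# Toy model with two observed variables and the single regression x2 ~ 0.5*x1.
B = np.array([[0.0, 0.0],
              [0.5, 0.0]])
Lambda = np.eye(2)           # no latent variables, so z = omega
Psi = np.diag([1.0, 0.75])   # var(x1) = 1, residual variance of x2 = 0.75
Theta = np.zeros((2, 2))
sigma = model_covariance(B, Lambda, Psi, Theta)
```

In this toy case the implied variance of x2 is 0.5² · 1 + 0.75 = 1 and the implied covariance of x1 and x2 is 0.5.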

Parameters in $\mathrm{B}$ matrix

The size of the $\mathrm{B}$ matrix corresponds to the size of the $\omega$ vector and equals $(n_{\eta}+n_{x})\times(n_{\eta}+n_{x})$. A block representation of this matrix is not necessary, and we set the initial values of all parameters in the $\mathrm{B}$ matrix to zero.

Parameters in $\Lambda$ matrix

The size of the $\Lambda$ matrix matches the sizes of the $z$ and $\omega$ vectors and equals $(n_{y}+n_{x})\times(n_{\eta}+n_{x})$; the block representation of $\Lambda$ is:

$$\Lambda = \begin{bmatrix}\tilde{\Lambda}_{(n_{y}\times n_{\eta})} & 0_{(n_{y}\times n_{x})}\\ 0_{(n_{x}\times n_{\eta})} & I_{(n_{x}\times n_{x})}\end{bmatrix},$$

$\tilde{\Lambda}$ is an adjacency matrix for variables in the measurement part (Eq. 2).

To define the scale of latent variables, we automatically fix one parameter in each column of $\tilde{\Lambda}$ to 1. Specifically, in column $i$ this parameter is the factor loading linking the latent variable $\eta_{i}$ to the manifest variable that comes first in alphabetical order among all manifest variables of $\eta_{i}$; we call this variable the first indicator. The initial values of the remaining factor loadings of $\eta_{i}$ are the linear regression coefficients obtained by regressing each manifest variable on the first indicator as the single regressor.
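The starting-value heuristic for the loadings can be sketched with numpy as follows. The function name `initial_loadings` and the convention that the first column holds the first indicator are assumptions made for this illustration; semopy's internal code may differ.

```python
import numpy as np


def initial_loadings(indicators):
    """Starting loadings for the manifest variables of one latent variable.

    `indicators` is an (n_samples, n_indicators) array whose first column is
    the first indicator. Its loading is fixed to 1; every other loading starts
    at the slope of a simple regression of that column on the first indicator.
    """
    first = indicators[:, 0] - indicators[:, 0].mean()
    sum_sq_first = first @ first
    loadings = [1.0]
    for j in range(1, indicators.shape[1]):
        y = indicators[:, j] - indicators[:, j].mean()
        loadings.append(float(first @ y / sum_sq_first))  # OLS slope cov/var
    return np.array(loadings)


rng = np.random.default_rng(0)
y1 = rng.normal(size=200)
# Noise-free indicators, so the slopes are recovered exactly.
manifest = np.column_stack([y1, 2.0 * y1, -0.5 * y1 + 3.0])
start = initial_loadings(manifest)
```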

Parameters in $\Psi$ matrix

The matrix $\Psi$ is the square covariance matrix of the error term $\varepsilon$ (Eq. 3); its size is $(n_{\eta}+n_{x})\times(n_{\eta}+n_{x})$ or, in other terms, $(n_{\eta^{(1)}}+n_{\eta^{(2)}}+n_{x^{(1)}}+n_{x^{(2)}})\times(n_{\eta^{(1)}}+n_{\eta^{(2)}}+n_{x^{(1)}}+n_{x^{(2)}})$. This matrix is symmetric and can be presented in block form:

$$\Psi = \begin{bmatrix}\Psi_{\eta^{(1)}} & \Psi_{\eta^{(1)},\eta^{(2)}}^{\intercal} & \Psi_{\eta^{(1)},x^{(1)}}^{\intercal} & \Psi_{\eta^{(1)},x^{(2)}}^{\intercal}\\ \Psi_{\eta^{(1)},\eta^{(2)}} & \Psi_{\eta^{(2)}} & \Psi_{\eta^{(2)},x^{(1)}}^{\intercal} & \Psi_{\eta^{(2)},x^{(2)}}^{\intercal}\\ \Psi_{\eta^{(1)},x^{(1)}} & \Psi_{\eta^{(2)},x^{(1)}} & \Psi_{x^{(1)}} & \Psi_{x^{(1)},x^{(2)}}^{\intercal}\\ \Psi_{\eta^{(1)},x^{(2)}} & \Psi_{\eta^{(2)},x^{(2)}} & \Psi_{x^{(1)},x^{(2)}} & \Psi_{x^{(2)}}\end{bmatrix},$$

where the blocks are covariance matrices of the variables named in their indices. By default, we assume that the block $\Psi_{x^{(1)}}$ is fixed and equal to the sample covariance matrix of the $x^{(1)}$ variables (exogenous observed variables). For $\eta^{(1)}$ (exogenous latent variables), we assume that the symmetric covariance matrix $\Psi_{\eta^{(1)}}$ is fully parametrised. To prevent covariances between $x^{(1)}$ and $\eta^{(1)}$ by default, we set $\Psi_{\eta^{(1)},x^{(1)}}$ to the zero matrix. We also treat the covariance matrices between endogenous and exogenous variables as zero matrices. In the remaining covariance matrices between endogenous variables (latent and observed), $\Psi_{\eta^{(2)}}$, $\Psi_{x^{(2)}}$ and $\Psi_{\eta^{(2)},x^{(2)}}$, we place parameters at the positions of variances and at the positions of covariances between variables that do not act as regressors in any equation of the structural part.

Parameters in $\Theta$ matrix

The $\Theta$ matrix is a symmetric square 4-block matrix of size $(n_{y}+n_{x})\times(n_{y}+n_{x})$ with only one non-zero block ($\tilde{\Theta}$):

$$\Theta = \begin{bmatrix}\tilde{\Theta}_{(n_{y}\times n_{y})} & 0_{(n_{y}\times n_{x})}\\ 0_{(n_{x}\times n_{y})} & 0_{(n_{x}\times n_{x})}\end{bmatrix}.$$

By default we initialise $\tilde{\Theta}$ as a diagonal matrix; however, the semopy syntax allows its off-diagonal elements to be parametrised. We set the starting value of a diagonal element $\tilde{\Theta}_{i,i}$ to half of the sample variance of the manifest variable $y_{i}$.

Example

For the model in Fig. 1, the positions of parameters in the matrices are presented in Fig. 2.

2.2 Loading the data

The semopy supports SEM models that contain not only continuous variables (assumed as normally distributed) but also discrete ones (assumed as ordinal). By default, all variables are assumed to be normally distributed. However, it is possible to manually specify the type of variables or to allow the automatic recognition of types.

# Providing a list of ordinal variables
variables = {'y1', 'y2'}
model.load_dataset(data, ordcor=variables)
# Load data with automatic recognition of types
model.load_dataset(data, ordcor=True)
# Check the set of ordinal variables
print(model.vars['Categorical'])

The presence of ordinal variables makes the sample covariance matrix $S$ "heterogeneous", i.e. containing polychoric (between ordinal variables) and polyserial (between ordinal and continuous variables) correlations [7].

2.3 Objective functions

semopy allows the user to choose one of three objective (loss) functions for parameter estimation. Two of them are based on a least-squares approach, and the remaining one (the default) represents the maximum likelihood approach. It should be noted that all objective functions reflect (in different ways) the distance between the sample covariance matrix $\mathrm{S}$ and the model covariance matrix $\Sigma(\theta)$ of the observed variables $z$. In this section, we present the objective functions, whose first and second derivatives are used in the optimisation methods.

The objective function can be set as an argument of the optimize method of the Optimizer class:

opt.optimize(objective='ULS') # Unweighted least squares
opt.optimize(objective='GLS') # General least squares
opt.optimize(objective='MLW') # Wishart likelihood function

Unweighted Least Squares (ULS)

The ULS objective function can be written as follows:

$$F_{ULS}(\theta) = \mathrm{tr}\left[(\Sigma(\theta) - S)(\Sigma(\theta) - S)^{\intercal}\right],$$

where $\theta$ is the set of all parameters in an SEM model. To accelerate some optimisation methods, we derived formulas for the components of the gradient and the Hessian of $F_{ULS}(\theta)$.

General Least Squares (GLS)

In contrast to the ULS loss function, the GLS approach considers the following loss function:

$$F_{GLS}(\theta) = \mathrm{tr}\left[(E - \Sigma(\theta)S^{-1})^{2}\right]$$

Formulas that we derived for the components of the gradient and the Hessian are available as well.
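As an illustration, both least-squares objectives can be evaluated directly from their definitions with numpy. The function names are illustrative rather than semopy's API, and $E$ is taken to be the identity matrix, which is an assumption about the notation.

```python
import numpy as np


def f_uls(sigma, s):
    """ULS: tr[(Sigma - S)(Sigma - S)^T], the sum of squared residuals."""
    diff = sigma - s
    return float(np.trace(diff @ diff.T))


def f_gls(sigma, s):
    """GLS: tr[(E - Sigma S^{-1})^2], with E assumed to be the identity."""
    residual = np.eye(s.shape[0]) - sigma @ np.linalg.inv(s)
    return float(np.trace(residual @ residual))


s = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # toy sample covariance matrix
perfect_fit = (f_uls(s, s), f_gls(s, s))
identity_fit = f_uls(np.eye(2), s)  # (1-2)^2 + 2*0.5^2 + 0 = 1.5
```

Both functions vanish when the model covariance reproduces the sample covariance exactly; GLS differs from ULS in that the residual is weighted by $S^{-1}$.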

Wishart Maximum Likelihood (MLW)

The MLW objective function is based on the assumption that the observed variables $z$ follow the multivariate normal distribution; therefore, the sample covariance matrix of $z$ follows the Wishart distribution:

$$W(S \mid n^{-1}\Sigma, n) \propto \frac{e^{-\mathrm{tr}(nS\Sigma^{-1})/2}\,|nS|^{(n-p-1)/2}}{|\Sigma|^{n/2}} \propto e^{-\mathrm{tr}(nS\Sigma^{-1})/2}\,|\Sigma|^{-n/2},$$

where $n$ is the number of degrees of freedom (sample size) and $p$ is the number of parameters ($=|\theta|$). In this case, the log-likelihood ratio, i.e. the natural logarithm of the ratio of the likelihood of a given model to the likelihood of a perfectly fitting model ($\Sigma=S$), is as follows:

$$F_{MLW}(\theta) = \mathrm{tr}\left[S\Sigma(\theta)^{-1}\right] + \ln|\Sigma(\theta)|$$

We also derived the analytical gradient and Hessian of $F_{MLW}$.
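A minimal numpy sketch of $F_{MLW}$ is given below. The name `f_mlw` is illustrative, not semopy's API, and returning NaN when the determinant of $\Sigma$ is not positive is a choice made for this sketch rather than documented package behaviour.

```python
import numpy as np


def f_mlw(sigma, s):
    """Wishart ML objective: tr[S Sigma^{-1}] + ln|Sigma|."""
    sign, log_det = np.linalg.slogdet(sigma)
    if sign <= 0:
        # ln|Sigma| is undefined when the determinant is not positive.
        return float("nan")
    return float(np.trace(s @ np.linalg.inv(sigma)) + log_det)


s = np.array([[2.0, 0.5],
              [0.5, 1.0]])              # toy sample covariance matrix
at_sample = f_mlw(s, s)                 # 2 + ln(1.75)
at_identity = f_mlw(np.eye(2), s)       # tr(S) + ln(1) = 3
undefined = f_mlw(np.array([[1.0, 2.0], [2.0, 1.0]]), s)  # det = -3
```

Using `slogdet` instead of `log(det(...))` avoids overflow and underflow of the determinant for larger matrices.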

2.4 Optimisation methods

To minimise an objective function, semopy offers a variety of nonlinear solvers. The optimisation method can be selected through the method argument of the optimize function of the Optimizer class, for example:

opt.optimize(method="SLSQP")

The full list of available optimisation methods is the following:

• SLSQP: Sequential Least-Squares Quadratic Programming method [21, 22] from scipy;

• L-BFGS-B: limited-memory Broyden-Fletcher-Goldfarb-Shanno method [27, 32] from scipy;

• Portmin: FORTRAN PORT optimization library [11] wrapped with the Python portmin wrapper. It incorporates the SMSNO, SUMSL and HUMSL routines for cases when no analytical gradient is available, when an analytical gradient is available and when both an analytical gradient and Hessian are available, respectively;

• Adam: stochastic optimisation method Adam [6, 30], our implementation;

• Nesterov: stochastic Nesterov Accelerated Gradient method [30], our implementation;

• SGD: Stochastic Gradient Descent method [30], our implementation.
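The first two solvers in the list are exposed by scipy's `minimize`. The sketch below fits a toy one-factor model by ULS and moves on to the next solver if the first one does not report convergence. The toy model, the helper names and the fallback loop are assumptions made for illustration; they do not reproduce semopy's internals.

```python
import numpy as np
from scipy.optimize import minimize


def sigma_one_factor(theta):
    """Covariance implied by one latent factor with three indicators.

    theta = (lambda2, lambda3, psi, theta1, theta2, theta3); the first
    loading is fixed to 1 to set the scale of the latent variable.
    """
    lam = np.array([1.0, theta[0], theta[1]])
    return theta[2] * np.outer(lam, lam) + np.diag(theta[3:6])


def f_uls(theta, s):
    """ULS objective: sum of squared differences between Sigma(theta) and S."""
    diff = sigma_one_factor(theta) - s
    return float(np.sum(diff * diff))


def fit(s, theta0, methods=("SLSQP", "L-BFGS-B")):
    """Try each solver in turn; return the first one that reports success."""
    result = None
    for method in methods:
        result = minimize(f_uls, theta0, args=(s,), method=method)
        if result.success and np.all(np.isfinite(result.x)):
            return method, result
    return methods[-1], result


true_theta = np.array([0.8, 1.2, 1.5, 0.5, 0.5, 0.5])
s = sigma_one_factor(true_theta)  # noise-free "sample" covariance matrix
theta0 = np.array([1.0, 1.0, 1.0, 0.5, 0.5, 0.5])
used_method, result = fit(s, theta0)
```

Because no gradient is passed, scipy approximates it by finite differences; supplying analytical derivatives would go through the `jac` argument of `minimize`.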

2.5 Statistics

semopy provides methods to calculate important statistics: p-values for parameter estimates and various measures of fit.

P-values for parameter estimates

semopy uses the Z-test to calculate p-values for parameter estimates under the assumption that parameters are normally distributed; $H_{0}$: the value of a parameter is equal to zero. The approach considers the z-score:

$$Z(\hat{\theta}) = \frac{\hat{\theta}}{SE(\hat{\theta})},$$

where $\hat{\theta}$ is a vector of parameter estimates and $SE(\hat{\theta})$ is the standard error of the estimates, which is proportional to the variance: $SE(\hat{\theta})=var(\hat{\theta})/\sqrt{n}$.

Based on the Cramér–Rao bound,

$$\mathrm{var}(\hat{\theta}) \geq \mathrm{FIM}(\theta)^{-1},$$

where $\mathrm{FIM}(\theta)$ is the Fisher information matrix (FIM). The $\mathrm{FIM}(\theta)$ matrix can be defined as an observed or expected FIM. The observed FIM is the Hessian of the likelihood function at $\hat{\theta}$, $H(\hat{\theta})$.

P-values for parameter estimates are based on the Z-test assumption that $Z(\theta)$ follows the multivariate normal distribution. By default, the expected FIM is used, but a user may want to use the observed FIM instead:

from semopy.stats import calculate_p_values
pvals = calculate_p_values(opt, information='observed')
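For a single parameter, the Z-test reduces to a few lines. The sketch assumes a two-sided alternative, which the text does not state explicitly, and takes the standard error as given; `z_test_p_value` is an illustrative name, not a semopy function.

```python
import math


def z_test_p_value(estimate, std_error):
    """Two-sided p-value for H0: parameter = 0 under a standard normal z-score."""
    z = estimate / std_error
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


p_null = z_test_p_value(0.0, 1.0)       # z = 0: no evidence against H0
p_border = z_test_p_value(1.96, 1.0)    # z = 1.96: close to the 5% level
p_strong = z_test_p_value(0.9, 0.1)     # z = 9: very small p-value
```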

Fit indices

semopy supports numerous fit indices: $\chi^{2}$, RMSEA, CFI, TLI, NFI, GFI, AGFI [20, 17]. Some fit indices compare the fitted model with a baseline (null or independence) model. In semopy, the baseline model is a model with all regression coefficients, factor loadings, covariances and variances of latent variables set to zero. Several methods use the degrees of freedom ($df$), $df=\frac{k(k+1)}{2}-m$, where $k$ is the number of observed variables and $m$ is the number of parameters.

A fit index can be calculated by invoking the corresponding method with an Optimizer instance passed as an argument. For example, to calculate AIC or BIC one should invoke the calc_aic or calc_bic method, respectively. They estimate $L$ as a Wishart likelihood by default. However, one can supply the extra parameter lh as the value of $L$:

from semopy import calc_aic, calc_likelihood
# Calculate multivariate normal likelihood of model fitted by opt
l = calc_likelihood(opt, dist='normal')
aic = calc_aic(opt, lh=l)

Alternatively, one can gather all the statistics explained above with a single call to the gather_statistics method:

from semopy import gather_statistics
stats = gather_statistics(opt)
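The formulas of Table 1 translate directly into code. The functions below are an illustrative transcription of those formulas, not semopy's `calc_*` implementations; clamping RMSEA at zero when $\chi^{2}<df$ is a common convention that is assumed here, and the numbers in the example are made up.

```python
import math


def degrees_of_freedom(n_observed, n_params):
    """df = k(k+1)/2 - m for k observed variables and m parameters."""
    return n_observed * (n_observed + 1) // 2 - n_params


def rmsea(chi2, df, n):
    # Assumption: negative values under the root are clamped to zero.
    return math.sqrt(max(chi2 / df - 1.0, 0.0) / (n - 1))


def cfi(chi2_m, df_m, chi2_b, df_b):
    return 1.0 - (chi2_m - df_m) / (chi2_b - df_b)


def tli(chi2_m, df_m, chi2_b, df_b):
    return (chi2_b / df_b - chi2_m / df_m) / (chi2_b / df_b - 1.0)


def nfi(chi2_m, chi2_b):
    return (chi2_b - chi2_m) / chi2_b


def aic(k, lh):
    return 2.0 * (k - lh)


def bic(k, lh, n):
    return math.log(n) * k - 2.0 * lh


# Made-up values: target model chi2 = 30 on 20 df, baseline chi2 = 500 on 30 df.
indices = {
    "df": degrees_of_freedom(6, 13),
    "RMSEA": rmsea(30.0, 20, 201),
    "CFI": cfi(30.0, 20, 500.0, 30),
    "TLI": tli(30.0, 20, 500.0, 30),
    "NFI": nfi(30.0, 500.0),
    "AIC": aic(13, -100.0),
}
```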

2.6 Testing framework

In order to test the semopy package and compare its performance with that of lavaan, we implemented a versatile system to generate benchmark SEM models. Based on input parameters, the system randomly generates a skeleton of the structural and measurement parts of an SEM model, the values of all parameters and a dataset appropriate for the model and parameters. The set of the generator's parameters includes:

• n_obs: number of observed variables in the structural part;

• n_lat: number of latent variables in the structural part;

• n_cycles: minimal number of cycles in the structural part;

• n_manif = (l_manif, u_manif): range of possible numbers of manifest variables for a latent variable;

• p_manif: fraction of manifest variables to merge together;

• scale: all parameters sampled from the uniform distribution on $[-1.0,-0.1]\cup[0.1,1.0]$ are multiplied by this value;

• n_samples: number of samples in a dataset.

In order to generate a model, one should run the following:

from semopy.model_generator import generate_model
n_manif = (l_manif, u_manif)
model, params, data = generate_model(n_obs, n_lat, n_manif,
                                     p_manif, n_cycles, scale,
                                     n_samples)

The algorithm generating benchmarks consists of four steps. At the first step, we construct the structural part of an SEM model. For this purpose, we randomly generate a directed acyclic graph of (n_obs + n_lat) nodes and then add n_cycles extra directed edges between nodes to satisfy the minimal number of cycles in the structural part. Finally, n_lat nodes in the structural part are picked as latent. At the second step, we construct the measurement part, providing each latent variable with its own set of manifest variables, so that each latent variable has a random number of manifest variables from the range (l_manif, u_manif). To allow a fraction of manifest variables to measure several latent variables, we successively merge pairs of manifest variables from different latent variables until the resulting number of manifest variables reaches (1 - p_manif) of the initial number of manifest variables. At the third step, values of parameters in $\mathrm{B}$ and $\Lambda$ are sampled from the continuous uniform distribution on the symmetric support $[-1,-0.1]\cup[0.1,1]$ and scaled by the scale parameter. At the fourth step, we sequentially generate values for variables, starting from exogenous ones, moving along paths in the structural part and finishing with manifest variables. For each variable we add a random noise error from $\mathcal{N}(0,0.1)$. After the primary dataset is generated, we remove the latent variables from it.
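The first and third steps can be sketched in plain Python. The text does not specify how the random graph is sampled, so the construction below (a random node order, a chain linking consecutive nodes, extra forward edges added with a hypothetical probability `edge_prob`, and back edges that skip at least one node) is an assumption made for illustration and is not the code of `semopy.model_generator`.

```python
import random


def sample_parameter(scale, rng):
    """Draw from the uniform distribution on [-1, -0.1] U [0.1, 1], times scale."""
    magnitude = rng.uniform(0.1, 1.0)
    sign = rng.choice((-1.0, 1.0))
    return sign * magnitude * scale


def random_structural_part(n_nodes, n_cycles, edge_prob, rng):
    """Return (order, forward_edges, back_edges) for a random structural graph."""
    order = list(range(n_nodes))
    rng.shuffle(order)  # random topological order of the acyclic part
    forward = set()
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            # Consecutive nodes are always linked, so every earlier node
            # reaches every later one and each back edge closes a cycle.
            if j == i + 1 or rng.random() < edge_prob:
                forward.add((order[i], order[j]))
    # Back edges skip at least one node, giving cycles of length >= 3.
    candidates = [(order[j], order[i])
                  for i in range(n_nodes) for j in range(i + 2, n_nodes)]
    back = set(rng.sample(candidates, min(n_cycles, len(candidates))))
    return order, forward, back


rng = random.Random(0)
order, forward, back = random_structural_part(6, 2, 0.3, rng)
weights = {edge: sample_parameter(2.0, rng) for edge in forward | back}
```

Picking which n_lat nodes become latent, building the measurement part and simulating the data would follow the remaining steps described above.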

3 Results

To compare the semopy and lavaan packages, we tested them on 15 benchmark sets of SEM models generated by the model generator explained above. The sets consist of 1000 random models and differ in generator parameters (see Table 2).

Then, we ran semopy and lavaan on each model from those sets and compared the packages on the following measures:

• relative error between the estimated ($\hat{\theta}$) and exact ($\theta$) parameter values. As $\theta$ is a vector, we consider the mean relative error $\Delta(\theta,\hat{\theta})=\frac{1}{n}\sum_{i=1}^{n}\frac{|\hat{\theta}_{i}-\theta_{i}|}{|\theta_{i}|}$;

• the obtained value of the objective function $F(\hat{\theta})$;

• number of failed optimisation processes;

• execution time.

We consider an optimisation process failed (a package's estimates diverge from the true values) if any of the following criteria is met:

• any parameter has a "Not-a-Number" (NaN) value;

• the objective function returns NaN at the given point (the function with the estimated parameters attains an undefined value and/or violates internally defined boundaries);

• $\Delta(\theta,\hat{\theta})>0.3$.
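The accuracy measure and the failure criteria can be written down as a short sketch. The function names are illustrative, and the threshold of 0.3 is read here as an upper limit on the acceptable error (a run fails when the mean relative error exceeds it), which is an interpretation consistent with "estimates diverge from true values".

```python
import math


def mean_relative_error(theta, theta_hat):
    """Delta(theta, theta_hat): mean of |theta_hat_i - theta_i| / |theta_i|."""
    return sum(abs(est - true) / abs(true)
               for true, est in zip(theta, theta_hat)) / len(theta)


def optimisation_failed(theta, theta_hat, objective_value, threshold=0.3):
    """Apply the three failure criteria to one optimisation run."""
    if any(math.isnan(value) for value in theta_hat):
        return True   # a parameter estimate is NaN
    if math.isnan(objective_value):
        return True   # the objective function is undefined at the estimate
    return mean_relative_error(theta, theta_hat) > threshold


theta_true = [1.0, 2.0]
good_run = optimisation_failed(theta_true, [1.1, 1.8], 0.5)
nan_run = optimisation_failed(theta_true, [float("nan"), 2.0], 0.5)
far_run = optimisation_failed(theta_true, [2.0, 4.0], 0.5)
```

The division by $|\theta_{i}|$ is safe for generated models because the true parameters are sampled away from zero.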

15 sets of models that we examined in our tests are provided in Table 2. All sets of models and their respective data used in the following tests as well as the results are available at the repository.

Next, we provide the results of our tests.

3.1 Performance

We benchmarked both semopy and lavaan on a mixed set of models, using a computer with an AMD A10-4600M CPU, Manjaro (kernel 4.14) as the OS and OpenBLAS as the backend for numpy and scipy.

3.2 Optimisation methods

All optimisation methods available in the package were also tested against each other.

As can be seen from Table 3, SLSQP clearly outperforms the other methods; however, there are cases when SLSQP fails to estimate parameters correctly whereas other methods successfully find a solution. We therefore conclude that other optimisation methods should be tried if SLSQP fails.

Bibliography

[1] Peter M. Bentler. EQS structural equations program manual. BMDP Statistical Software, 1989.
[2] Kenneth A. Bollen. Structural Equations with Latent Variables. Wiley-Interscience, 1989.
[3] TIOBE Software BV. Tiobe index, 2019.
[4] Pierre Carbonnelle. Pypl popularity of programming language, 2019.
[5] Chris Dai. pypsy: psychometrics package, 2018.
[6] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, December 2014.
[7] Fritz Drasgow. Polychoric and Polyserial Correlations. American Cancer Society, 2006.
[8] John Fox. Teacher's corner: Structural equation modeling with the sem package in R. Structural Equation Modeling: A Multidisciplinary Journal, 13(3):465–486, 2006.
