TL;DR
semopy is a Python package for Structural Equation Modeling that offers faster execution and higher accuracy than existing tools like lavaan, facilitating integration into modern data analysis pipelines.
Contribution
The paper introduces semopy, a new open-source Python package for SEM, with a unique model generator and improved performance over existing R-based tools.
Findings
semopy outperforms lavaan in execution time
semopy achieves higher accuracy in SEM estimation
The package includes extensive usage examples and a model generator
Abstract
Structural equation modelling (SEM) is a multivariate statistical technique for estimating complex relationships between observed and latent variables. Although numerous SEM packages exist, each of them has limitations. Some packages are not free or open-source; the most popular package not having this disadvantage is , but it is written in R language, which is behind current mainstream tendencies that make it harder to be incorporated into developmental pipelines (i.e. bioinformatical ones). Thus we developed the Python package to satisfy those criteria. The paper provides detailed examples of package usage and explains it's inner clockworks. Moreover, we developed the unique generator of SEM models to extensively test SEM packages and demonstrated that significantly outperforms in execution time and accuracy.
| Fit index | Formula | Method |
|---|---|---|
| calc_chi2 | ||
| RMSEA | calc_rmsea | |
| GFI | calc_gfi | |
| AGFI | calc_agfi | |
| NFI | calc_nfi | |
| TLI | calc_tli | |
| CFI | calc_cfi | |
| AIC | calc_aic | |
| BIC | calc_bic |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
semopy: A Python package for Structural Equation Modeling
Georgy Meshcheryakov, Anna A. Igolkina
Abstract
Structural equation modelling (SEM) is a multivariate statistical technique for estimating complex relationships between observed and latent variables. Although numerous SEM packages exist, each of them has limitations. Some packages are not free or open-source; the most popular package not having this disadvantage is lavaan, but it is written in R language, which is behind current mainstream tendencies that make it harder to be incorporated into developmental pipelines (i.e. bioinformatical ones). Thus we developed the Python package semopy to satisfy those criteria. The paper provides detailed examples of package usage and explains it’s inner clockworks. Moreover, we developed the unique generator of SEM models to extensively test SEM packages and demonstrated that semopy significantly outperforms lavaan in execution time and accuracy.
1 Introduction
Structural Equation Modelling (SEM) can be defined as a diverse set of tools and approaches for describing and estimating causal relationships between variables, whether they be observable or latent. The very early beginnings of SEM models (Path analysis) were established in the first half of the 20th century by a geneticist and statistician Sewall Green Wright and deal with only observed variables [31]. Over the years, approaches for working with latent variables have been developed (e.g. factor analysis), and in modern SEM models, researchers can specify complex hybrids of path analysis models, confirmatory path analysis models (CFA) [19] and multivariate regression models. Being an umbrella term over the statistical approaches, SEM utilises particular statistical methods to estimate relationships between variables describing a variance-covariance structure of the data via model parameters [2].
The first SEM model is LISREL (linear structural relations) and contains two parts: (i) the structural part links latent variables to each other via a system of linear equations; (ii) the measurement part specifies linear influences of latent variables to observed variables [2]. Let be a vector of latent variables, be a vector of observed (manifest) variables, then the two parts of LISREL model are as follows:
[TABLE]
where and are matrices with linear parameters, and are independent error terms. Following the LISREL notation, complex SEM models are traditionally split into two parts: structural and measurement ones. The semopy package supports a general SEM model allowing the presence of observed variables in the structural part, as well as provides support for ordinal variables.
In this paper, we discuss the prerequisites for development of semopy package, explain the user-friendly syntax used for specifying an SEM model and provide quick start information on the usage of the semopy package. Then, we provide implementation details such as an underlying mathematical model, heuristics for choosing starting values, provide a list of objective functions and optimization techniques at user’s disposal. We also explain in brief statistics such as p-values and fit indices and introduce a testing framework that generates random sets of models and data. In the end, we compare semopy to lavaan [12] – the-state-of-the-art CRAN R package that implements SEM functionality.
1.1 Why do we need semopy?
Nowadays SEM is widely used in the fields of economics, psychology, sociology and bioinformatics [18, 28, 15] and there is a number of software working with SEM models. Most of them are either commercially distributed and non-open-sourced or do not cover the whole set of popular programming languages, e.g. Python.
For instance, LISREL[9], Mplus[25] and EQS [1] are proprietary and commercial softwares, hence, any adjustments to them to satisfy researchers needs are not possible. OpenMx[26], sem[8], lavaan are free and open-source popular CRAN packages, but they are all written in R. The only Python package is pypsy[5], but it is limited to basic SEM functionality and by the lack of documentation. Our reasons for developing the new SEM package, semopy, are as following:
- •
the need for SEM package which could be easily integrated into developmental and research pipelines in Python (especially into bioinformatic ones) [4, 3];
- •
the wish to outperform in execution time and accuracy the most cited open-source package, lavaan;
- •
the lack of a profound testing technique for new SEM methods and approaches.
1.2 Model syntax
To specify SEM models, The semopy uses the lavaan syntax, which is natural to describe regression models in R. The syntax supports three operator symbols characterising relationships between variables:
- •
to specify structural part,
- •
= to specify measurement part,
- •
\sim$$\sim to specify common variance between variables.
For example, let a linear equation in the structural part of SEM model take the form:
[TABLE]
where is a variable dependent on regressors and (Fig. 1 ), and are parameters, is an error term. In semopy syntax it can be rewritten as
eta3 x1 + x2
Likewise, to specify a measurement part, which relates manifest variables to latent variables, we use a special operator = which can be read as is measured by. The left side of the operator contains one latent variable, and the right contains its manifest variables separated by plus signs. For example, to define a latent variable by three indicators ( and ) (Fig. 1 ), the following expression should be used :
eta1 = y1 + y2 + y3
The third operator, \sim$$\sim, is used to specify a covariance (common variance) between a pair of variables from one part:
x1 \sim$$\sim x2
eta \sim$$\sim x3
Another option in the semopy syntax is fixing parameter values; the values should be placed as prefixes before variables:
eta 1*x1 + x2
eta = 2*y1 + y2 + y3
x1 \sim$$\sim 5*x2
We designed an example (Fig. 1) to demonstrate the diversity of relationships between variables that can be estimated in semopy. The example contains two exogenous latent variable , endogenous latent variables ( being also an output variable), exogenous observed variables , endogenous observed variables (the latter being an output variable) and a set of manifest variables (take a notice that and are shared between and respectively). There is also a cycle present in the model (). Moreover, we set additional parameters for covariances between [ and ] and between [ and ].
The description of the model in the semopy syntax is as follows:
# Structural part
eta3 x1 + x2
eta4 x3
x3 eta1 + eta2 + x1 + x4
x4 eta4
x5 x4
# Measurement part
eta1 = y1 + y2 + y3
eta2 = y3
eta3 = y4 + y5
eta4 = y4 + y6
# Additional covariances
eta2 \sim$$\sim x2
y5 \sim$$\sim y6
In the semopy, we introduced an additional operator, that specifies a statistical type of variables. The ”is” word is reserved as an operator symbol for type specification:
y1, y2 is ordinal
The first line sets to ordinal type.
1.3 Quickstart
The semopy package is available at PyPi software repository and can be installed by:
pip install semopy
The pipeline for working with SEM models in semopy consists of three steps: (i) specifying a model, (ii) loading a dataset to the model, (iii) estimating parameters of the model. Two main objects required for scpecifying and estimating an SEM model are Model and Optimizer.
Model is responsible for setting up a model from the proposed SEM syntax:
# The first step
from semopy import Model
mod = """␣x1␣␣x2␣+␣x3
␣␣␣␣␣␣␣␣␣␣x3␣␣x2␣+␣eta1
␣␣␣␣␣␣␣␣␣␣eta1␣=␣y1␣+␣y2␣+␣y3
␣␣␣␣␣␣␣␣␣␣eta1␣␣x1
␣␣␣␣␣␣"""
model = Model(mod)
Then a dataset should be provided; at this step the initial values of parameters are calculated:
# The second step
from pandas import read_csv
data = read_csv("my_data_file.csv", index_col=0)
model.load_dataset(data)
To estimate parameters of the model an Optimizer object should be initialised and estimation executed:
# The third step
from semopy import Optimizer
opt = Optimizer(model)
objective_function_value = opt.optimize()
The default objective function for estimating parameters is the likelihood function and the optimisation method is SLSQP (Sequential Least-Squares Quadratic Programming). However, the semopy supports a wide range of other objective functions and optimisation schemes being specified as parameters in the optimize method (see Section 2.4).
When optimization process is finished, user can check the parameters’ estimates and p-values by using inspect method:
from semopy import inspect
inspect(opt)
The important feature of the semopy is that one can run multiple optimization sessions in a row with preservation of previous parameters’ estimates. For instance, to use unweighted least squares estimates as a starting point for a maximum likelihood estimation, one can run:
model = Model(mod)
model.load_dataset(data)
opt = Optimizer(model)
opt.optimize(objective=’ULS’)
opt.optimize(objective=’MLW’)
2 Materials and methods
In this section, we explain in detail the underlying calculations in the semopy. We denote latent variables with , observed variables participating in relationships with latent variables with and manifest variables with ; let numbers of these variables be , and , respectively. We also assumed that all variables are normally distributed with zero means. Each group of variables, and , can be split into two categories: (1) exogenous and (2) endogenous; let denote them and , and , respectively. Let the number of variables in the obtained four groups are and , and , so that , .
2.1 Model
As usual for SEM models, we assume it consisted of two parts: structural and measurement ones. We consider the following generalisation of Eq. 1:
[TABLE]
where is a factor loading matrix with linear parameters. We introduce the following change of variables:
[TABLE]
and the Eq. 2 can be rewritten as:
[TABLE]
where and are matrices of parameters, and are vectors of error terms, which are assumed to be independent and normally distributed with zero means and covariances and , respectively.
Then we can infer covariance matrix of as a function of model parameters:
[TABLE]
Parameters in matrix
The size of matrix corresponds to the size of vector and equals to . The block representation of this matrix is not necessary, and we set the initial values of all parameters in matrix as zeroes.
Parameters in matrix
The size of matrix matches to the sizes of and vectors and equals to ; the block representation of is:
[TABLE]
is an adjacency matrix for variables in a measurement part (Eq. 2).
To define the scale of latent variables, we automatically fix one parameter in each column of to 1. To be specific, in a column this parameter is the factor loading for latent variable and it’s a manifest variable which is the first in the alphabet order of all manifest variable for ; we called this variable as the first indicator. Initial values of the remaining factor loading for the are linear regression coefficients between manifest variables and the first indicator being as a single regressor.
Parameters in matrix
The matrix is a square covariance matrix for the error term (Eq. 3) and it’s size is or, in other terms, . This matrix is symmetric and can be presented in the block form:
[TABLE]
where blocks reflect covariance matrices of variables mentioned in indexes. By default, we assume that the block is fixed and equals to sample covariance matrix for variables (exogenous observed variables). For (exogenous latent variables), we assumed the symmetric covariance matrix fully parametrised. Preventing covariances between and by default, we set as zero matrix. We also consider the covariance matrices between endogenous and exogenous variables as zero matrices. In the remaining matrices for covariances between endogenous variables (latent and observed) – , , – we set parameters in positions of variances and in positions of covariances between variables, which do not play a role of regressors in any equation of the structural part.
Parameters in Theta matrix
The matrix is symmetric square 4-block matrix of size having only one non-zero block ():
[TABLE]
Dy default we initialise as a diagonal matrix, however, the semopy syntax allows to parametrise it’s off-diagonal elements. We set the starting values for a diagonal element as a half of a sample variance for the manifest variable .
Example
For the model on Fig. 1, positions of parameters in matrices are presented on Fig. 2.
2.2 Loading the data
The semopy supports SEM models that contain not only continuous variables (assumed as normally distributed) but also discrete ones (assumed as ordinal). By default, all variables are assumed to be normally distributed. However, it is possible to manually specify the type of variables or to allow the automatic recognition of types.
# Providing a list of ordinal variables
variables = {’y1’, ’y2’}
model.load_dataset(data, ordcor=variables)
# Load data with automatically recognition of types
model.load_dataset(data, ordcor=True)
# Check set of ordinal variables
print(model.vars[’Categorical’])
Presence of ordinal variables makes the sample covariance matrix ”heterogenous”, i.e. containing polychoric (in-between ordinal) and polyserial (between ordinal and continuous variables) correlations [7] .
2.3 Objective functions
The semopy allows to chose one of three objective (loss) functions for parameters’ estimation. Two of them are based on a least-squares approach, and the remaining one (the default) represents the maximum likelihood approach. It should be noticed that all objective functions reflect (in different ways) the distance between the sample covariance matrix and the model covariance matrix for observed variables . In this section, we present both objective functions and their two first derivatives utilised in optimisation methods.
The objective function can be set as an argument in optimize method of Optimizer class:
opt.optimize(objective=’ULS’) # Unweighted least squares
opt.optimize(objective=’GLS’) # General least squares
opt.optimize(objective=’MLW’) # Wishart likelihood function
Unweighted Least Squares (ULS)
The ULS objective function can be written as follows:
[TABLE]
where is a set of all parameters in an SEM model. To accelerate some optimisation methods, we inferred formulas for components in the gradient and the Hessian for .
General Least Squares (GLS)
In contrast to ULS loss function, GLS approach considers the following loss function:
[TABLE]
After our inference, formulas for components in the gradient and the Hessian are available as well.
Wishart Maximum Likelihood (MLW)
The MLW objective function is based on the assumption that the observed variables follow the multivariate normal distribution, therefore, the sample covariance matrix of follows the Wishart distribution:
[TABLE]
where is number of degrees of freedom (sample size), is number of parameters (). In this case, the log likelihood ratio – a natural logarithm of ratio of likelihood for any given model to likelihood with a perfectly fitting model () – is following:
[TABLE]
Also, we inferred the analytical gradient and Hessian of .
2.4 Optimisation methods
To minimise an objective function, semopy has a variety of nonlinear solvers. Optimisation method can be selected through method argument to optimize function in Optimizer class, for example,
opt.optimize(method="SLSQP")
The full list of available optimisation methods is the following:
- •
SLSQP – SLSQP (Sequential Least-Squares Quadratic Programming) method [21, 22] from scipy;
- •
L-BFGS-B – L-BFGS-B (Broyden — Fletcher — Goldfarb — Shanno, limited memory) method [27, 32] from scipy;
- •
Portmin – FORTRAN PORT optimization library [11] wrapped with Python portmin wrapper. It incorporates SMSNO, SUMSL, HUMSL routines for cases when no analytical gradient is available, when analytical gradient is available and when both analytical gradient and hessian are available respectively;
- •
Adam – Stochastic optimisation method Adam[6, 30], our implementation;
- •
Nesterov –Stochastic Nesterov Accelerated Gradient method [30], our implementation;
- •
SGD – Stochastic Gradient Descent method [30], our implementation.
2.5 Statistics
The semopy provides methods to calculate important statistics: p-values for parameter estimates and various measurements of fit.
P-values for parameter estimates
The semopy utilises the Z-test to calculate p-values for parameter estimates under the assumption that parameters are normality distributed; : value of a parameter is equal to zero. The approach considers z-score:
[TABLE]
where is a vector of parameter estimates, is the standard error of estimates, which is proportional to variance: .
Based on the Cramér–Rao bound,
[TABLE]
where is a Fisher information matrix (FIM). The matrix can be defined as an observed or expected FIM. The observed FIM is the Hessian of a likelihood function at , .
P-values for parameter estimates are based on the Z-test expectation, that follows the multivariate normal distribution. By default, expected FIM is used, but a user may want to estimate observed FIM instead:
from semopy.stats import calculate_p_values
pvals = calculate_p_values(opt, information=’observed’)
Fit indices
The semopy supports numerous fit indices: , RMSEA, CFI, TLI, NFI GFI, AGFI [20, 17]. Some fit indices compare fitted and baseline (null or independence) models. In semopy, baseline model is a model with all regression coefficients, loading factors, covariances, variances of latent variables set to zero. Several methods use the degrees of freedom () metric, , where is a number of observed variables and is a number of parameters. All methods that estimate them require an Optimizer instance as an argument.
A fit index can be calculated by invoking a particular method with an Optimizer instance passed as an argument. For example, to calculate AIC or BIC one should invoke calculate_aic or calculate_bic method respectively. They automatically estimate as a Wishart likelihood by default. However, one might supply extra parameter lh as a value of :
from semopy import calc_aic, calc_likelihood
# Calculate multivariate normal likelihood of model fitted by opt
l = calc_likelihood(opt, dist=’normal’)
aic = calc_aic(opt, lh=l)
Alternatively, one might gather all statistics explained above by a single call to gather_statistics method:
from semopy import gather_statistics
stats = gather_statistics(opt)
2.6 Testing framework
In order to test the semopy package and compare it’s performance with the lavaan, we implemented a versatile system to generate benchmark SEM models. Based on input parameters, the system randomly generates a skeleton of structural and measurement parts of an SEM model, values of all parameters and the dataset appropriate for the model and parameters. The set of generator’s parameters includes:
- •
n_obs , total number of variables in the structural part;
- •
n_lat, number of latent variables in the structural part;
- •
n_cycle, minimal number of cycles in the structural part;
- •
n_manif/(l_manif, u_manif), range of possible numbers of manifest variables for a latent variable;
- •
p_manif, fraction of manifest variables to merge together.
- •
scale all parameters sampled from uniform distribution on domain are multiplied by this value.
- •
n_samples number of samples in a dataset.
In order to generate a model, one should run the following:
from semopy.model_generator import generate_model
n_manif = (l_manif, u_manif)
model, params, data = generate_model(n_obs, n_lat, n_manif,
p_manif, n_cycles, scale,
n_samples)
The algorithm generating benchmarks consists of four steps. At the first step, we construct a structural part of an SEM model. For this purpose, we randomly generate a directed acyclic graph of (n_obs + n_lat) nodes and then we add n_cycles extra directed edges between nodes to satisfy the minimal number of circles in the structural part. At last, n_lat nodes in the structural parts are picked as latent. At the second step, we construct the measurement part providing each latent variable with it’s own set of manifest variables, so that each latent variable has the random amount of manifest variables from the range (l_manif, u_manif). To allow a fraction of manifest variables measuring several latent variables, we consequently merge pairs of manifest variables from different latent variables until the resultant number of manifest variables reaches (1-f_manif) of the initial number of manifest variables. At the third step, values of parameters in and are uniformly sampled from the continuous uniform distribution on the symmetric support scaled by scale parameter. At the fourth step, we consistently generate values for variables starting from exogenous ones, moving along paths in the structural part, finishing with manifest variables. For each variable we add a random noise error from . After a primary dataset is generated, we removed the latent variables form it.
3 Results
To compare semopy and lavaan packages to each other we tested them on 15 benchmark sets of SEM models generated by the model generator explained above. Sets consist of 1000 random models and different parameters (see Table 2).
Then, we ran semopy and lavaan on each model from those sets and compared packages on the following measures of accuracy:
- •
relative error between the estimated () and exact parameter values (). As is a vector, we consider a mean relative error ;
- •
the obtained value of the objective functions ;
- •
number of failed optimisation processes.
- •
execution time
We consider an optimisation process failed (a package’s estimates diverge from true values) if any of the criteria are met:
- •
any parameter has a ”Not-a-Number” (NaN) value;
- •
the objective function returns NaN at given point (function with estimated parameters attains undefined value and/or violates innerly-defined boundaries).
- •
15 sets of models that we examined in our tests are provided in Table 2. All sets of models and their respective data used in the following tests as well as the results are available at the repository.
Next, we provide the results of our tests.
3.1 Performance
We’ve benchmarked both semopy and lavaan on a mixed set of models. We were using a computer with AMD A10-4600M CPU, OS Manjaro 4.14 and OpenBLAS as a backend for numpy and scipy.
3.2 Optimisation methods
All optimisation methods available in the package were also tested against each other.
As it can be seen from Table 3, SLSQP clearly outperforms other methods, however, it has to be noted that there are cases when SLSQP fails to correctly estimate parameters whereas other methods successfully find a solution. Thereof we conclude that other optimisation methods should be given a shot in case of SLSQP’s failure.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Peter M. Bentler. EQS structural equations program manual. BMDP Statistical Software, 1989.
- 2[2] Kenneth A. Bollen. Structural Equations with Latent Variables . Wiley-Interscience, 1989.
- 3[3] TIOBE Software BV. Tiobe index, 2019.
- 4[4] Pierre Carbonnelle. Pypl popularity of programming language, 2019.
- 5[5] Chris Dai. pypsy: psychometrics package, 2018.
- 6[6] Jimmy Ba Diederik P. Kingma. Adam: A method for stochastic optimization, December 2014.
- 7[7] Fritz Drasgow. Polychoric and Polyserial Correlations . American Cancer Society, 2006.
- 8[8] John Fox. Teacher’s corner: Structural equation modeling with the sem package in r. Structural Equation Modeling: A Multidisciplinary Journal , 13(3):465–486, 2006.
