sleev: An R Package for Semiparametric Likelihood Estimation with Errors in Variables

Jiangmei Xiong; Sarah C. Lotspeich; Joey B. Sherrill; Gustavo Amorim; Bryan E. Shepherd; Ran Tao

PMC · DOI: 10.21105/joss.07320 · February 21, 2026


TL;DR

The sleev R package provides a user-friendly tool for analyzing error-prone biomedical data using a robust statistical method.

Contribution

The sleev package introduces a computationally efficient and accessible implementation of the sieve maximum likelihood estimator for two-phase studies.

Findings

1. The package supports semiparametric likelihood-based inference for error-prone data with binary or continuous outcomes.

2. It enables analysis of data with error-prone covariates and responses using validated subsamples.

3. The method is efficient and robust for biomedical research using routinely collected data.

Abstract

Data with measurement error in the outcome, covariates, or both are not uncommon, particularly with the increased use of routinely collected data for biomedical research. With error-prone data, often only a subsample of study data is validated; such settings are known as two-phase studies. The sieve maximum likelihood estimator (SMLE), which combines the error-prone data on all records with the validated data on a subsample, is a highly efficient and robust method to analyze such data. However, because SMLEs are complex to compute, a computationally efficient and user-friendly tool is needed to obtain them. The R package sleev fills this gap by providing semiparametric likelihood-based inference using the SMLEs for error-prone two-phase data in settings with binary and continuous outcomes. Functions from this package can be used to analyze data with error-prone binary or continuous responses and…



Statement of Need

Routinely collected data, such as electronic health records, are frequently used in biomedical research. However, these data tend to be error-prone, and using them without correcting for their errors could lead to biased estimates and misleading research findings (Duan et al., 2016). To avoid such invalid study results, trained experts carefully verify and extract data elements. However, it is usually only feasible to validate data for a subset of records or variables. After validation, researchers have error-prone, pre-validation data for all records (phase one) and error-free validated data on a subset of records (phase two). Analyses aim to combine the two types of data to obtain estimates that have low bias and are as robust and efficient as possible.

There are several packages for R (R Core Team, 2024) that address measurement error, including augSIMEX (Zhang & Yi, 2019), attenuation (Moss, 2019), decon (Wang & Wang, 2011), eivtools (Lockwood, 2018), GLSME (Hansen & Bartoszek, 2012), mecor (Nab et al., 2021), meerva (Kremers, 2021), mmc (Song, 2015), refitME (Stoklosa et al., 2021), and simex (Lederer & Seibold, 2019). The various R packages reflect many different approaches, such as regression calibration (Wang & Wang, 2011), SIMEX (i.e., simulation-extrapolation) (Lederer & Seibold, 2019), and moment-based corrections (Nab et al., 2021), to mention a few. Nearly all of these existing R packages deal with errors in either the outcome or covariates, but not both, and none of these packages permits efficient inference that incorporates both the error-prone phase-one data and the validated phase-two data.

The sieve maximum likelihood estimator (SMLE) analyzes two-phase data by combining the error-prone data on all records with the validated data on a subsample. By leveraging all available data, the SMLE achieves high efficiency (Lotspeich et al., 2022; Tao et al., 2021). Because it makes no parametric assumptions about the error model, the SMLE is also robust. For example, Tao et al. (2021) performed a set of simulations highlighting the SMLE's robustness to different error mechanisms, including settings where the errors had non-zero mean or were multiplicative. Moreover, the SMLE allows an error-prone outcome and error-prone covariates in the same model. Still, in practice these estimators can be difficult to implement, as they involve approximating nuisance conditional densities using B-splines (Schumaker, 2007) and then maximizing the semiparametric likelihood via a sophisticated EM algorithm (Tao et al., 2017). Here, we present the R package sleev, which makes the SMLE readily accessible to practitioners. sleev integrates and extends two rudimentary R packages, logreg2ph and TwoPhaseReg, developed with the original methods papers (Lotspeich et al., 2022; Tao et al., 2021). These two packages lacked proper documentation and were difficult to use, and logreg2ph was also computationally slow.

To promote the use of the SMLE, extensive work has been done to create sleev, a computationally efficient and user-friendly R package to analyze two-phase, error-prone data. Specifically, in sleev we rewrote the core algorithms of logreg2ph in C++ to speed up the computation, and we unified the syntax across functions. To compare the computational times, we set up simulations using the same code as in the package vignette. The simulations included phase-one and phase-two sample sizes of 2087 and 835, respectively, and were performed on a 64-bit Linux machine with 8 GB of memory. Across 100 simulations, the previous logreg2ph took an average of 289.44 seconds (standard deviation 8.83 seconds) to perform the analysis, while the corresponding new function in sleev took an average of only 122.32 seconds (standard deviation 8.18 seconds), making it roughly 2.4 times faster.

SMLE for Linear Regression

In this section, we briefly introduce the SMLE for linear regression. Suppose that we want to fit a standard linear regression model for a continuous outcome $Y$ and covariates $X$, where $[eqn]$. Our goal is to obtain estimates of $[eqn]$. When we have error-prone data, $Y$ and X are unobserved except for a subset of validated records. For unvalidated records (the majority), only the error-prone outcome $Y^*$ and covariates $X^*$ are observed in place of $Y$ and X, where W and U are the errors for the outcome and covariates, respectively. We assume that W and U are independent of $[eqn]$. With potential errors in our data, a naive regression analysis using the error-prone variables Y* and X* could render misleading results (Fuller, 2009).
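The risk of a naive analysis can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes classical additive errors, and the sample sizes, coefficients, and error variances are hypothetical values chosen only for illustration. It uses base R only.

```r
set.seed(2024)

n     <- 2000   # phase-one sample size (hypothetical)
n_val <- 500    # phase-two (validated) sample size (hypothetical)

x <- rnorm(n)                   # true covariate X
y <- 1 + 0.5 * x + rnorm(n)     # true outcome Y; true slope is 0.5
u <- rnorm(n, sd = 1)           # covariate error U (assumed additive)
w <- rnorm(n, sd = 0.5)         # outcome error W (assumed additive)

dat <- data.frame(
  y_star = y + w,               # error-prone outcome Y*, observed on everyone
  x_star = x + u,               # error-prone covariate X*, observed on everyone
  y      = y,
  x      = x,
  v      = seq_len(n) %in% sample(n, n_val)  # validation indicator V
)

# Unvalidated records have no error-free measurements
dat$y[!dat$v] <- NA
dat$x[!dat$v] <- NA

# Naive analysis: error-prone variables on all records
naive_slope <- coef(lm(y_star ~ x_star, data = dat))[["x_star"]]

# Validated-only analysis: error-free variables on the subsample
# (lm() drops rows with NA under the default na.action)
valid_slope <- coef(lm(y ~ x, data = dat))[["x"]]

c(naive = naive_slope, validated_only = valid_slope)
```

Under these assumed settings the covariate error has the same variance as the covariate, so the naive slope is attenuated toward roughly half of the true value, while the validated-only slope is approximately unbiased but uses only a quarter of the records. Methods such as the SMLE aim to recover the efficiency lost by the validated-only analysis.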

We assume that the joint density of the complete data $[eqn]$ takes the form

[eqn]

where $[eqn]$ and $[eqn]$ denote density and conditional density functions, respectively. Specifically, $[eqn]$ then refers to the conditional density function of the linear regression model of $[eqn]$ given X. Denote the validation indicator variable by V, with V = 1 indicating that a record was validated and V = 0 otherwise. For records with V = 0, their measurement errors (W, U) are missing, and therefore their contributions to the log-likelihood can be obtained by integrating out W and U.
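For readers working from the prose alone, the setup described so far can be written out explicitly. The block below is a sketch of one standard formulation, not a transcription of the paper's equations: it assumes classical additive errors, and the symbols $\alpha$, $\boldsymbol{\beta}$, $\epsilon$, and $\boldsymbol{\theta}$ are introduced here for illustration and may differ from the authors' notation.

```latex
\begin{align*}
Y &= \alpha + \boldsymbol{\beta}^{\top}\mathbf{X} + \epsilon, \\
Y^{*} &= Y + W, \qquad \mathbf{X}^{*} = \mathbf{X} + \mathbf{U}, \\
P(Y^{*}, \mathbf{X}^{*}, W, \mathbf{U})
  &= P(Y^{*} \mid \mathbf{X}^{*}, W, \mathbf{U})\,
     P(W, \mathbf{U} \mid \mathbf{X}^{*})\, P(\mathbf{X}^{*}) \\
  &= P_{\boldsymbol{\theta}}(Y \mid \mathbf{X})\,
     P(W, \mathbf{U} \mid \mathbf{X}^{*})\, P(\mathbf{X}^{*}).
\end{align*}
```

The last equality uses the assumption that the errors are independent of the regression noise: once $(\mathbf{X}^{*}, W, \mathbf{U})$ are given, $\mathbf{X} = \mathbf{X}^{*} - \mathbf{U}$ and $Y = Y^{*} - W$ are determined, so the first factor reduces to the regression density evaluated at the error-free values. No parametric form is imposed on $P(W, \mathbf{U} \mid \mathbf{X}^{*})$.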

Let $[eqn]$ for $[eqn]$ denote independent and identically distributed realizations of $[eqn]$ in a sample of n subjects. Then, the observed-data log-likelihood is proportional to

[eqn]

where $[eqn]$ is left out, because the error-prone covariates are fully observed and thus $[eqn]$ can simply be estimated empirically. We estimate the unknown measurement error model, $[eqn]$, using B-spline sieves. Specifically, we approximate $[eqn]$ and log $[eqn]$ by $[eqn]$ and $[eqn]$, respectively. Here, $[eqn]$ are the m distinct observed (W, U) values from the validation study, $[eqn]$ is the $[eqn]$ B-spline basis function of order q evaluated at $[eqn]$, $s_n$ is the dimension of the B-spline basis, and $p_{kj}$ is the coefficient associated with $[eqn]$ and $[eqn]$. The expression (1) is now approximated by

[eqn]

The maximization of expression (2) is carried out through an EM algorithm to find the SMLEs $[eqn]$ and $[eqn]$ . The covariance matrix of the SMLE $[eqn]$ is obtained through the method of profile likelihood (Murphy & Van der Vaart, 2000).
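The two likelihood expressions referred to above can be sketched as follows. This is a reconstruction under stated assumptions (additive errors, and the symbols $\ell_n$, $\tilde{\ell}_n$, $\boldsymbol{\theta}$, and $P_{\boldsymbol{\theta}}$ chosen here for illustration); it follows the verbal description in this section and should be checked against the original paper before being relied on.

```latex
\begin{align*}
\ell_n(\boldsymbol{\theta}, P)
 &= \sum_{i=1}^{n} V_i \left\{ \log P_{\boldsymbol{\theta}}(Y_i \mid \mathbf{X}_i)
      + \log P(W_i, \mathbf{U}_i \mid \mathbf{X}_i^{*}) \right\} \\
 &\quad + \sum_{i=1}^{n} (1 - V_i) \log \left\{ \iint
      P_{\boldsymbol{\theta}}(Y_i^{*} - w \mid \mathbf{X}_i^{*} - \mathbf{u})\,
      P(w, \mathbf{u} \mid \mathbf{X}_i^{*})\, dw\, d\mathbf{u} \right\}, \tag{1} \\[6pt]
P(w, \mathbf{u} \mid \mathbf{X}_i^{*})
 &\approx \sum_{k=1}^{m} \sum_{j=1}^{s_n}
      I(w = w_k, \mathbf{u} = \mathbf{u}_k)\, B_j^{q}(\mathbf{X}_i^{*})\, p_{kj}, \\
\log P(W_i, \mathbf{U}_i \mid \mathbf{X}_i^{*})
 &\approx \sum_{k=1}^{m} \sum_{j=1}^{s_n}
      I(W_i = w_k, \mathbf{U}_i = \mathbf{u}_k)\, B_j^{q}(\mathbf{X}_i^{*}) \log p_{kj}, \\[6pt]
\tilde{\ell}_n(\boldsymbol{\theta}, \{p_{kj}\})
 &= \sum_{i=1}^{n} V_i \left\{ \log P_{\boldsymbol{\theta}}(Y_i \mid \mathbf{X}_i)
      + \sum_{k=1}^{m} \sum_{j=1}^{s_n} I(W_i = w_k, \mathbf{U}_i = \mathbf{u}_k)\,
        B_j^{q}(\mathbf{X}_i^{*}) \log p_{kj} \right\} \\
 &\quad + \sum_{i=1}^{n} (1 - V_i) \log \left\{ \sum_{k=1}^{m}
      P_{\boldsymbol{\theta}}(Y_i^{*} - w_k \mid \mathbf{X}_i^{*} - \mathbf{u}_k)
      \sum_{j=1}^{s_n} B_j^{q}(\mathbf{X}_i^{*})\, p_{kj} \right\}. \tag{2}
\end{align*}
```

In this sketch the double integral in (1) becomes a finite sum in (2) because the sieve places mass only on the $m$ observed error values $(w_k, \mathbf{u}_k)$, which is what makes an EM algorithm over $\boldsymbol{\theta}$ and the coefficients $p_{kj}$ tractable.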

The SMLEs for logistic regression are constructed similarly to those for linear regression and are described in the package vignette; their theoretical properties can be found in Lotspeich et al. (2022).

Functionalities of the sleev R Package

The sleev package provides a user-friendly way to obtain the SMLEs and their standard errors. The package can be installed from CRAN or GitHub. The sleev package includes two main functions: linear2ph() and logistic2ph(), to fit linear and logistic regressions, respectively, under two-phase sampling with an error-prone outcome and covariates. The input arguments are similar for the two functions and listed in Table 1. In addition to the arguments for error-prone and error-free outcome and covariates, the user needs to specify the B-spline matrix $[eqn]$ to be used in the estimation of the error densities.

Example: Case study with mock data

For demonstration, the sleev package includes a dataset constructed to mimic data from the Vanderbilt Comprehensive Care Clinic (VCCC) patient records from Giganti et al. (2020). Table 2 describes the variables in this dataset.

We now illustrate how to obtain the SMLEs using the sleev package with the mock.vccc dataset. Specifically, we show how to fit a linear regression model in the presence of errors in both the outcome and covariates using the linear2ph() function. Situations with more covariates and examples with logistic regression are included in the package vignette.

This example fits a linear regression model with CD4 count at antiretroviral therapy (ART) initiation regressed on viral load (VL) at ART initiation, adjusting for sex at birth. Both CD4 and VL are error-prone, partially validated variables, whereas sex is error-free. Because of skewness, we often transform both CD4 and VL. In our analysis, CD4 was divided by 10 and square root transformed, and VL was log10 transformed:

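library("sleev")
data("mock.vccc")
# Square root of CD4 / 10, for the validated and unvalidated versions
mock.vccc$CD4_val_sq10 <- sqrt(mock.vccc$CD4_val / 10)
mock.vccc$CD4_unval_sq10 <- sqrt(mock.vccc$CD4_unval / 10)
# log10 of viral load, for the validated and unvalidated versions
mock.vccc$VL_val_l10 <- log10(mock.vccc$VL_val)
mock.vccc$VL_unval_l10 <- log10(mock.vccc$VL_unval)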

To obtain the SMLEs, we first need to set up the B-spline basis for the error-prone covariate VL_unval_l10 (the transformed VL variable from phase one) and Sex. The spline2ph() function in the sleev package sets up the B-spline basis and combines it with the input data for the final analysis. Here, we use a cubic B-spline basis, specified through the degree = 3 argument. The size of the basis, s_n, is set to 20 through the size = 20 argument. More details regarding order and size selection, as well as run-time comparisons across B-spline bases, are discussed in the package vignette. To allow possible heterogeneity in the error distribution between males and females, we can set up the B-spline basis separately and proportionally for the two Sex groups by specifying the argument group = "Sex". The described B-spline basis is constructed as follows.

sn <- 20
data.linear <- spline2ph(x = "VL_unval_l10", data = mock.vccc, size = sn, degree = 3, group = "Sex")

Alternatively, if the investigator has prior knowledge that the errors in VL_unval_l10 are likely to be homogeneous, one may fit a simpler model by not stratifying the B-spline basis by Sex.
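To make the difference between a pooled and a stratified basis concrete, the sketch below builds both with splines::bs() from the splines package that ships with R. It is an illustration of the general idea only and is not sleev's implementation: the simulated data, the proportional allocation rule, and the zero-filled block layout are assumptions made here, and spline2ph() may construct its basis differently.

```r
library(splines)  # ships with R

set.seed(1)
n <- 200
x <- rnorm(n)                          # stand-in for an error-prone covariate
g <- rep(c(0, 1), times = c(140, 60))  # stand-in for a binary grouping variable
size   <- 20                           # total number of basis functions (s_n)
degree <- 3                            # cubic B-splines

# Pooled basis: one set of `size` cubic B-splines over all of x
B_all <- bs(x, df = size, degree = degree, intercept = TRUE)

# Stratified basis: split `size` in proportion to group frequencies,
# build a basis within each group, and leave the other group's columns at zero
size_g1 <- round(size * mean(g == 1))  # 6 columns for group 1
size_g0 <- size - size_g1              # 14 columns for group 0
B_strat <- matrix(0, nrow = n, ncol = size)
B_strat[g == 0, seq_len(size_g0)] <-
  bs(x[g == 0], df = size_g0, degree = degree, intercept = TRUE)
B_strat[g == 1, size_g0 + seq_len(size_g1)] <-
  bs(x[g == 1], df = size_g1, degree = degree, intercept = TRUE)

dim(B_all)
dim(B_strat)
range(rowSums(B_strat))  # each row of a B-spline basis with intercept sums to 1
```

Both matrices have the same number of columns, but the stratified one lets the estimated error distribution vary with the covariate differently in each group, at the cost of fewer basis functions per group.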

Having constructed the B-spline basis, we can obtain the SMLEs by running the linear2ph() function on data.linear, as shown in the code below. Again, the inputs are explained in Table 1. The fitted SMLEs are stored in a list object of class linear2ph. Here, we assign the fitted SMLEs to the variable name res_linear. The list of class linear2ph contains five components: coefficient, covariance, sigma, converge, and converge_cov.

res_linear <- linear2ph(y_unval = "CD4_unval_sq10", y = "CD4_val_sq10",
                        x_unval = "VL_unval_l10", x = "VL_val_l10", z = "Sex",
                        data = data.linear, hn_scale = 1, se = TRUE,
                        tol = 1e-04, max_iter = 1000, verbose = FALSE)

We should first check whether the EM algorithms for estimating the regression coefficients and their covariance matrix converged. This can be done by printing the linear2ph object directly, which invokes its print() method.

res_linear

Call:
linear2ph(y_unval = "CD4_unval_sq10", y = "CD4_val_sq10", x_unval = "VL_unval_l10", x = "VL_val_l10", z = "Sex", data = data.linear, hn_scale = 1, se = TRUE, tol = 1e-04, max_iter = 1000, verbose = FALSE)

The parameter estimation has converged.

Coefficients:
 Intercept VL_val_l10        Sex
 4.8209166 -0.1413168  0.2727984

The summary() function for the object of class linear2ph returns the estimated coefficients, their standard errors, test statistics, and p-values as follows:

summary(res_linear)

Call:
linear2ph(y_unval = "CD4_unval_sq10", y = "CD4_val_sq10", x_unval = "VL_unval_l10", x = "VL_val_l10", z = "Sex", data = data.linear, hn_scale = 1, se = TRUE, tol = 1e-04, max_iter = 1000, verbose = FALSE)

Coefficients:
             Estimate         SE Statistic      p-value
Intercept   4.8209166 0.15865204 30.386729 0.0000000000
VL_val_l10 -0.1413168 0.03983406 -3.547636 0.0003887047
Sex         0.2727984 0.10888178  2.505455 0.0122294098
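The Statistic and p-value columns can be reproduced from the estimates and standard errors alone. The sketch below assumes they are Wald z-statistics with two-sided p-values from the standard normal distribution; this is inferred from the printed numbers rather than from the package source. The 95% confidence intervals are an addition for illustration and are not part of the output shown.

```r
# Estimates and standard errors copied from the summary() output
est <- c(Intercept = 4.8209166, VL_val_l10 = -0.1413168, Sex = 0.2727984)
se  <- c(Intercept = 0.15865204, VL_val_l10 = 0.03983406, Sex = 0.10888178)

z <- est / se             # Wald statistic
p <- 2 * pnorm(-abs(z))   # two-sided p-value, standard normal reference

# 95% Wald confidence intervals (not shown in the package output)
ci <- cbind(lower = est - qnorm(0.975) * se,
            upper = est + qnorm(0.975) * se)

round(cbind(Estimate = est, SE = se, Statistic = z, `p-value` = p, ci), 4)
```

On the transformed scales used here, the interval for VL_val_l10 lies entirely below zero, consistent with higher viral load being associated with lower CD4 count at ART initiation.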

Bibliography

1. Duan R, Cao M, Wu Y, Huang J, Denny JC, Xu H, & Chen Y (2016). An empirical study for impacts of measurement errors on EHR based association studies. AMIA Annual Symposium Proceedings, 2016, 1764. PMID: 28269935. PMCID: PMC5333313.
2. Fuller WA (2009). Measurement error models. John Wiley & Sons. DOI: 10.1002/9780470316665.
3. Giganti MJ, Shaw PA, Chen G, Bebawy SS, Turner MM, Sterling TR, & Shepherd BE (2020). Accounting for dependent errors in predictors and time-to-event outcomes using electronic health records, validation samples, and multiple imputation. The Annals of Applied Statistics, 14(2), 1045. DOI: 10.1214/20-aoas1343. PMID: 32999698. PMCID: PMC7523695.
4. Hansen TF, & Bartoszek K (2012). Interpreting the evolutionary regression: The interplay between observational and biological errors in phylogenetic comparative studies. Systematic Biology, 61(3), 413–425. DOI: 10.1093/sysbio/syr122. PMID: 22213708.
5. Kremers WK (2021). meerva: Analysis of data with measurement error using a validation subsample. DOI: 10.32614/CRAN.package.meerva.
6. Lederer W, & Seibold H (2019). simex: SIMEX- and MCSIMEX-algorithm for measurement error models. DOI: 10.32614/CRAN.package.simex.
7. Lockwood JR (2018). eivtools: Measurement error modeling tools. DOI: 10.32614/CRAN.package.eivtools.
8. Lotspeich SC, Shepherd BE, Amorim GG, Shaw PA, & Tao R (2022). Efficient odds ratio estimation under two-phase sampling using error-prone data from a multi-national HIV research cohort. Biometrics, 78(4), 1674–1685. DOI: 10.1111/biom.13512. PMID: 34213008. PMCID: PMC8720323.