Logistic Box-Cox Regression to Assess the Shape and Median Effect under Uncertainty about Model Specification

Li Xing; Xuekui Zhang; Igor Burstyn; Paul Gustafson

arXiv:1901.11362·stat.ME·February 1, 2019


TL;DR

This paper introduces a Logistic Box-Cox regression approach to better understand the shape of exposure-disease relationships and accurately estimate the median effect under model uncertainty in epidemiologic studies.

Contribution

It develops a novel regression method that accounts for shape uncertainty and improves inference of the exposure-disease relationship and median effect.

Findings

01 The method accurately infers the shape of the relationship.

02 It provides precise estimates of the median effect.

03 The approach outperforms traditional two-step methods.


Tables

Table 1: The Simulation Settings

Distribution of $X$, $\mbox{LN}(\mu,\sigma^{2})$	Shape of Relationship, $\lambda$	Disease Rarity, $P(Y=1\mid X=\mbox{5th percentile})$	Exposure-Disease Association
weakly skewed, $\sigma=0.5$	log, $\lambda=0$	low, $P_{1}=0.02$	weak, $R_{1}=1.1$
mild skewed, $\sigma=1$	square-root, $\lambda=0.5$	high, $P_{2}=0.1$	mild, $R_{2}=2$
highly skewed, $\sigma=2$	linear, $\lambda=1$		strong, $R_{3}=5$
	square, $\lambda=2$		

Table 2: Estimates from the Misspecified Models, Estimated Median Effects, and ARE

$q$ value	Logistic Linear Model: AIC	Logistic Linear Model: Slope (SE)	Logistic Box-Cox Model: $\Delta_{q}^{*}$	ARE
$q=1$	$5390$	$-0.210\ (0.035)$	$-0.400$	$0.475$
$q=0.5$	$5381$	$-0.307\ (0.044)$	$-0.322$	$0.047$
$q=0$	$5384$	$-0.310\ (0.042)$	$-0.323$	$0.040$

Equations

\mbox{log}\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}x_{i}^{(\lambda)},

Q_{q}=\frac{d\left(\mbox{logit}(\mbox{E}(Y\mid X))\right)}{dX^{(q)}}=\beta_{1}X^{\lambda-q}.

\Delta_{q}=\mbox{E}(Q_{q})=\beta_{1}\mbox{exp}\left((\lambda-q)\mu+\frac{(\lambda-q)^{2}\sigma^{2}}{2}\right),

\Delta_{q}^{*}=\mbox{Median}(Q_{q})=\beta_{1}\mbox{exp}((\lambda-q)\mu).

\mbox{Avar}(\hat{\lambda})=\mbox{O}(\beta_{1}^{-2}).

\mbox{Avar}(\hat{\Delta}_{q}^{*})

W_{q}=X^{(q)}=\left\{\begin{array}{cc}\frac{X^{q}-1}{q}&\mbox{if }q>0,\\ \mbox{log}(X)&\mbox{if }q=0.\end{array}\right.

\mbox{logit}(\mbox{Pr}(Y=1\mid W_{q}=w_{q}))=\gamma_{0q}+\gamma_{1q}w_{q}.

\bm{E}\left[\left(\begin{array}{c}1\\ W_{q}\end{array}\right)\left(\mbox{expit}(\beta_{0}+\beta_{1}X^{(\lambda)})-\mbox{expit}(\gamma_{0q}+\gamma_{1q}W_{q})\right)\right]=\bm{0},

\mbox{Avar}(\hat{\bm{\gamma}}_{q})\approx J_{1}^{-1}(\hat{\bm{\gamma}}_{q})V_{1}(\hat{\bm{\gamma}}_{q})J_{1}^{-1}(\hat{\bm{\gamma}}_{q}),

\mbox{R}=\frac{\mbox{Pr}(Y=1\mid X\mbox{ is at 95-th percentile})}{\mbox{Pr}(Y=1\mid X\mbox{ is at 5-th percentile})}.

\mbox{ARE}=\left|\frac{\gamma_{1q}-\Delta_{q}^{*}}{\Delta_{q}^{*}}\right|,

\frac{\partial}{\partial X}\left[\mbox{Pr}(Y=1\mid X)\right]_{X=1}=\frac{\partial}{\partial X}\left[\mbox{expit}(\beta_{0}+\beta_{1}X^{(\lambda)})\right]_{X=1}=\frac{\beta_{1}\mbox{exp}(\beta_{0})}{(1+\mbox{exp}(\beta_{0}))^{2}}

l(\beta_{0},\beta_{1},\lambda\mid Y,X)=\sum\limits_{i=1}^{n}\left\{y_{i}\left[\beta_{0}+\beta_{1}\left(\frac{x_{i}^{\lambda}-1}{\lambda}\right)\right]-\mbox{log}\left[1+\mbox{exp}\left(\beta_{0}+\beta_{1}\left(\frac{x_{i}^{\lambda}-1}{\lambda}\right)\right)\right]\right\}

\left(\begin{array}{l}\frac{\partial l}{\partial\beta_{0}}\\ \frac{\partial l}{\partial\beta_{1}}\\ \frac{\partial l}{\partial\lambda}\end{array}\right)=\left(\begin{array}{l}\sum\limits_{i=1}^{n}(y_{i}-p_{i})\\ \sum\limits_{i=1}^{n}(y_{i}-p_{i})\nu_{i}\\ \sum\limits_{i=1}^{n}(y_{i}-p_{i})\beta_{1}\left(\frac{x^{\lambda}_{i}\ln{x_{i}}-\nu_{i}}{\lambda}\right)\end{array}\right),

H

\mbox{det}(H_{11})

\mbox{det}(H_{22})

H^{i}=\left(\begin{array}{ccc}p_{i}q_{i}&p_{i}q_{i}\nu_{i}&p_{i}q_{i}\beta_{1}\frac{\partial\nu_{i}}{\partial\lambda}\\ p_{i}q_{i}\nu_{i}&p_{i}q_{i}\nu_{i}^{2}&p_{i}q_{i}\beta_{1}\nu_{i}\frac{\partial\nu_{i}}{\partial\lambda}-(y_{i}-p_{i})\frac{\partial\nu_{i}}{\partial\lambda}\\ p_{i}q_{i}\beta_{1}\frac{\partial\nu_{i}}{\partial\lambda}&p_{i}q_{i}\beta_{1}\nu_{i}\frac{\partial\nu_{i}}{\partial\lambda}-(y_{i}-p_{i})\frac{\partial\nu_{i}}{\partial\lambda}&p_{i}(1-p_{i})\beta_{1}^{2}\left(\frac{\partial\nu_{i}}{\partial\lambda}\right)^{2}-(y_{i}-p_{i})\beta_{1}\frac{\partial^{2}\nu_{i}}{\partial\lambda^{2}}\end{array}\right)

\mbox{det}(H_{1\times 1}^{i})

\mbox{det}(H_{2\times 2}^{i})

\mbox{det}(H_{3\times 3}^{i})

I_{1}\left(\begin{array}{c}\beta_{0}\\ \beta_{1}\\ \lambda\end{array}\right)

\mbox{Avar}(\hat{\lambda})=\frac{1}{\mbox{det}[I_{1}]}\mbox{C}_{33}.

\lim_{\beta_{1}\to 0}p(x)=\frac{\mbox{exp}(\beta_{0})}{1+\mbox{exp}(\beta_{0})},

\lim_{\beta_{1}\to 0}\mbox{C}_{33}=\left[\frac{\mbox{exp}(\beta_{0})}{(1+\mbox{exp}(\beta_{0}))^{2}}\right]^{2}\frac{\mbox{exp}(2\lambda\mu)\left[\mbox{exp}(\lambda^{2}\sigma^{2})-\mbox{exp}\left(\frac{\lambda^{2}\sigma^{2}}{2}\right)\right]}{\lambda^{2}},

\lim_{\beta_{1}\to 0}\beta_{1}^{2}\,\mbox{det}(I_{1}(\theta))=\left[\frac{\mbox{exp}(\beta_{0})}{(1+\mbox{exp}(\beta_{0}))^{2}}\right]^{3}\frac{\mbox{exp}\left(\frac{3\lambda^{2}\sigma^{2}+8\lambda\mu}{2}\right)\left(\mbox{exp}\left(\frac{\lambda^{2}\sigma^{2}}{2}\right)-\frac{\lambda^{2}\sigma^{2}}{2}-1\right)\sigma^{2}}{2\lambda^{4}}

\int_{-\infty}^{+\infty}e^{-t^{2}}f(t)\,dt\approx\sum\limits_{i=1}^{m}w_{i}f(t_{i})

w_{i}=\frac{2^{m-1}m!\sqrt{\pi}}{m^{2}[H_{m-1}(t_{i})]^{2}}.

\nabla l_{1}(\bm{\gamma}_{q})=\left(\begin{array}{l}\frac{\partial l}{\partial\gamma_{0q}}\\ \frac{\partial l}{\partial\gamma_{1q}}\end{array}\right)=\left(\begin{array}{l}Y-P^{*}\\ (Y-P^{*})W_{q}\end{array}\right),

V_{1}(\bm{\gamma}_{q})


Full text


Li Xing, Mathematics and Statistics, University of Victoria, Victoria, BC, Canada

Xuekui Zhang, Mathematics and Statistics, University of Victoria, Victoria, BC, Canada

Igor Burstyn, Department of Environmental and Occupational Health, School of Public Health, Drexel University, Philadelphia, PA, USA

Paul Gustafson, Department of Statistics, University of British Columbia, Vancouver, BC, Canada

Correspondence to Paul Gustafson, Department of Statistics, University of British Columbia, 3182 Earth Sciences Building, 2207 Main Mall, Vancouver, BC, Canada, V6T 1Z4. Email: [email protected]

Abstract

The shape of the relationship between a continuous exposure variable and a binary disease variable is often central to epidemiologic investigations. This paper investigates a number of issues surrounding inference and the shape of the relationship. Presuming that the relationship can be expressed in terms of regression coefficients and a shape parameter, we investigate how well the shape can be inferred in settings which might typify epidemiologic investigations and risk assessment. We also consider a suitable definition of the median effect of exposure, and investigate how precisely this can be inferred. This is done both in the case of using a model acknowledging uncertainty about the shape parameter and in the case of ignoring this uncertainty and using a two-step method, where in step one we transform the predictor and in step two we fit a simple linear model with transformed predictor. All these investigations require a family of exposure-disease relationships indexed by a shape parameter. For this purpose, we employ a family based on the Box-Cox transformation.

Keywords: Shape of the Exposure-Disease Relationship $\cdot$ Median Predictive Effect $\cdot$ Factorial Design $\cdot$ Misspecified Model $\cdot$ Logistic Box-Cox Model $\cdot$ Quasi-Newton Method

1 Introduction

Epidemiologists are often confronted with skewed distributions of exposure or dose metrics (such as cumulative exposure) that are suspected to be related in a non-linear fashion to commonly employed functions of the risk of a health outcome, such as the log-odds used in logistic regression. For instance, there may be saturation and threshold effects, as well as reversals of the direction of association at different doses (e.g., drinking and heart health reported by Doll et al. [6]). Therefore, the assumption in logistic regression that the log-odds of disease are linear in exposure may not be valid. As a remedy, researchers transform the exposure measurements using a logarithmic or square-root function and then enter the transformed measurements into a logistic model as the predictor. This transformation step before model fitting aims to make the relationship between the log-odds and the transformed exposure closer to linear. However, such a two-step approach ignores the uncertainty about the nonlinear association by enforcing a logarithm or square-root function, and the choice of transformation lacks a theoretical justification. Therefore, we build a parsimonious one-step model for two purposes. First, we estimate a shape parameter in the model by maximum likelihood (ML), and this shape parameter indicates the most likely type of nonlinear association. Second, the estimated shape parameter identifies an optimal transformation, which, in practice, provides a justification for the type of transformation for those who prefer the two-step approach. Our discussion focuses on the general risk model for the association between a binary disease outcome, $Y$, and a continuous exposure variable, $X$. Assume $X\sim\mbox{LN}(\mu,\sigma^{2})$, as is often realistic for environmental exposures [19]. For the $i$-th subject, we have

$$\mbox{log}\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}x_{i}^{(\lambda)},$$

where $p_{i}=\mbox{E}(Y_{i}|X_{i}=x_{i})$, $x_{i}^{(\lambda)}=\left(x_{i}^{\lambda}-1\right)/\lambda$ for $\lambda>0$, $x_{i}^{(\lambda)}=\mbox{log}(x_{i})$ for $\lambda=0$, and $\lambda\geq 0$. The $x^{(\lambda)}$ function is a Box-Cox transformation [3]. Statistical models involving the Box-Cox transformation are discussed extensively in the literature. In linear regression, because the residuals are required to be normal, the Box-Cox transformation is often applied to the outcome variable [16, 5, 15, 3]. In both linear and logistic regression, the Box-Cox transformation is suggested for predictors in order to satisfy the linearity requirement between the linear predictor (the log-odds, in the logistic case) and the predictors [24, 18, 7]. We emphasize two desirable properties of the Box-Cox function: continuity at $\lambda=0$ and the ability to accommodate several familiar transformations (i.e., the logarithm function at $\lambda=0$, the linear function at $\lambda=1$, the square-root function at $\lambda=0.5$, and the square function at $\lambda=2$).

Because the logistic Box-Cox model is nonlinear, the gradient of its log-odds is no longer constant. We are interested in this gradient with respect to $X^{(q)}$ for some choice of $q$. In particular, we define

$$Q_{q}=\frac{d\left(\mbox{logit}(\mbox{E}(Y\mid X))\right)}{dX^{(q)}}=\beta_{1}X^{\lambda-q}.$$

The quantity $Q_{q}$ represents the instantaneous effect of the predictor on the $X^{(q)}$ scale. In the Box-Cox model, if we correctly specify $q=\lambda$, then $Q_{q}=\beta_{1}$ represents a constant effect on the $X^{(\lambda)}$ scale. For $q\neq\lambda$, the value of $Q_{q}$ changes with $X$, representing a non-constant effect on the $X^{(q)}$ scale. More specifically, $Q_{q}/\beta_{1}$ follows $\mbox{LN}((\lambda-q)\mu,(\lambda-q)^{2}\sigma^{2})$. Gelman and Pardoe [10] suggested averaging the effect of a predictor over the population distribution of predictors. Examples are shown in linear regression models [21] and in the survival analysis context [13]. We adapt their definitions to the logistic regression context to arrive at a summary of $Q_{q}$. We define the average effect, $\Delta_{q}$, and the median effect, $\Delta^{*}_{q}$, as summary measures of the effect of the predictor $X$ on the $X^{(q)}$ scale as follows:

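$$\Delta_{q}=\mbox{E}(Q_{q})=\beta_{1}\mbox{exp}\left((\lambda-q)\mu+\frac{(\lambda-q)^{2}\sigma^{2}}{2}\right),$$

and

$$\Delta_{q}^{*}=\mbox{Median}(Q_{q})=\beta_{1}\mbox{exp}((\lambda-q)\mu).$$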

Because the median is more robust than the mean for a long-tailed distribution, going forward we adopt the median effect, $\Delta^{*}_{q}$, as the representative of the overall gradient of the log-odds.
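The closed forms for the average and median effect follow from the log-normal distribution of $Q_{q}/\beta_{1}$, and they can be checked by simulation. The following is a minimal sketch, not the authors' code; the function name `effect_summaries` and all parameter values are illustrative assumptions.

```python
import numpy as np

def effect_summaries(beta1, lam, q, mu, sigma):
    """Closed-form mean and median of Q_q = beta1 * X**(lam - q), X ~ LN(mu, sigma^2)."""
    d = lam - q
    mean_effect = beta1 * np.exp(d * mu + 0.5 * d**2 * sigma**2)
    median_effect = beta1 * np.exp(d * mu)
    return mean_effect, median_effect

# Illustrative (hypothetical) parameter values, not taken from the paper.
beta1, lam, q, mu, sigma = 0.8, 0.5, 0.0, -1.0, 1.0

rng = np.random.default_rng(0)
x = rng.lognormal(mean=mu, sigma=sigma, size=200_000)
q_draws = beta1 * x ** (lam - q)          # Monte Carlo draws of Q_q

mean_cf, median_cf = effect_summaries(beta1, lam, q, mu, sigma)
print("mean effect:   closed form", mean_cf, " simulated", q_draws.mean())
print("median effect: closed form", median_cf, " simulated", np.median(q_draws))
```

In this sketch the mean exceeds the median whenever $q\neq\lambda$ and $\beta_{1}>0$, which reflects the long right tail that motivates summarizing $Q_{q}$ by its median.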

In Section 2, we provide two propositions on the MLE of the logistic Box-Cox model and propose an optimization algorithm to obtain the MLE. In Section 3, we discuss the misspecified logistic linear model and define a quantity to measure the distance between the median effect and the slope coefficient estimated from the misspecified model. In Section 4, we design and conduct simulation studies to evaluate the accuracy of the parameter estimates of the logistic Box-Cox model based on the quasi-Newton method, to compare the median effect and its approximation from a simple linear model with transformed predictor, and to calculate the asymptotic standard deviations of the model parameter estimates as well as that of the estimated median effect. In Section 5, we apply our model to a real data set and compare this model with three two-step approaches. In Section 6, we summarize our results and draw conclusions.

2 The Logistic Box-Cox Model

In this section, we prove that the log-likelihood function of the logistic Box-Cox model is strictly concave, so to obtain the MLE we only need to find the root of the score function. Based on this property and on optimization methods for this model in the literature, we use a quasi-Newton algorithm to compute the MLE. In addition, we use numerical methods to approximate the asymptotic variance of the MLE, which helps us understand the precision of the parameter estimates in large samples and also helps design future experiments.

Proposition 1

The Hessian matrix of the log-likelihood function of the logistic Box-Cox model is negative definite for any interior point in the three dimensional space $(-\infty,+\infty)\times(-\infty,+\infty)\times[0,+\infty)$ .

Corollary 2.0.1

The log-likelihood function is strictly concave and, therefore, any root of the score function is the unique global maximum of the likelihood function.

The proof of Proposition 1 is given in the appendix and the proof of the corollary is trivial. In the literature, there are two kinds of optimization methods for this model. Egger [7] mentioned the difficulty of convergence for the Newton-Raphson method in practice and suggested using the profile likelihood (PL) method, where we do a grid search on the shape parameter $\lambda$, use iteratively re-weighted least squares to estimate the regression coefficients, $\beta_{0}$ and $\beta_{1}$, for each fixed $\lambda$, and choose the set of estimates with the largest likelihood. Guerrero et al. [12] suggested a quasi-Newton method to estimate the parameters of the logistic Box-Cox model. Unlike the Newton-Raphson method, the quasi-Newton method replaces the inverse of the Hessian matrix by an approximation in each iteration. This can reduce the numerical instability of inverting a matrix. Because the log-likelihood is strictly concave, we choose the quasi-Newton method, but use the PL method to obtain the initial points. In particular, the quasi-Newton method that we employ is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization method [8, 4, 11, 23], which is available through a wrapper function in the R package maxLik [14].
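The two-stage strategy described above (a profile-likelihood grid for starting values, then a joint quasi-Newton fit) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses SciPy instead of the R package maxLik, fits the inner two-parameter problem with BFGS instead of iteratively re-weighted least squares, and uses the bounded L-BFGS-B variant in the joint stage to keep $\lambda\geq 0$. The function names and the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def box_cox(x, lam):
    """Box-Cox transform (x**lam - 1)/lam, with log(x) as the lam -> 0 limit."""
    log_x = np.log(x)
    if abs(lam) < 1e-10:
        return log_x
    return np.expm1(lam * log_x) / lam

def neg_loglik(theta, x, y):
    """Negative log-likelihood of logit(p) = b0 + b1 * x^(lam)."""
    b0, b1, lam = theta
    eta = b0 + b1 * box_cox(x, lam)
    return -np.sum(y * eta - np.logaddexp(0.0, eta))  # stable log(1 + exp(eta))

def fit_logistic_box_cox(x, y, lam_grid=np.linspace(0.0, 2.0, 9)):
    # Stage 1: profile likelihood. For each fixed lambda, fit (b0, b1) only.
    starts = []
    for lam in lam_grid:
        inner = minimize(lambda b: neg_loglik((b[0], b[1], lam), x, y),
                         x0=np.zeros(2), method="BFGS")
        starts.append((inner.fun, inner.x[0], inner.x[1], lam))
    start_nll, b0, b1, lam0 = min(starts)  # smallest negative log-likelihood
    # Stage 2: joint quasi-Newton refinement from the best grid point,
    # with lambda constrained to be non-negative.
    joint = minimize(neg_loglik, x0=[b0, b1, lam0], args=(x, y),
                     method="L-BFGS-B",
                     bounds=[(None, None), (None, None), (0.0, None)])
    return joint.x, joint.fun, start_nll

# Simulated example with illustrative (hypothetical) true values.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=-1.645, sigma=1.0, size=2000)
true_theta = (0.0, 1.36, 0.5)
eta_true = true_theta[0] + true_theta[1] * box_cox(x, true_theta[2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta_true)))

theta_hat, nll_hat, nll_start = fit_logistic_box_cox(x, y)
print("estimates (beta0, beta1, lambda):", theta_hat)
```

The grid stage guards against a poor starting value for $\lambda$, and the joint stage then removes the restriction of $\lambda$ to the grid.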

Proposition 2

$$\mbox{Avar}(\hat{\lambda})=\mbox{O}(\beta_{1}^{-2}).$$

The proof of Proposition 2 is also given in the appendix. This proposition demonstrates that, under a weak association between the predictor and the outcome (i.e., a small value of $\beta_{1}$), a large sample size is needed to estimate the shape parameter precisely, because $\mbox{Var}(\hat{\lambda})\approx\mbox{Avar}(\hat{\lambda})/n$. We calculate the asymptotic variance of the model parameters from the inverse of the Fisher information matrix through numerical methods, with details in the appendix, and we calculate the asymptotic variance of the median effect, $\Delta^{*}_{q}$, based on the multivariate delta method listed below.

[TABLE]
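As an illustration of the delta-method step, the sketch below propagates the covariance of $(\hat{\beta}_{1},\hat{\lambda})$ through $\Delta^{*}_{q}=\beta_{1}\mbox{exp}((\lambda-q)\mu)$. It assumes that $\mu$ and $q$ are treated as fixed, which may differ from the authors' derivation; the function name and the covariance values are hypothetical.

```python
import numpy as np

def median_effect_variance(beta1, lam, q, mu, cov_beta1_lam):
    """Delta-method variance of beta1 * exp((lam - q) * mu).

    cov_beta1_lam is the 2x2 covariance matrix of (beta1_hat, lam_hat);
    mu and q are treated as known constants (an assumption of this sketch).
    """
    scale = np.exp((lam - q) * mu)
    # Gradient with respect to (beta1, lam).
    grad = np.array([scale, beta1 * mu * scale])
    return float(grad @ np.asarray(cov_beta1_lam) @ grad)

# Hypothetical inputs, chosen only to exercise the function.
cov = np.array([[0.04, 0.00],
                [0.00, 0.01]])
var_at_mu0 = median_effect_variance(beta1=0.8, lam=0.5, q=0.0, mu=0.0, cov_beta1_lam=cov)
var_general = median_effect_variance(beta1=0.8, lam=0.5, q=0.0, mu=-1.0, cov_beta1_lam=cov)
print(var_at_mu0, var_general)
```

When $\mu=0$ the median effect equals $\beta_{1}$ for every $q$, so the result reduces to the variance of $\hat{\beta}_{1}$, which gives a quick sanity check.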

3 The Misspecified Logistic Linear Model

Assume that the true model is a logistic Box-Cox model. We are interested in the bias incurred if we fit a misspecified logistic linear model with a Box-Cox transformed $X$ as a predictor. In the misspecified model, the type of Box-Cox transformation is given beforehand, which means the shape parameter, $q$ , is a fixed constant. We denote the transformed predictor as $W_{q}$ , where

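$$W_{q}=X^{(q)}=\left\{\begin{array}{cc}\frac{X^{q}-1}{q}&\mbox{if }q>0,\\ \mbox{log}(X)&\mbox{if }q=0.\end{array}\right.$$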

The misspecified model is

$$\mbox{logit}(\mbox{Pr}(Y=1\mid W_{q}=w_{q}))=\gamma_{0q}+\gamma_{1q}w_{q}.\qquad(13)$$

To obtain the large-sample limit of the estimated coefficients, $(\hat{\gamma}_{0q},\hat{\gamma}_{1q})$ , we need to solve the following equations:

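$$\bm{E}\left[\left(\begin{array}{c}1\\ W_{q}\end{array}\right)\left(\mbox{expit}(\beta_{0}+\beta_{1}X^{(\lambda)})-\mbox{expit}(\gamma_{0q}+\gamma_{1q}W_{q})\right)\right]=\bm{0},\qquad(14)$$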

where $\mbox{expit}(\cdot)=\mbox{exp}(\cdot)/(1+\mbox{exp}(\cdot))$. Because the likelihood is misspecified, the inverse of the Fisher information matrix no longer provides the asymptotic variances of the parameter estimates. Therefore, we use sandwich-type estimates [25, 9] for the variances, whereby

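$$\mbox{Avar}(\hat{\bm{\gamma}}_{q})\approx J_{1}^{-1}(\hat{\bm{\gamma}}_{q})V_{1}(\hat{\bm{\gamma}}_{q})J_{1}^{-1}(\hat{\bm{\gamma}}_{q}),$$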

where $J_{1}=\mbox{E}(H(l_{1}))$ with $l_{1}$ representing the log-likelihood function of model (13) and $H(l_{1})$ representing its Hessian matrix, $V_{1}=\mbox{Var}(\nabla l_{1})$, and $\hat{\bm{\gamma}}_{q}=(\hat{\gamma}_{0q},\hat{\gamma}_{1q})^{T}$ is the solution of (14). More detailed mathematical work is provided in the appendix.

4 Simulation Studies

In the simulation studies, our aims are three-fold: (1) evaluating the accuracy of the parameter estimates in the logistic Box-Cox model based on the BFGS method, (2) comparing the distance between the median effect from the underlying logistic Box-Cox model with its approximation, the large sample limit of the estimate of the slope parameter from the misspecified linear model, and (3) calculating the asymptotic standard deviations of the model parameter estimates as well as that of the estimated median effect. To achieve these aims we design a factorial experiment, using factors whose levels reflect plausible contexts for epidemiologic investigations.

4.1 Simulation Design

We choose four factors to control our simulation experiment, which are:

1. the shape of the exposure distribution;

2. the shape of the exposure-disease relationship;

3. the disease rarity;

4. the strength of the exposure-disease association.

Table 1 shows the levels of each factor. First, without loss of generality, we fix the $95$-th percentile of the distribution of $X$ at $1$ and vary $\sigma$ to control the skewness of the distribution of $X$. Second, we vary the shape parameter, $\lambda$, over $0,0.5,1$ and $2$, which correspond to the log, square-root, linear and square functions respectively. Third, we use the probability of disease at the $5$-th percentile of the exposure to indicate the disease rarity, varying this as $\mbox{P}_{1}=0.02$ and $\mbox{P}_{2}=0.1$. Fourth, we consider the ratio of the probability of disease at the $95$-th percentile of the exposure to the probability of disease at the $5$-th percentile, which is denoted as

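$$\mbox{R}=\frac{\mbox{Pr}(Y=1\mid X\mbox{ is at 95-th percentile})}{\mbox{Pr}(Y=1\mid X\mbox{ is at 5-th percentile})}.$$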

We let $\mbox{R}_{1}=1.1,\mbox{R}_{2}=2,$ and $\mbox{R}_{3}=5$ to represent weak, medium and strong associations respectively.
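The four design factors determine $(\mu,\beta_{0},\beta_{1})$ once $\sigma$, $\lambda$, the disease probability at the 5-th percentile, and $\mbox{R}$ are chosen. The sketch below is one way to back out these values from the constraints stated above; it is a derivation made for illustration, not code from the paper, and the function names are hypothetical. It uses the fact that $X=1$ gives $X^{(\lambda)}=0$ for every $\lambda$, so the log-odds at the 95-th percentile equal $\beta_{0}$.

```python
import itertools
import numpy as np
from scipy.special import expit, logit
from scipy.stats import norm

def box_cox(x, lam):
    """Box-Cox transform with log(x) as the lam -> 0 limit."""
    return np.log(x) if abs(lam) < 1e-10 else np.expm1(lam * np.log(x)) / lam

def design_parameters(sigma, lam, p5, ratio):
    """Return (mu, beta0, beta1) implied by one simulation setting."""
    z = norm.ppf(0.95)
    mu = -z * sigma                   # 95th percentile exp(mu + z*sigma) equals 1
    x05 = np.exp(mu - z * sigma)      # 5th percentile of X
    beta0 = logit(ratio * p5)         # risk at X = 1 is ratio * p5
    beta1 = (logit(p5) - beta0) / box_cox(x05, lam)  # risk at x05 is p5
    return mu, beta0, beta1

# Enumerate the full factorial design from Table 1.
settings = list(itertools.product((0.5, 1.0, 2.0),          # sigma
                                  (0.0, 0.5, 1.0, 2.0),     # lambda
                                  (0.02, 0.1),              # P at 5th percentile
                                  (1.1, 2.0, 5.0)))         # risk ratio R
print(len(settings), "settings")
for sigma, lam, p5, ratio in settings[:3]:
    print((sigma, lam, p5, ratio), design_parameters(sigma, lam, p5, ratio))
```

The product of the factor levels, $3\times 4\times 2\times 3$, gives the 72 settings referred to in the text.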

Figure 1 shows the disease risk as a function of the exposure in the described 72 settings. In each panel, the distribution of exposure and the disease rarity are fixed. The risk functions vary with the shape parameter in the model and with the risk ratio indicating the strength of the association. We can see that, given the distribution of exposure, as the exposure-disease association becomes stronger, the risk differences between different shape parameters at the same exposure level become larger. Also, given the association, as the distribution becomes more skewed, the risk differences between different shape parameters at the same exposure level become larger. These patterns indicate that the skewness and the strength of association may be related to the precision in estimating the shape parameter. The figure also illustrates the magnitude of the risk and the gradient of the log-odds with respect to $X$ under different experimental settings. This can help us understand real data under similar conditions and also guide future experiments.

4.2 Simulation Results

4.2.1 Aim 1: Evaluation of the BFGS method

In this simulation, under each setting, we generate $500$ data sets. For each data set we generate $5000$ values of $X$ from $\mbox{LN}(\mu,\sigma^{2})$ and the corresponding values of $Y$ from the Bernoulli distribution with probability $\mbox{P}=\mbox{expit}(\beta_{0}+\beta_{1}X^{(\lambda)})$. For each data set, we first apply the PL method and use its estimates as the initial points for the BFGS method.

Figure 2 demonstrates that the ML estimation implemented with the BFGS algorithm provides fairly accurate estimation of $\lambda$ when the exposure variable and disease outcome have some degree of association. When their association is very weak, the bias and RMSE are considerably larger. However, with the medium level of association the bias is much smaller than $0.5$ . This suggests we can easily distinguish a linear transformation from a square-root one or a square-root one from a log one. The stronger the association is the more accurate the estimates are. The results confirm that the BFGS method works well for our model fitting. The figures for bias and RMSE in estimating other parameters are provided in the appendix.

4.2.2 Aim 2: The Gradient Measurement of the Logistic Box-Cox Model and Its Estimate

In the logistic Box-Cox model, we use the median effect, $\Delta^{*}_{q}$, to represent the gradient of the log-odds on the $X^{(q)}$ scale. We hope that, if the sample is large enough, the estimated slope coefficient, $\hat{\gamma}_{1q}$, from the misspecified logistic linear model with $X^{(q)}$ as the predictor can be a good approximation of $\Delta^{*}_{q}$. Therefore, we define the absolute relative error (ARE) to measure the difference between the large-sample limit, $\gamma_{1q}$, and $\Delta^{*}_{q}$. That is,

$$\mbox{ARE}=\left|\frac{\gamma_{1q}-\Delta_{q}^{*}}{\Delta_{q}^{*}}\right|,$$

where $\gamma_{1q}$ is defined in the misspecified model (13). Under each of the $72$ settings, we fix the levels of the other three factors and let the value of $\lambda$ vary from $0$ to $2$ with an increment of $0.25$. For each $\lambda$, we generate $50,000$ values of $X$ from $\mbox{LN}(\mu,\sigma^{2})$ and compute the corresponding probabilities $\mbox{P}=\mbox{expit}(\beta_{0}+\beta_{1}X^{(\lambda)})$. We vary $q$ from $0$ to $2$ with an increment of $0.25$. For each $q$, we approximate the expectation in equation (14) by the sample mean over the $50,000$ values of $X$ and then solve the equations to get the limiting coefficient, $\gamma_{1q}$. Because the true value of $\Delta^{*}_{q}$ is known, we obtain the numerically approximated AREs as a function of $\lambda$ and $q$ under each setting.
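The numerical procedure just described can be sketched as follows: draw a large sample of $X$, compute the true probabilities, solve the sample version of the estimating equations for the limiting coefficients, and compare $\gamma_{1q}$ with $\Delta^{*}_{q}$. This is an illustrative sketch, not the authors' code. The Newton solver, the function names, and the parameter values (chosen to resemble a rare-disease, weak-association setting) are assumptions.

```python
import numpy as np
from scipy.special import expit, logit

def box_cox(x, lam):
    """Box-Cox transform with log(x) as the lam -> 0 limit."""
    return np.log(x) if abs(lam) < 1e-10 else np.expm1(lam * np.log(x)) / lam

def limiting_gamma(x, p_true, q, max_iter=100, tol=1e-12):
    """Solve mean[(1, W_q)^T (p_true - expit(g0 + g1 * W_q))] = 0 by Newton's method."""
    w = box_cox(x, q)
    design = np.column_stack([np.ones_like(w), w])
    gamma = np.array([logit(p_true.mean()), 0.0])   # intercept-only starting point
    for _ in range(max_iter):
        p_fit = expit(design @ gamma)
        score = design.T @ (p_true - p_fit) / len(x)
        info = (design * (p_fit * (1.0 - p_fit))[:, None]).T @ design / len(x)
        step = np.linalg.solve(info, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma

def absolute_relative_error(beta0, beta1, lam, q, mu, sigma, n=50_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.lognormal(mean=mu, sigma=sigma, size=n)
    p_true = expit(beta0 + beta1 * box_cox(x, lam))
    gamma1 = limiting_gamma(x, p_true, q)[1]
    median_effect = beta1 * np.exp((lam - q) * mu)   # Delta*_q
    return abs((gamma1 - median_effect) / median_effect)

# Illustrative (hypothetical) setting: true shape lambda = 0.5.
params = dict(beta0=-3.79, beta1=0.06, lam=0.5, mu=-1.645, sigma=1.0)
are_by_q = {q: absolute_relative_error(q=q, **params) for q in (0.0, 0.5, 1.0)}
print(are_by_q)
```

When $q=\lambda$ the misspecified model coincides with the true model, so the solver returns $\gamma_{1q}=\beta_{1}=\Delta^{*}_{q}$ and the ARE is zero up to numerical error.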

Figure 3 shows ARE as a function of $\lambda$ and $q$ under the settings with rare disease, weak association, and a medium skewed distribution. The pattern is similar for all the settings. We can see that, for each $\lambda$, when $q$ approaches $\lambda$ from the right side, ARE decreases monotonically and the rate of decrease is close to constant. When $q$ approaches $\lambda$ from the left side, ARE behaves like a quadratic function, with a maximum point between $0$ and $\lambda$. Therefore, to get a smaller ARE, we suggest that it is safer to guess $q=0$. Though the patterns are similar among all settings, the ARE inflates under conditions of strong association, common disease, and a more skewed distribution of $X$.

4.2.3 Aim 3: Calculation of the Asymptotic Standard Deviations

Without loss of generality, we calculate the asymptotic standard deviation (ASD) of $\hat{\lambda}$ for a dataset with one observation based on numerical approximation of the inverse of the expected Fisher information matrix described in the appendix.

Figure 4 demonstrates that $\mbox{ASD}(\hat{\lambda})$ decreases when the association becomes stronger, when the disease becomes more common, or when the predictor is more skewed, given that the other conditions do not change. In particular, under a weak association, $\mbox{R}_{1}$, and a rare disease, $\mbox{P}_{1}$, $\mbox{ASD}(\hat{\lambda})$ is much larger than in the other situations, which indicates that when information is weak, it is harder to determine the value of $\lambda$. The numerically approximated $\mbox{ASD}(\hat{\lambda})$ can help us design future studies. For example, if we would like to detect a difference of $0.5$ in the estimate of $\lambda$ in order to distinguish between a logarithm transformation and a square-root transformation, the standard error (SE) of $\hat{\lambda}$ should be less than $0.125$. We can achieve this by increasing the sample size. Under $\mbox{P}_{1},\mbox{R}_{1},\lambda=0$ and the weakly skewed exposure, $\mbox{ASD}(\hat{\lambda})\approx 700$, so the sample size required to make $\mbox{SE}(\hat{\lambda})=0.125$ is $(700/0.125)^{2}=31,360,000$. Therefore, any sample size larger than $31,360,000$ provides the power to distinguish a difference of $0.5$ in the estimate of $\lambda$ under the weakest condition, while this requirement decreases to less than one fourth of that number when the condition changes to $\mbox{P}_{2}$ and the others remain the same. Note that in virtually all cases, it is not feasible to recruit around 30 million participants in a study. To achieve this precision, the smallest required sample size among all of the settings considered is only $(2.914/0.125)^{2}=544$, which occurs under $\mbox{P}_{2},\mbox{R}_{2},\lambda=0,$ and $\sigma=2$.
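The sample-size arithmetic above follows from $\mbox{SE}(\hat{\lambda})\approx\mbox{ASD}(\hat{\lambda})/\sqrt{n}$, so the required $n$ is $(\mbox{ASD}/\mbox{SE})^{2}$ rounded up. A small sketch that reproduces the two figures quoted in the text (the helper name `required_sample_size` is an assumption):

```python
import math

def required_sample_size(asd_single_obs, target_se):
    """Smallest n such that asd_single_obs / sqrt(n) <= target_se."""
    return math.ceil((asd_single_obs / target_se) ** 2)

n_weakest = required_sample_size(700, 0.125)     # P1, R1, lambda = 0, weakly skewed
n_smallest = required_sample_size(2.914, 0.125)  # P2, R2, lambda = 0, sigma = 2
print(n_weakest)   # 31360000
print(n_smallest)  # 544
```

Because the required $n$ grows with the square of the ASD, halving the ASD cuts the required sample size to one fourth.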

We also calculate $\mbox{ASD}(\hat{\gamma}_{1q})$ for a single-observation dataset based on the sandwich method for the misspecified likelihood and numerical methods, as discussed in Section 3 and the appendix. In addition, we vary $q=0,0.5,1,$ and $2$ to understand the difference across $q$ .

Figure 5 illustrates that $\mbox{ASD}(\hat{\gamma}_{1q})$ inflates under two extreme conditions. One is under weak association, rare disease, a mildly skewed distribution of $X$, and $q=2$. The other is under strong association, common disease, and $q=2$. When there is weak association, rare disease, and a mildly skewed distribution, we cannot get a precise estimate of the slope based on the misspecified linear model on any of the examined scales of $X$. On the other hand, when there is strong association and common disease, we cannot get a precise estimate of the slope if we enforce a linear pattern on the square scale. In general, $\mbox{ASD}(\hat{\gamma}_{1q})$ with $q=0$ is relatively low under all situations, though the precision worsens slightly when $\lambda$ is further from $0$. This implies that when there is little information, a logarithm transformation is a safer guess.

Finally, we calculate $\mbox{ASD}(\hat{\Delta}^{*}_{q})$ based on the multivariate delta method.

Figure 6 illustrates that under all experimental settings, $\mbox{ASD}(\hat{\Delta}^{*}_{q})$ is monotonically increasing as a function of $q$ . This makes sense since when $q$ becomes smaller, $\mbox{ASD}(\hat{\Delta}^{*}_{q})$ shows the gradient at a slower changing scale. Therefore, $\mbox{ASD}(\hat{\Delta}^{*}_{0})$ is always the smallest for each setting, which indicates precise estimation of the median effect on the log scale.

5 Application

We analyze data from the National Health and Nutrition Examination Survey (NHANES) $2009$-$2010$, which involves $9,781$ adults aged $40$ years and above with measurements of both total blood mercury and depression. The exposure variable, $X$, is the total blood mercury in micrograms per liter (μg/L), and the binary outcome, $Y$, is dichotomized from the score of the Patient Health Questionnaire-$9$ (PHQ-$9$), with $0$ indicating no depression (PHQ-$9$ score $\leq 9$) and $1$ indicating depression (PHQ-$9$ score $\geq 10$).

As shown in Figure 7, the total blood mercury is right-skewed, and its distribution is approximated by the log-normal with $\hat{\mu}=-0.12$ and $\hat{\sigma}=0.93$. We fit the logistic Box-Cox model relating the blood mercury level to the prevalence of depression. The estimated parameters are $\hat{\beta}_{0}=-2.469$ (SE $0.046$), $\hat{\beta}_{1}=-0.317$ (SE $0.046$), and $\hat{\lambda}=0.392$ (SE $0.191$). Therefore, the estimated prevalence of depression at $X=0$ is $\hat{\mbox{Pr}}(Y=1|X=0)=\mbox{expit}(\hat{\beta}_{0}+\hat{\beta}_{1}X^{(\hat{\lambda})})|_{X=0}=16\%$. The instantaneous rate of decline in the prevalence of depression at a given level of total blood mercury can also be calculated. For instance, at $X=1$, this rate is

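$$\frac{\partial}{\partial X}\mbox{Pr}(Y=1|X)\Big|_{X=1}=\beta_{1}\,\mbox{expit}(\beta_{0})\{1-\mbox{expit}(\beta_{0})\}.$$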

Plugging in the estimated coefficients, we obtain the estimate $-0.022$. The estimated median effect on $X$, $\hat{\Delta}^{*}_{1}$, is $-0.613$ with the $95\%$ point-wise confidence interval $[-0.978,-0.279]$, showing an overall negative association between the total blood mercury and depression. Next, we compare this fitted Box-Cox model with the misspecified linear model on the $X^{(q)}$ scale for $q=0,0.5$ and $1$. In Table 2, we include the estimated slope coefficients and the Akaike information criterion (AIC) of the misspecified models, the corresponding estimated median effects from the logistic Box-Cox model, and the estimated ARE between the estimated median effect and the estimated slope coefficient from the data. The AIC is smallest at $q=0.5$, which suggests that the square-root model is the best among the three misspecified models. In terms of ARE, on the other hand, the smallest value occurs between the log model and the logistic Box-Cox model.
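The two plug-in quantities reported above can be checked with a few lines of arithmetic. The sketch below assumes the standard Box-Cox form $X^{(\lambda)}=(X^{\lambda}-1)/\lambda$ (with $\log X$ at $\lambda=0$) and obtains the rate of change from the chain rule; the function names are ours and purely illustrative.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def box_cox(x, lam):
    """Assumed standard Box-Cox transform; log(x) when lam == 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def risk(x, b0, b1, lam):
    """Pr(Y=1 | X=x) under the logistic Box-Cox model."""
    return expit(b0 + b1 * box_cox(x, lam))

def risk_slope(x, b0, b1, lam):
    """d Pr(Y=1 | X=x) / dx by the chain rule: p(1-p) * b1 * x^(lam-1)."""
    p = risk(x, b0, b1, lam)
    return p * (1.0 - p) * b1 * x ** (lam - 1.0)

# Point estimates reported in the text.
b0, b1, lam = -2.469, -0.317, 0.392
p0 = risk(0.0, b0, b1, lam)            # estimated prevalence at X = 0
slope1 = risk_slope(1.0, b0, b1, lam)  # instantaneous rate at X = 1
print(round(p0, 3), round(slope1, 4))
```

Under these assumptions the prevalence at $X=0$ comes out at about $0.16$ and the rate at $X=1$ at about $-0.0228$, in line with the reported $16\%$ and $-0.022$ up to rounding.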

To illustrate the local pattern of the relationship between the total blood mercury level and depression, we use a three-step procedure. First, we sort the data by blood mercury level from small to large. Second, we bin every $500$ samples in this order, with the last group containing all the remaining $781$ samples. Third, we plot the observed risk over the range of the blood mercury level in Figure 8. The local pattern shows an overall decrease in risk as the total blood mercury level increases. The curves of the estimated risks from the logistic Box-Cox model and the three misspecified logistic linear models over the range of the total blood mercury are also added in Figure 8, showing that the estimated risks of the square-root model are closest to those of the logistic Box-Cox model.
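The three-step binning procedure can be sketched as follows. This is a minimal illustration on simulated data, not the NHANES analysis itself; summarizing each bin by its mean exposure is our assumption, and the simulated risk of $0.08$ is a placeholder.

```python
import random

def binned_risk(x, y, bin_size=500):
    """Sort by exposure, bin consecutive samples, and fold the short
    tail into the last bin. Returns (mean exposure, observed risk, size)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n_bins = max(1, len(order) // bin_size)
    out = []
    for b in range(n_bins):
        start = b * bin_size
        # The last bin absorbs all remaining samples.
        stop = (b + 1) * bin_size if b < n_bins - 1 else len(order)
        idx = order[start:stop]
        mean_x = sum(x[i] for i in idx) / len(idx)
        obs_risk = sum(y[i] for i in idx) / len(idx)
        out.append((mean_x, obs_risk, len(idx)))
    return out

# Simulated stand-in with the same sample size as the application.
rng = random.Random(0)
x = [rng.lognormvariate(-0.12, 0.93) for _ in range(9781)]
y = [1 if rng.random() < 0.08 else 0 for _ in x]
bins = binned_risk(x, y)
print(len(bins), bins[-1][2])
```

With $9,781$ samples and bins of $500$, this yields $19$ bins, the last of which holds $781$ samples, matching the description above.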

In addition to the graphical illustration, we compare the goodness of fit of the four models and also compare their predictions. We conduct the Hosmer and Lemeshow goodness of fit (GOF) test [17] for the four models. The test statistic sums the squared, standardized differences between the observed and the expected numbers of events over pre-defined subgroups. To avoid a result that depends on the number of subgroups, we vary the number of subgroups from $5$ to $12$ and conduct the test for each partition. The resulting p-values are reported in Figure 9, which demonstrates that the square-root model is comparable to the Box-Cox model, while the logarithmic model is the worst in terms of goodness of fit.
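For readers who want to reproduce this kind of check, the following is a minimal sketch of the Hosmer and Lemeshow statistic with a varying number of subgroups. It assumes groups are formed from equal-sized quantiles of the fitted risks and uses the usual reference distribution, chi-square with (number of groups $-$ 2) degrees of freedom; the data are simulated and the helper name is ours.

```python
import random

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, degrees of freedom) for the Hosmer-Lemeshow test.
    y: 0/1 outcomes, p: fitted risks, grouped by quantiles of p."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        n_g = len(idx)
        observed = sum(y[i] for i in idx)   # observed events in the group
        expected = sum(p[i] for i in idx)   # expected events in the group
        p_bar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * p_bar * (1.0 - p_bar))
    return stat, groups - 2

# Simulated, well-calibrated risks as a stand-in for a fitted model.
rng = random.Random(1)
p = [rng.uniform(0.02, 0.30) for _ in range(5000)]
y = [1 if rng.random() < pi else 0 for pi in p]
results = {g: hosmer_lemeshow(y, p, g) for g in range(5, 13)}
for g, (stat, df) in results.items():
    print(g, round(stat, 2), df)
```

The p-value is the upper tail of the chi-square distribution at the statistic; if SciPy is available, `scipy.stats.chi2.sf(stat, df)` computes it.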

We also compare the predictions of the four models using 10-fold cross-validation, where we split the total samples equally into ten subgroups. Nine of the $10$ subgroups are combined as the training set used to fit the model, while the remaining one is the test set used for prediction based on the model fitted to the training set. After iterating over all possible combinations of nine subgroups, the predicted risks from all the test sets together form the prediction for all of the samples. We use the R package caret [20] for the data partitions, since its functions generate random samples within each level of the outcome and, therefore, the splits have balanced class distributions. To compare the predictions, since all of the models have the same receiver operating characteristic (ROC) curve, we use two criteria: the mean absolute error and the mean squared error of the estimated risks. The mean absolute errors from the logistic Box-Cox model, the linear, the square-root and the log models are $0.1465,0.1466,0.1465,$ and $0.1466$, while their mean squared errors are $0.0736,0.0736,0.0736,$ and $0.0737$. The errors from the different models are close to each other, which is mainly due to the low exposure-disease association (the estimated risk ratio between the $95$th percentile and the $5$th percentile is $0.31$; see the (2, 2) panel of Figure 1). In summary, we conclude that the square-root model is comparable to the logistic Box-Cox model, and both outperform the log model. It is important to note that our analysis of NHANES data was not meant to illuminate the association between mercury and depression, as it is most likely confounded to a degree that makes it impossible to argue that mercury protects against depression [22].
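The cross-validation scheme can be sketched in a few lines. The example below is a stand-in written in plain Python rather than the caret workflow used in the analysis: folds are assigned separately within each outcome level, a one-predictor logistic model is fitted by Newton-Raphson on a fixed transformed scale, and the errors are computed between the observed outcome and the out-of-fold predicted risk. The simulated data and coefficient values are hypothetical.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iters=25):
    """Newton-Raphson for logit Pr(Y=1) = b0 + b1 * x."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = expit(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def stratified_folds(y, k=10, seed=0):
    """Assign fold labels within each outcome level to balance classes."""
    rng = random.Random(seed)
    fold = [0] * len(y)
    for level in (0, 1):
        idx = [i for i, yi in enumerate(y) if yi == level]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold[i] = pos % k
    return fold

def cv_errors(x, y, k=10, seed=0):
    """Out-of-fold mean absolute and mean squared error of predicted risk."""
    fold = stratified_folds(y, k, seed)
    pred = [0.0] * len(y)
    for f in range(k):
        train = [i for i in range(len(y)) if fold[i] != f]
        b0, b1 = fit_logistic([x[i] for i in train], [y[i] for i in train])
        for i in range(len(y)):
            if fold[i] == f:
                pred[i] = expit(b0 + b1 * x[i])
    mae = sum(abs(yi - pi) for yi, pi in zip(y, pred)) / len(y)
    mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
    return mae, mse

# Hypothetical data: square-root scale, rare outcome, weak association.
rng = random.Random(2)
xq = [math.sqrt(rng.lognormvariate(-0.12, 0.93)) for _ in range(2000)]
y = [1 if rng.random() < expit(-2.2 - 0.3 * v) else 0 for v in xq]
mae, mse = cv_errors(xq, y)
print(round(mae, 4), round(mse, 4))
```

When the predicted risks barely vary, both errors are driven by the overall prevalence: for a prevalence near $8\%$, a constant prediction already gives a mean absolute error of about $0.147$ and a mean squared error of about $0.074$, which is why the four models are hard to separate on these criteria.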

6 Discussion

The logistic Box-Cox model is a formal method that can accommodate a non-linear relationship between the log-odds and exposure via a shape parameter. The estimate of this parameter is obtained by the ML method. In particular, we discuss the profile likelihood (PL) and the quasi-Newton methods. The profile likelihood may lead to a local maximum. The quasi-Newton method targets the global maximum, but it is sensitive to the initial point. We recommend using the PL method to provide the initial values for the quasi-Newton method to guarantee a good starting point. In this way, we borrow strength from both methods in an attempt to obtain a superior overall approach.
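The recommended two-stage fit can be sketched as follows, assuming NumPy and SciPy are available. The first stage profiles the likelihood over a grid of $\lambda$ values, and the second stage refines all three parameters jointly with BFGS (a quasi-Newton method) started from the best grid point. The grid, the simulated data, and the function names are illustrative assumptions, not the authors' implementation; for convenience the inner fixed-$\lambda$ fits also use BFGS.

```python
import numpy as np
from scipy.optimize import minimize

def box_cox(x, lam):
    """Assumed standard Box-Cox transform; log(x) at lam = 0."""
    return np.log(x) if abs(lam) < 1e-8 else (x ** lam - 1.0) / lam

def neg_loglik(theta, x, y):
    """Negative Bernoulli log-likelihood of the logistic Box-Cox model."""
    b0, b1, lam = theta
    eta = b0 + b1 * box_cox(x, lam)
    # log(1 + exp(eta)) evaluated stably via logaddexp.
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

def profile_start(x, y, grid):
    """Stage 1: for each lam on the grid, maximize over (b0, b1) only."""
    best = None
    for lam in grid:
        fit = minimize(lambda b: neg_loglik([b[0], b[1], lam], x, y),
                       x0=[0.0, 0.0], method="BFGS")
        if best is None or fit.fun < best[0]:
            best = (fit.fun, fit.x[0], fit.x[1], lam)
    return np.array(best[1:])

def fit_logistic_box_cox(x, y, grid=np.linspace(-1.0, 2.0, 13)):
    """Stage 2: joint quasi-Newton refinement from the profile optimum."""
    start = profile_start(x, y, grid)
    fit = minimize(neg_loglik, start, args=(x, y), method="BFGS")
    return start, fit

# Hypothetical data: strong association, common disease, true lam = 0.5.
rng = np.random.default_rng(0)
x = rng.lognormal(-0.12, 0.93, size=3000)
p = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * box_cox(x, 0.5))))
y = rng.binomial(1, p)
start, fit = fit_logistic_box_cox(x, y)
print(start, fit.x)
```

Starting the joint optimizer from the profile optimum means its objective can only improve on the grid solution, which is the sense in which the two methods complement each other.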

Because the model is non-linear, the gradient of the log-odds with respect to the predictor is not constant. This motivates us to define the median effect, which represents the gradient over the entire distribution of the predictor. We generalize this quantity to the $X^{(q)}$ scale. In this way, we can compare it with the slope from the misspecified model based on the power transformation with parameter $q$. The ARE is a measure of the distance between the large-sample limiting value of the slope estimate and the median effect on the same scale. We see that even when the model is misspecified and there is little information, the slope estimated under the log transformation can be close to the median effect relative to the magnitude of the median effect.

We calculate, by numerical methods, the asymptotic standard deviation of the estimate of the shape parameter, that of the estimated slope parameter on a given scale, and that of the estimated median effect on a given scale. These quantities can help us design future studies. For instance, if we have prior knowledge about a nonlinear relationship and a skewed exposure, we can estimate the required sample size based on the desired accuracy for the estimate of the shape parameter. For completed studies with limited sample sizes, if the disease is rare and the association is not strong, the logarithm transformation provides a stable measurement, since the more complex logistic Box-Cox model is less helpful in this case due to the large uncertainty in its parameter estimates.
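As a rough illustration of the design calculation, suppose the asymptotic standard deviation is expressed for a single observation, so that the large-sample standard error with $n$ observations is approximately $\mbox{ASD}/\sqrt{n}$. The required sample size then follows by inverting this relation. The numerical values below are hypothetical and the helper name is ours.

```python
import math

def required_n(asd_single_obs, target_se):
    """Smallest n with asd_single_obs / sqrt(n) <= target_se,
    using the large-sample approximation SE = ASD / sqrt(n)."""
    return math.ceil((asd_single_obs / target_se) ** 2)

# Hypothetical ASD for the shape parameter and desired standard error.
n = required_n(asd_single_obs=12.0, target_se=0.25)
print(n)
```

Because the approximation is asymptotic, the resulting $n$ is best read as a planning guide rather than a guarantee, particularly in the rare-disease, weak-association settings where the ASD itself is large.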

7 Acknowledgements

This research is supported by NSERC through the Discovery Grants program, through the Canada Research Chair program, and through the NSERC Postdoctoral Fellowships Program and by the University of Victoria through a UVic Internal Research Grant and the UVic Faculty of Science.

References


[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions, 10th printing with corrections. Dover, 1972.
[2] R. Bartle. The elements of integration and Lebesgue measure. Wiley Interscience, 1995.
[3] G. Box and D. Cox. An Analysis of Transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26(2):211–252, 1964.
[4] C. G. Broyden. The Convergence of a Class of Double-rank Minimization Algorithms 1. General Considerations. IMA Journal of Applied Mathematics, 6(1):76–90, 1970.
[5] R. J. Carroll and D. Ruppert. On prediction and the power transformation family. Biometrika, 68(3):609–615, 1981.
[6] R. Doll, R. Peto, E. Hall, K. Wheatley, and R. Gray. Mortality in relation to consumption of alcohol: 13 years’ observations on male British doctors. BMJ, 309(6959):911–918, 1994.
[7] M. J. Egger. Power transformation to achieve symmetry in quantal bioassays. Technical Report 47, Stanford University, Division of Biostatistics, 1979.
[8] R. Fletcher. A new approach to variable metric algorithms. The Computer Journal, 13(3):317–322, Jan. 1970.