Personalized Two-sided Dose Interval
Chan Park, Guanhua Chen, Menggang Yu

TL;DR
This paper introduces a novel, theoretically justified method for estimating personalized two-sided dose intervals that are robust and outperform existing approaches, with applications in medicine and social sciences.
Contribution
It proposes a direct empirical risk minimization approach with a new loss function for personalized dose intervals, eliminating iterative procedures and ensuring statistical robustness.
Findings
Method outperforms existing approaches in simulations.
Loss function is doubly-robust to misspecification.
Applications include warfarin dosing and social program analysis.
Abstract
In fields such as medicine and social sciences, the goal of treatment is often to maintain the outcome of interest within a desirable range rather than to optimize its value. To achieve this, it may be more practical to recommend a treatment dose interval rather than a single fixed level for a study unit. Since individuals may respond differently to the same treatment level, the recommended dose interval should be personalized based on their unique characteristics. Iterative procedures have been proposed to jointly learn the lower and upper bounds of personalized dose intervals, but they lack theoretical justification. To fill this gap, we propose a method to learn personalized two-sided dose intervals based on empirical risk minimization using a novel loss function. The proposed loss function is designed to be well-defined over a tensor product function space, eliminating the need for…
| Estimator | Invalid PDI | MAE | MSE | Accuracy | F1 Score | MCC | Recall | Precision | Cohen’s kappa | |
| 0.70 | Ind-Para | 0.050 | 1.439 | 1.747 | 0.772 | 0.834 | 0.475 | 0.805 | 0.864 | 0.472 |
| Ind-RF | 0.073 | 1.358 | 2.016 | 0.754 | 0.816 | 0.461 | 0.766 | 0.873 | 0.452 | |
| Ind-Ens | 0.052 | 1.298 | 1.784 | 0.760 | 0.821 | 0.469 | 0.775 | 0.873 | 0.461 | |
| D-CW | 0.000 | 1.342 | 1.692 | 0.772 | 0.838 | 0.452 | 0.831 | 0.845 | 0.452 | |
| D-Joint | 0.000 | 1.283 | 1.502 | 0.783 | 0.845 | 0.485 | 0.833 | 0.857 | 0.484 | |
| 0.72 | Ind-Para | 0.065 | 1.413 | 1.670 | 0.761 | 0.822 | 0.466 | 0.780 | 0.869 | 0.460 |
| Ind-RF | 0.091 | 1.356 | 1.989 | 0.740 | 0.802 | 0.448 | 0.739 | 0.876 | 0.434 | |
| Ind-Ens | 0.066 | 1.290 | 1.752 | 0.748 | 0.809 | 0.458 | 0.751 | 0.877 | 0.446 | |
| D-CW | 0.000 | 1.303 | 1.573 | 0.763 | 0.829 | 0.448 | 0.808 | 0.851 | 0.446 | |
| D-Joint | 0.000 | 1.247 | 1.387 | 0.776 | 0.837 | 0.483 | 0.811 | 0.866 | 0.481 | |
| 0.74 | Ind-Para | 0.084 | 1.392 | 1.607 | 0.748 | 0.809 | 0.455 | 0.753 | 0.874 | 0.444 |
| Ind-RF | 0.117 | 1.362 | 1.976 | 0.723 | 0.783 | 0.429 | 0.707 | 0.879 | 0.410 | |
| Ind-Ens | 0.084 | 1.273 | 1.654 | 0.733 | 0.794 | 0.443 | 0.724 | 0.879 | 0.426 | |
| D-CW | 0.000 | 1.248 | 1.435 | 0.754 | 0.819 | 0.442 | 0.785 | 0.857 | 0.438 | |
| D-Joint | 0.000 | 1.216 | 1.316 | 0.763 | 0.825 | 0.470 | 0.784 | 0.870 | 0.464 | |
| 0.76 | Ind-Para | 0.105 | 1.377 | 1.559 | 0.732 | 0.792 | 0.440 | 0.721 | 0.879 | 0.423 |
| Ind-RF | 0.144 | 1.376 | 1.988 | 0.703 | 0.763 | 0.412 | 0.672 | 0.882 | 0.386 | |
| Ind-Ens | 0.108 | 1.283 | 1.674 | 0.718 | 0.778 | 0.430 | 0.694 | 0.884 | 0.408 | |
| D-CW | 0.000 | 1.200 | 1.311 | 0.743 | 0.807 | 0.435 | 0.757 | 0.863 | 0.427 | |
| D-Joint | 0.000 | 1.150 | 1.153 | 0.753 | 0.813 | 0.463 | 0.759 | 0.876 | 0.452 | |
| 0.78 | Ind-Para | 0.133 | 1.369 | 1.530 | 0.712 | 0.772 | 0.423 | 0.685 | 0.883 | 0.399 |
| Ind-RF | 0.182 | 1.403 | 2.009 | 0.682 | 0.739 | 0.394 | 0.633 | 0.886 | 0.360 | |
| Ind-Ens | 0.136 | 1.290 | 1.656 | 0.697 | 0.755 | 0.411 | 0.658 | 0.887 | 0.381 | |
| D-CW | 0.000 | 1.161 | 1.203 | 0.726 | 0.790 | 0.421 | 0.724 | 0.868 | 0.407 | |
| D-Joint | 0.000 | 1.132 | 1.116 | 0.734 | 0.794 | 0.445 | 0.723 | 0.880 | 0.428 | |
| 0.80 | Ind-Para | 0.166 | 1.369 | 1.518 | 0.689 | 0.746 | 0.403 | 0.644 | 0.887 | 0.370 |
| Ind-RF | 0.223 | 1.434 | 2.061 | 0.655 | 0.707 | 0.372 | 0.587 | 0.890 | 0.328 | |
| Ind-Ens | 0.173 | 1.306 | 1.655 | 0.675 | 0.729 | 0.391 | 0.618 | 0.890 | 0.353 | |
| D-CW | 0.000 | 1.132 | 1.107 | 0.709 | 0.771 | 0.405 | 0.692 | 0.871 | 0.385 | |
| D-Joint | 0.000 | 1.084 | 0.989 | 0.716 | 0.776 | 0.427 | 0.693 | 0.883 | 0.404 |
| Estimator | Invalid PDI (%) | Accuracy | F1 Score | MCC | Recall | Precision | Cohen’s kappa | |
| 0.60 | (Ind-Para) | 0.002 | 0.760 | 0.858 | 0.149 | 0.938 | 0.789 | 0.126 |
| (D-Joint) | 0.000 | 0.767 | 0.864 | 0.151 | 0.957 | 0.787 | 0.116 | |
| 0.62 | (Ind-Para) | 0.002 | 0.756 | 0.854 | 0.160 | 0.924 | 0.793 | 0.143 |
| (D-Joint) | 0.000 | 0.765 | 0.861 | 0.166 | 0.945 | 0.791 | 0.138 | |
| 0.64 | (Ind-Para) | 0.005 | 0.750 | 0.848 | 0.169 | 0.906 | 0.797 | 0.158 |
| (D-Joint) | 0.000 | 0.758 | 0.855 | 0.171 | 0.924 | 0.795 | 0.154 | |
| 0.66 | (Ind-Para) | 0.007 | 0.741 | 0.840 | 0.173 | 0.884 | 0.801 | 0.167 |
| (D-Joint) | 0.000 | 0.748 | 0.846 | 0.174 | 0.900 | 0.799 | 0.165 | |
| 0.68 | (Ind-Para) | 0.015 | 0.727 | 0.828 | 0.174 | 0.852 | 0.805 | 0.172 |
| (D-Joint) | 0.000 | 0.739 | 0.838 | 0.178 | 0.875 | 0.803 | 0.174 | |
| 0.70 | (Ind-Para) | 0.022 | 0.710 | 0.812 | 0.175 | 0.814 | 0.810 | 0.175 |
| (D-Joint) | 0.000 | 0.725 | 0.826 | 0.182 | 0.844 | 0.808 | 0.181 |
| Gender | Age | ||||||
|---|---|---|---|---|---|---|---|
| Under 29 | 30-39 | 40-49 | 50-59 | 60-69 | 70-79 | Over 80 | |
| Female | (0.085,0.787) | (0.094,0.779) | (0.101,0.771) | (0.110,0.762) | (0.118,0.754) | (0.129,0.743) | (0.136,0.737) |
| Male | (0.109,0.764) | (0.117,0.755) | (0.128,0.745) | (0.135,0.737) | (0.144,0.727) | (0.155,0.716) | (0.167,0.705) |
| Lower end of | 0.8 | 1.2 | 1.25 | 1.3 | 1.5 | 1.7 | 1.75 | 1.8 | ||||
| Upper end of | 1.8 | 2.2 | 2.25 | 2.3 | 2.5 | 2.7 | 2.8 | 3.3 | 2.75 | 2.2 | 2.5 | 2.8 |
| Non-Asian | 1 | 1 | 6 | 1 | 4 | 184 | 0 | 0 | 4 | 1 | 3 | 5 |
| Asian | 0 | 0 | 0 | 0 | 631 | 1 | 236 | 249 | 0 | 0 | 0 | 0 |
| Lower end of | 2 | 2.1 | 2.2 | 2.25 | 2.3 | 2.5 | 2.75 | 3 | 3 | |||
| Upper end of | 3 | 3.5 | 3.1 | 3.2 | 3.25 | 3.3 | 3 | 3.5 | 3.75 | 3.5 | 4 | |
| Non-Asian | 2805 | 435 | 1 | 11 | 1 | 3 | 2 | 335 | 1 | 2 | 10 | |
| Asian | 286 | 1 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods in Clinical Trials · Statistical Methods and Inference · Advanced Statistical Process Monitoring
Personalized Two-sided Dose Interval
Chan Park1, Guanhua Chen2, Menggang Yu2
1Department of Statistics and Data Science, University of Pennsylvania
2Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
Abstract
In many clinical practices, the goal of medical interventions is often to maintain clinical measures within a desirable range, rather than maximize or minimize their values. To achieve this, it may be more practical to recommend a therapeutic dose interval rather than a single dose for a given patient. Since each patient may respond differently to the same dose of medication, the therapeutic dose interval should be personalized based on the unique characteristics of each patient. This problem is challenging, as it requires jointly learning the lower and upper bound functions for personalized dose intervals. Currently, no methods are available to address this challenge without resorting to iterative procedures to constrain the monotonicity between lower and upper bound functions. However, these iterative procedures do not have a theoretical justification. To fill this gap, we propose a method to learn personalized two-sided dose intervals based on empirical risk minimization associated with a novel loss function. The proposed loss function is designed to be well-defined over a tensor product of function spaces, which eliminates the need for iterative procedures. In addition, our loss function is doubly-robust to the misspecification of nuisance functions. We establish statistical properties of the estimated dose interval functions in terms of excess risk by leveraging approximation theory and reproducing kernel Hilbert space theory. Our simulation study and a real-world application of personalized warfarin dose intervals show that our proposed direct estimation method outperforms competing methods, including indirect regression-based methods.
*Keywords: * Empirical risk minimization, Excess risk, Personalized treatment, Therapeutic dose, Warfarin
1 Introduction
Personalized treatment rules, also known as individualized treatment rules or treatment policies, are strategies for assigning treatments, such as medication or social policy, to individuals based on their characteristics. These strategies have gained attention in the field of medicine due to their superior performance compared to traditional “one-size-fits-all” treatment assignment approaches. For example, many studies have shown that personalized warfarin dosing strategies, which take into account patient characteristics, including pharmacogenetics, are more effective than standard dosing based on an empirical protocol (Anderson et al., 2007; Stergiopoulos and Brown, 2014). Most research on personalized dosing rules, including the warfarin example, produces a single recommended dose level for each patient (Laber and Zhao, 2015; Chen et al., 2016; Kallus and Zhou, 2018; Zhu et al., 2020; Zhou et al., 2021; Schulz and Moodie, 2021; Hua et al., 2022; Wang and Wang, 2023) with the goal of optimizing the individual’s outcome under that rule. However, single-level dose recommendations may be overly ambitious or insufficient since in many real-world applications, the primary goal of treatment is not to achieve a specific value, but rather to have the patient’s health status measurement fall within a favorable range. For instance, warfarin should be prescribed to keep a patient’s international normalized ratio, a measure of the time for the blood to clot, within a desired range, usually between 2 and 3 (January et al., 2014). Similarly, major medical associations recommend targeting appropriate ranges for chronic disease management metrics such as hemoglobin level and blood pressure (American Association of Clinical Endocrinologists, 2019; Flack and Adekola, 2020).
In practice, the relationship between the amount of treatment and the likelihood of favorable outcomes, known as the dose-probability curve, can be biphasic, resulting in an inverted U-shaped curve across dose levels; this phenomenon is called hormesis (Mattson, 2008). These hormetic dosage responses are common in many real-world applications, including the above warfarin example (Blann et al., 2003), clinical trials, public health, biology, toxicology, and medicine (Calabrese and Mattson, 2017). Under hormesis, to have a probability of favorable outcomes greater than a certain level, the treatment dose must fall within a two-sided interval, and consequently determining the optimal personalized therapeutic dose range is challenging, as it requires learning both the lower and upper bounds of the patient-specific dose interval.
However, existing methods designed for optimal single-level dose rules, which rely on the assumption that the dose-probability curve is unimodal (Chen et al., 2016), or for learning one-sided intervals, which rely on the assumption that the dose-probability curve is monotonic (Chen et al., 2022; Park et al., 2021), are not suitable for learning personalized therapeutic dose range under the hormetic dose-probability curve. This is because these methods are suitable for only estimating the lower or upper bounds of two-sided intervals by treating the other bound as fixed. Consequently, to use their approaches, one must resort to an iterative procedure where one bound function is found at a time while keeping the other fixed; see Remark 1 for details. Thus, implementing these approaches can be computationally challenging and, even if successful, they may suffer from convergence issues. In addition, due to the monotonicity constraint on the lower and upper bound estimators imposed in each iteration, theoretical properties of the resulting estimators are difficult to characterize. In particular, theoretical tools for studying the approximation error of a general pair of functions using functions within unrestricted function spaces, such as product reproducing kernel Hilbert spaces, are no longer applicable to a function space with a restricted domain. Furthermore, the method proposed by Cai et al. (2023) can produce two-sided dose interval recommendations, but its goal is not to identify the personalized therapeutic dose range. Instead, the target estimand is an interval that maximizes the interval-wise dose-probability curve, which is the expected probability of observing desired outcomes if the treatment is assigned with a user-specified treatment-distributing policy over a given interval. Their method is motivated by detecting multiple change points in a dose-probability curve, which is different from our motivation.
This paper proposes a new approach to estimate personalized therapeutic dose ranges under the hormetic dose-probability curve. In brief, we solve an empirical risk minimization task using a novel loss function that can accommodate both monotonic and nonmonotonic pairs; here, a monotonic pair refers to a pair of which a lower bound argument is less than or equal to the upper bound argument for all covariate values, while a nonmonotonic pair refers to a pair which violates the monotonic relationship for some covariate values. The proposed function enables simultaneous estimation of the lower and upper bounds of the dose interval without requiring a strictly monotonic relationship between the bounds. The loss function assigns higher values to non-monotonic intervals than to monotonic intervals, ensuring that the estimated dose interval is monotonic when evaluated in the training data. The loss function also has a so-called double robustness property (Scharfstein et al., 1999; Lunceford and Davidian, 2004; Bang and Robins, 2005), making the estimated dose interval robust to the misspecification of the dose-probability curve and the propensity score of treatment. Our simulation study and the warfarin dosing application demonstrate that our methodology is superior to a so-called indirect method discussed in Section 3.1.
2 Preliminary
Let the subscript denote the th study unit. For each unit , we observe which are independent and identically distributed realizations from a distribution . Here, is a -dimensional pretreatment covariate with support , is the treatment dose taking its value in the interval , and is the observed outcome/reward. For simplicity, we assume that the treatment is transformed so that its range is the unit interval. To define the dose assignment rule, we use the potential outcome (Rubin, 1974). Specifically, let denote the potential outcome/reward when the treatment dose is . In what follows, we suppress the subscript unless necessary.
Let be a decision space, a set of dose assignment rules, specified by an investigator, where is a dose assignment rule that maps , the characteristics of a patient, to , a dose level. In the context of learning two-sided dose intervals, we define \mathcal{F}^{\otimes 2}=\big{\{}(f_{L},f_{U})\,|\,f_{L},f_{U}\in\mathcal{F}\big{\}}, a set of pairs of dose levels, as a decision space. Let be the collection of pairs of functions such that the lower bound is smaller than or equal to the upper bound for all . Let be the collection of pairs of functions such that the lower bound is larger than the upper bound for some .
The interval rules can be more useful than the single-level dose assignment rules when one or more of the following scenarios happen: (i) assigning any value in an interval yields almost the same outcome as assigning the optimal treatment rule; (ii) finding the optimal treatment rule is inefficient; (iii) an investigator has subject matter knowledge about the shape of the dose-probability curve. In particular, we consider a -probability dose interval (PDI) introduced in Li (2018) and Chen et al. (2022):
[TABLE]
In words, if a patient with characteristic is treated with a dose level that belongs to , the patient’s outcome belongs to the desired range with a probability greater than . Here, plays an analogous role as a confidence level in hypothesis testing because the corresponding PDI admits type I error with probability at most . As a result, the selection of depends on the desired confidence level for a specific scientific question in view. In order to ensure meaningful PDIs, it is important not to set too high, such as , as this may result in no PDI satisfying the condition, especially when \text{pr}\big{\{}Y^{(a)}\in\mathcal{T}\,|\,{X}\big{\}} is small. The selection of depends on the application and should ideally be guided by domain expertise. If such expertise is not available, we suggest choosing an or using estimated outcome regression to help determine . We remark that and may depend on , i.e., the probability threshold and the desired range can also be personalized. Hereafter, we omit in and for notational brevity. In addition, let be the indicator of whether an individual’s outcome belongs to the desired range . The optimal PDI, denoted by , is the PDI that has the longest length and belongs to . Formally, for a fixed , and satisfy for any and any and satisfying (1).
Lastly, we introduce additional notation used throughout. For random variables , , and , let U\rotatebox[origin={c}]{90.0}{\models}V\,|\,W denote the conditional independence of and given . We use and to denote stochastic boundedness and convergence in probability. For two non-negative sequences and , let denote for some constant , and let denote and . Let be the length vector of ones. Lastly, we denote and , satisfying the decomposition .
Next, in order to establish identification of the PDI, we make the following assumptions.
Assumption 1** (Stable Unit Value Treatment Assumption; Rubin (1978)).**
almost surely;
Assumption 2** (Unconfoundedness).**
We have Y^{(a)}\rotatebox[origin={c}]{90.0}{\models}A\,|\,{X} for any ;
Assumption 3** (Positivity).**
The generalized propensity score (Imbens, 2000) is bounded below by a constant , i.e., for all ;
We refer readers to Hirano and Imbens (2004) and Hernán and Robins (2020) for substantive discussions on these assumptions. Under the assumptions, the probability of obtaining favorable outcomes given covariates, i.e., \text{pr}\big{\{}Y^{(a)}\in\mathcal{T}\,|\,{X}\big{\}}, can be identified from the observed data as \mu^{*}(a,{X})=\text{pr}\big{(}R=1\,|\,A=a,{X}\big{)} which is the dose-probability curve of for a unit with characteristic .
To guarantee the existence of and , we make the following assumptions on :
Assumption 4** (Hormesis).**
For any , there exists a constant such that , , and for some ;
Assumption 5** (Smoothness and Connected PDI).**
For any , is Lipschitz continuous with respect to and a level set \big{\{}a\,|\,\mu^{*}(a,{X})\geq\alpha\big{\}} is a closed interval.
Assumption 4 states that moderate dose levels are more likely to result in a favorable outcome compared to low and high doses, also known as hormesis (Mattson, 2008). This phenomenon can be found in the inverted U-shaped dose-probability curve; see Section 6 for details on warfarin and Calabrese (2008) for additional examples of drugs that have hormetic dose-probability curves. If Assumption 4 is violated at , it may imply that (i) patients having their covariates as always have favorable or unfavorable outcomes regardless of the treatment dose level, i.e., or for all , and/or that (ii) treatment is not hormetic, say increases monotonically with . The first part of Assumption 5 states that is smooth with respect to dose, while the second part states that the therapeutic dose range is a connected interval. Assumptions 4 and 5 guarantee the existence of the optimal PDI, but there are other sufficient conditions exist; see Section A.1 of the Supplementary Material for details.
3 Methodology
3.1 Indirect Method
We present a simple two-step approach to estimate the optimal PDI, often referred to as the indirect method (Moodie et al., 2012; Chakraborty and Moodie, 2013). In the first step, we estimate the dose-probability curve based on a posited outcome model. For instance, one can fit a logistic regression model of on . Alternatively, flexible machine learning classifiers can be used, such as random forest (Breiman, 2001), gradient boosting (Friedman, 2001), neural net (Ripley, 1994), support vector machines (Steinwart and Christmann, 2008), and many others; see Hastie et al. (2009) for additional examples. In the second step, we find the range that satisfies the PDI criterion in (1), which can be done by a grid search over the unit square .
Although the indirect method is simple to implement, it can have notable drawbacks in terms of finite-sample performance as demonstrated in the simulation study and application in Sections 5 and 6. Additionally, the estimated PDI using indirect methods may not be available in a closed form unless has a very simple form. Furthermore, since the indirect rule only uses the dose-probability curve, it is more sensitive to model misspecification than methods using both the dose-probability curve and the generalized propensity score. This motivates our direct approach presented in the following Sections.
3.2 Construction of the Loss and Risk Functions
We propose a method that directly estimates the optimal PDI from empirical risk minimization (ERM). An essential part of the proposed direct method is to design a risk function that is minimized at . If the risk function does not have as a minimizer, the estimator obtained from the ERM does not converge to in probability. We first start by reviewing the loss function in Chen et al. (2022) that has a so-called inverse probability-weighted (IPW) form defined over :
[TABLE]
The loss function is determined by the following four arguments. The first argument, , is the observed data. The second and third arguments, , form a PDI candidate with the monotonicity condition, i.e., . The last argument is a user-specified propensity score model. The IPW loss function above penalizes a study unit for either of the following cases: (i) the outcome does not belong to the desired range, i.e., , even though the dose level belongs to the PDI candidate, or (ii) the outcome belongs to the desired range, i.e., , even though the dose level lies outside of the PDI candidate. These errors can be seen as false positive and false negative, respectively. The coefficients and assign different weights to these two errors. When the propensity score model is correctly specified, the IPW loss function can be used to identify as {\rm E}\big{\{}\mathcal{L}_{\text{IPW}}({O};f_{L},f_{U};e^{*})\} is minimized at .
One drawback of is its sensitivity to misspecification of the propensity score model. Therefore, we can consider the following augmented IPW (AIPW) loss function over that is robust to the propensity score model misspecification:
[TABLE]
Compared with , has one more argument which is a user-specified model for the dose-probability curve. The terms weighted by and those weighted by can be seen as functions that penalize false positive and false negative errors, respectively. These terms resemble the efficient influence function of the average treatment effect (Hahn, 1998), and, not surprisingly, it has the doubly-robust property in the following manner: the AIPW loss function is minimized at the optimal PDI so long as or , but not necessarily both, is correctly specified; see Lemma 3.1 for a related discussion.
The AIPW loss function has a limitation in that it is defined solely over monotonic intervals , i.e., PDI candidates in are excluded from the loss function. Consequently, if is used, the optimization domain, denoted by , must adhere to the monotonicity requirement between the lower and upper bound candidates, which implies that should be a subset of . Moreover, should be rich enough to accurately approximate any function in . These two observations indicate that an ideal choice for would be a rich subset of . However, achieving both of these goals simultaneously is a significant challenge for practitioners for the following reasons. First, to maintain the monotonicity between the lower and upper bounds, iterative procedures are typically utilized in practice. By doing so, may be restricted in a specific way, casting doubt on its suitability as an approximation subspace of . On the other hand, if one naively chooses rich product function spaces, such as a product Reproducing Kernel Hilbert Space (RKHS) as , it would cause problems in the optimization procedure because the AIPW loss function is not compatible with nonmonotonic intervals.
The limitation of the AIPW loss function originates from its constrained domain. One way to resolve the limitation is to design a loss function that is well-defined over and is minimized at the optimal PDI . We can then estimate over a product function space that approximates without the need to restrict the optimization domain. To this end, we propose a new doubly-robust loss function , which is defined below:
[TABLE]
where is a sufficiently large constant such as C_{\mathcal{L}}=2\sup_{{O},f_{L},f_{U}}\big{|}\mathcal{L}^{(1)}({O},f_{L},f_{U};\mu,e)\big{|}. Clearly, allows nonmonotonic pairs, but the corresponding loss value is higher than that of monotonic pairs. Additionally, from simple algebra, we find that the difference between and does not depend on , indicating that the minimizers of and are the same; see Section A.2 of the Supplementary Material for details. In most observational studies, the true nuisance components are unknown and must be estimated from the observed data; see Section 3.3 for details.
The loss function has advantageous properties, which are summarized in the following lemma.
Lemma 3.1**.**
Suppose that Assumptions 1–5 hold. Let be the risk function associated with , i.e., \mathcal{R}(f_{L},f_{U};\mu,e)={\rm E}\big{\{}\mathcal{L}({O},f_{L},f_{U};\mu,e)\big{\}}. Then, the optimal PDI is the minimizer of the risk function with the true nuisance functions, i.e.,
[TABLE]
Additionally, is the minimizer of (8) so long as either or is correctly specified, i.e., (8) holds for and where is bounded in the unit interval and satisfies Assumption 3.
Lemma 3.1 states that the optimal PDI is achieved by minimizing the risk function when both nuisance components are correctly specified. This means that the loss function is Fisher consistent in detecting the optimal PDI. Additionally, if either the dose-probability curve or the propensity score is correctly specified, the loss function remains Fisher consistent. This property is referred to as a doubly-robust property (Scharfstein et al., 1999; Lunceford and Davidian, 2004; Bang and Robins, 2005) and is further discussed in Theorem 4.2.
3.3 Empirical Risk Minimization Via the Difference of Convex Functions Algorithm
Despite its theoretical validity, the loss function in (4) itself is difficult to use in ERM because it is not convex or smooth as induced by the indicator functions and . To address the non-smoothness, we design a smooth surrogate loss function of the loss function by replacing the indicator functions with truncated hinge-type functions as follows:
[TABLE]
where and are truncated hinge-type functions:
[TABLE]
Here, is a bandwidth parameter where and converge to the indicator functions as converges to zero. Importantly, the truncated hinge-type functions used in ensure that the loss function is bounded, making it robust to noisy training data (Wu and Liu, 2007; Chen et al., 2016). Furthermore, the ERM estimator using the surrogate indicator function converges to the optimal PDI when is chosen appropriately; see Section 4 for details.
Using , we obtain a PDI estimator from the ERM with the estimated nuisance components. Let be a user-specified function space, such as linear functions, splines, RKHS, or deep neural networks. For the remainder of the paper, we focus on the case where is an RKHS mainly due to its widely recognized approximation capability and computational efficiency. Specifically, let be an RKHS with the radial basis kernel function k({x},{x}^{\prime})=\exp\big{(}-\|{x}-{x}^{\prime}\|_{2}^{2}/\gamma^{2}\big{)} where is a bandwidth parameter, but other universal kernels can be used. We solve the ERM over the tensor product RKHS . From the representer theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001), the solution to the ERM, denoted by , is represented as linear combinations of the kernel-transformed covariates, i.e., and where the coefficients , , , and are obtained from an ERM over an Euclidean space below:
[TABLE]
Here, is the gram matrix and is the th column of .
Although (3.3) provides the closed-form solution of the PDI estimator, it is still difficult to solve it because of the nonconvexity of the loss function. While other nonconvex optimization methods can be applied, we use the Difference of Convex Functions algorithm (An and Tao, 1997) due to its ready adaptation to our problem. Briefly, we decompose the surrogate loss function into a difference between two convex functions, say . We then iteratively solve a sequence of convex optimization problems using and ; see Section A.3 of the Supplementary Material for details. In addition, in Section A.4 of the Supplementary Materials, we discuss practical considerations such as (i) choices of hyperparameters and initial points for solving (3.3), and (ii) remedies to address invalid intervals.
Remark 1**.**
Instead of learning and jointly, one can use an iterative procedure to optimize the lower or upper bound function at a time while the other is treated as fixed. In particular, the procedure can be summarized as follows: (i) given an estimator of the lower bound at iteration , denoted by , the upper bound is only estimated, denoted by , with the restriction ; (ii) given the new upper bound estimator , the lower bound is estimated, denoted by , with the restriction ; and (iii) repeat (i) and (ii) until both bound estimators converge. The procedure can be viewed as an extension of the blockwise optimization method, also known as the nonlinear Gauss-Seidel method; see Wright (2015) for details. However, we argue that the iterative procedure is suboptimal for several reasons. First, there is no guarantee of its convergence and it faces a theoretical issue due to the restrictions on the PDI estimators, i.e., and . These constraints restrict the range of the PDI estimators in a complex way, making the optimization domain smaller than the unrestricted product RKHS . Consequently, the approximation theory established under unrestricted RKHSes may not be applicable to the estimator obtained from this iterative procedure and the restricted RKHS. On the other hand, the ERM in (3.3) simultaneously uses both lower and upper bound estimators without any restrictions over the candidates, making the computation simpler and the RKHS theory applicable to our setting at the expense of more complicated, yet closed-form surrogate loss function.
Remark 2**.**
One may restrict the relationship between and so that the width of the PDI is constant for all , i.e., for a non-negative constant . The estimators of the PDI are represented as and where the coefficients , , and are obtained from the following ERM:
[TABLE]
While this constant-width PDI is easier to learn than the proposed method, our findings indicate that this constant-width PDI estimator did not perform as effectively as the proposed approach; see Section 5 for details. Furthermore, in the context of the warfarin dataset analyzed in Section 6, the estimated PDI exhibited significant variations with respect to age and gender in the warfarin dataset; see Table 3 for details. This suggests that the constant-width PDI model may be unreasonable for numerous real-world applications including the warfarin dataset.
We conclude the section by presenting some methods for estimating nuisance functions and . First, as assumed in Assumption 4, the dose-probability curve has a hormetic form, so it is desirable to obtain an estimator of that maintains the same form. Some examples of estimation techniques are (semi-)parametric models with a quadratic term of the dose level, and carefully designed nonparametric methods such as multidimensional monotone Bayesian additive regression trees (Chipman et al., 2022). Second, the generalized propensity score is a conditional density due to the continuous nature of , which is often regarded as a statistically more difficult object than the propensity score under discrete treatment. Again, one may use (semi-)parametric models to estimate the conditional density of given or nonparametric methods (Li and Racine, 2008; Zhu et al., 2015; Schuster et al., 2020). If nonparametric methods are used to estimate nuisance functions, we recommend using the cross-fitting procedure (Schick, 1986; Chernozhukov et al., 2018) to avoid the potential risk of having non-diminishing bias of PDI estimators; see Section A.6 of the Supplementary Material for details. Additionally, the cross-fitting procedure helps characterize the excess risk of the estimated PDI; see the next section for details. However, in practice, nonparametrically estimated nuisance functions can be unstable due to limited sample size and may yield little or no improvement compared to well-designed parametric methods. For this reason, in the simulation and warfarin application, we utilized parametrically estimated nuisance functions.
4 Theoretical Properties
We examine theoretical properties of the PDI estimator obtained from the ERM with cross-fitting. Let be the expectation operator that regards the split sample as given. Note that, for a generic function estimated from , denoted by , we have {\rm E}^{(-s)}\big{\{}\widehat{g}^{(-s)}({O})\big{\}}=\int\widehat{g}^{(-s)}({o})\,dP({o}). Similarly, we denote \mathcal{R}^{(-s)}(f_{L},f_{U};\mu^{(-s)},e^{(-s)}\big{)}={\rm E}^{(-s)}\big{\{}\mathcal{L}(f_{L},f_{U};\mu^{(-s)},e^{(-s)}\big{)}\big{\}} and analogously.
First, we show that using the surrogate indicator function is a reasonable choice when is small under an additional condition.
Lemma 4.1**.**
*Let be the PDI estimator obtained from the ERM with cross-fitting procedure. Suppose Assumptions 1–5 and regularity conditions (R1)–(R4) in Section B of the Supplementary Material hold. Then, we have the following results for some constant :
(Result 1) We have \big{|}\mathcal{R}_{\text{sur}}^{(-s)}(f_{L}^{*},f_{U}^{*};\widehat{\mu}^{(-s)},\widehat{e}^{(-s)})-\mathcal{R}^{(-s)}(f_{L}^{*},f_{U}^{*};\widehat{\mu}^{(-s)},\widehat{e}^{(-s)})\big{|}\leq c_{0}\epsilon;
(Result 2) Additionally, suppose that the following regularity condition (R5) holds:
(R5) For any sequence , there exists a finite non-negative number and satisfying {\rm E}^{(-s)}\big{[}\mathbbm{1}\big{\{}\widehat{f}_{L}^{(s)}({X})-\widehat{f}_{U}^{(s)}({X})\in(0,a_{N})\big{\}}\big{]}\leq a_{N}b_{0}N^{b}.
Then, we have \big{|}\mathcal{R}_{\text{sur}}^{(-s)}(\widehat{f}_{L}^{(s)},\widehat{f}_{U}^{(s)};\widehat{\mu}^{(-s)},\widehat{e}^{(-s)})-\mathcal{R}^{(-s)}(\widehat{f}_{L}^{(s)},\widehat{f}_{U}^{(s)};\widehat{\mu}^{(-s)},\widehat{e}^{(-s)})\big{|}\leq c_{0}\epsilon N^{b}.*
(Result 1) of Lemma 4.1 states that, by viewing the estimated nuisance components as fixed functions, the surrogate risk function evaluated at the optimal PDI is not far from the original risk function evaluated at the optimal PDI. (Result 2) of the Lemma states that a similar result is established for the PDI estimator obtained from the ERM under the additional condition (R5), which means that the probability of and being closer than a small bandwidth is proportional to the bandwidth with a factor . To better understand the condition, one can consider a derivative of the left-hand side with respect to at 0, which reduces the condition that the derivative value is upper bounded by a polynomial function . This implies that the probability density function of a random variable , which can be seen as the degree of nonmonotonicity, is effectively bounded by a polynomial function. As a consequence, the condition is very plausible in most settings.
Next, we present the result of the convergence of the estimated PDI in terms of the excess risk bound.
Theorem 4.2**.**
Let be the PDI estimator obtained from ERM with the cross-fitting procedure. Suppose that (i) regularity conditions in Lemma 4.1 hold; (ii) the optimal PDI bounds and belong to a Besov space with smoothness parameter , i.e., where is the modulus of continuity of order ; and (iii) there exist constants , , and satisfying E^{(-s)}\big{[}\big{\{}\widehat{\mu}^{(-s)}(A,X)-\mu^{*}(A,X)\big{\}}^{2}\big{]}\leq C\cdot N^{-2r_{\mu}}, and E^{(-s)}\big{[}\big{\{}\widehat{e}^{(-s)}(A\,|\,X)-e^{*}(A\,|\,X)\big{\}}^{2}\big{]}\leq C\cdot N^{-2r_{e}} with probability not less than where as . Then, for all fixed , , , and , we obtain the following result the with probability not less than :
[TABLE]
where constants do not depend on .
Theorem 4.2 states that the estimated PDI with the cross-fitting procedure converges to the optimal PDI in terms of the excess risk bound. This convergence is subject to several regularity conditions, including the boundedness of the estimated nuisance components and the density of , the smoothness of the optimal PDI, the condition outlined in Lemma 4.1, and certain properties of the surrogate loss functions. If the hyperparameters are chosen with specific rates, specifically , , and where is the parameter satisfying (R5) in Lemma 4.1, the excess risk bound has a rate of O_{P}\big{(}N^{-r_{e}-r_{\mu}}+N^{-\beta/(2\beta+d)}\big{)}, which has two leading terms. The first term is proportional to the product of the convergence rates of the nuisance components. Typically, this first term has a rate of because many machine learning methods can estimate each nuisance function at a rate of such as lasso (Belloni and Chernozhukov, 2011, 2013), random forests (Wager and Walther, 2016; Syrgkanis and Zampetakis, 2020), neural networks (Chen and White, 1999; Farrell et al., 2021), and boosting (Luo and Spindler, 2016). If at least one of the true nuisance functions were known, as in experimental settings where was known, the first term would be zero. The second term is from the approximation error of using the ERM over the Gaussian RKHS. When the optimal PDI is sufficiently smooth compared to the dimension of the covariate, i.e., , the second term approaches .
5 Simulation
We conducted a simulation study to assess the finite-sample performance of the proposed method. We generated patients with a 10-dimensional covariate , dose level , and outcome . We generated from , from , and from , where each component of the covariates was mutually independent. We then generated from N\big{(}\sum_{j=1}^{10}X_{j}/15+0.2,0.1\big{)}, and from N\big{(}\nu(A,{X}),0.25\big{)} where \nu(a,{x})=1.1-(2.5+10x_{1})\big{(}m_{1}-a\big{)}_{+}-(0.5+x_{7}+2x_{10})\big{(}a-m_{2}\big{)}_{+} with and . We chose the desired range as and the indicator was accordingly defined as R=\mathbbm{1}\big{(}Y\geq 0.75\big{)}. We considered the range of the probability level in .
Using the generated data, we first estimated the PDI based on the approach proposed in Sections 3.2 and 3.3. Specifically, we estimated the propensity score and the dose-probability curve using parametric models that are widely used in practice. For the propensity score, we fitted a linear regression model of regressed on positing a normal error and obtained the density estimate based on the regression model, i.e., where is the density of , and and are the estimated mean function and variance obtained from the regression model, respectively. We obtained the estimated dose-probability curve based on a logistic regression model of regressed on . For the surrogate loss function bandwidth parameter, we chose . For the kernel bandwidth parameter, we considered a value from . For the regularization parameter, we considered . For the initial points, we considered the internal division rule in Section A.4 of the Supplementary Material with the ratio . Lastly, we adopted the monotonicity-inducing regularization term in (13) of the Supplementary Material to decrease the chance of producing invalid bound estimates with the parameter ; note that when , the optimization problem reduces to (3.3). We chose the best set of hyperparameters based on 10-fold cross-validation. The described PDI estimator is referred to as (D-Joint).
For comparison, we considered the following four estimators. First, we implemented the constant-width PDI estimator in Remark 2, which is referred to as (D-CW). Second, we obtained the indirect PDI estimator by following Section 3.1 based on the dose-probability curve estimator used in (D-Joint); this indirect estimator is referred to as (Ind-Para). Furthermore, we obtained two additional indirect PDI estimators where the dose-probability curve was estimated by random forest (Breiman, 2001) and by an ensemble learner of random forest and gradient boosting (Friedman, 2001); these PDI estimators are referred to as (Ind-RF) and (Ind-Ens), respectively. For indirect methods, we performed grid search to find an interval that made the estimated outcome regression greater than the probability level , i.e., . If such an interval did not exist, we took the dose level maximizing the estimated dose-probability curve as the indirect rule recommendation, i.e., . We evaluated the performance of the PDI estimators obtained from the three indirect and the two direct methods using a test dataset having observations generated from the same distribution as the training data. We repeated the simulation 100 times.
To evaluate the performance of a PDI estimator , we considered the following criteria. First, we focused on the proportion in which a method in view does not produce a valid PDI interval. Second, we focused on the mean absolute error (MAE) and the mean squared error (MSE), which are defined by \text{MAE}=M^{-1}\sum_{i\in\mathcal{D}_{\text{test}}}\big{\{}|f_{L}^{*}({X}_{i})-\widehat{f}_{L}({X}_{i})|+|f_{U}^{*}({X}_{i})-\widehat{f}_{U}({X}_{i})|\big{\}} and \text{MSE}=M^{-1}\sum_{i\in\mathcal{D}_{\text{test}}}\big{[}\{f_{L}^{*}({X}_{i})-\widehat{f}_{L}({X}_{i})\}^{2}+\{f_{U}^{*}({X}_{i})-\widehat{f}_{U}({X}_{i})\}^{2}\big{]}, respectively. Lastly, we also evaluated the performance of the PDI estimators in terms of widely used classification performance measures. Specifically, for each PDI estimator, we defined the contingency table counts, true positives, true negatives, false positives, and false negatives. Using these contingency table counts, we calculated accuracy, F1 score, Matthew’s correlation coefficient (MCC) (Matthews, 1975), recall, precision, and Cohen’s kappa (Cohen, 1960); see Section A.7 of the Supplementary Material for details. A larger value of these classification measures indicates better performance with fewer type I and/or II errors.
The results of the simulation are summarized in Table 1. First, the two direct methods always generate valid PDI intervals regardless of the value of , while the three indirect methods may produce invalid PDI intervals. In particular, when , the proportion of invalid intervals is greater than 15% for all indirect methods. Second, in terms of the MAE and MSE, (D-Joint) uniformly outperforms the competing methods. Lastly, in terms of classification performance measures, (D-Joint) uniformly exhibits higher accuracy, F1 score, MCC, and Cohen’s kappa compared to the other methods. In terms of recall, (D-Joint) outperforms the three indirect methods although it is sometimes dominated by (D-CW). In terms of precision, (Ind-Ens) performs the best for all . However, it is crucial to consider that the MAE and MSE of (Ind-Ens) are larger than those of (D-Joint), suggesting that the precision measure may not be a proper evaluation criterion. Combining all numerical results, we conclude that the proposed direct method (D-Joint) outperforms the other methods by always producing valid intervals, resulting in a smaller bias and fewer type I and II errors.
6 Application: Warfarin Dosing
We applied our method to study the two-sided PDI of warfarin, an anti-coagulant medicine. Warfarin is a medicine for preventing harmful blood clots from forming and growing bigger. It is particularly useful for patients suffering from conditions that can cause fatal blood clots such as stroke and heart attack. However, excessive doses of warfarin can lead to severe complications, including life-threatening bleeding and tissue death. Therefore, it is essential to prescribe warfarin in a suitable therapeutic dose to ensure its benefits to patients. As recommended by the American Heart Association, warfarin should be prescribed to keep a patient’s international normalized ratio (INR) within a desired range, typically between 2 and 3 (January et al., 2014). The INR level increases with the warfarin dose, leading to a dose-probability curve with an inverted U-shaped pattern, reflecting the hormetic response (Blann et al., 2003).
We used the dataset presented in The International Warfarin Pharmacogenetics Consortium (2009) to estimate the PDI of warfarin. Specifically, each patient’s target INR range, the reported INR value, and the warfarin dose measured in mg/week were considered as the desired range , the observed outcome , and the observed dose level , respectively. Accordingly, indicates whether the INR of a patient was within the desired range. Furthermore, we included the following pharmacogenetic and clinical variables as pre-treatment covariates : gender, weight measured in kg, height measured in cm, discretized age with nine levels, and an indicator of whether a patient took Amiodarone. We restricted our analysis to Asians because the targeted INR ranges for Asians were significantly different from those for other races; see Section A.8 of the Supplementary Material for details. Lastly, we discarded two observations with outlying dose levels greater than 80mg/week to guarantee the positivity assumption 3. Consequently, we used 1407 patients in the analysis.
In our analysis, we implemented the proposed direct method, which corresponds to (D-Joint) in the simulation study, as follows. We split the sample into two sets, with patients in the training set and patients in the test set . We repeated the split times, and for each iteration, we used the training set to estimate the PDI, and the test set to evaluate the estimated PDI. Using the training set, we estimated the propensity score based on the linear regression model of on with normal error and the dose-probability curve based on the logistic regression model of on , , and , which are denoted by and , respectively. We note that was used in the propensity score model because it suggested better model diagnostics; see Section A.8 for details. We then estimate the PDI from the ERM (17) where the hyperparameters were chosen based on the same cross-validation procedure as in the simulation study except that we considered the range of . The probability level was chosen from . For comparison, we also implemented the indirect method using the same parametric model, which corresponds to (Ind-Para) in the simulation study.
In Table 2, we report the proportion of producing an invalid PDI and the classification performance measures that were used in the simulation. First, we find that the proportion of invalid PDI estimates occurs more often in the indirect method compared to the direct method, but the proportion of invalid estimates was negligible. Second, the classification performance measures generally suggest that the direct method yields more accurate estimates. Specifically, the direct method uniformly shows higher accuracy, F1 score, MCC, and recall measures. In terms of Cohen’s kappa, the direct method exhibits better performance at higher . On the other hand, the direct method performs worse in terms of precision; however, as discussed in the simulation, precision may not be a proper evaluation criterion. Combining all results from both simulation and data analysis, the direct method can be a valuable tool for practitioners seeking to determine the therapeutic dose of warfarin.
We conclude the data analysis by providing the summary of the PDI estimates obtained from the proposed direct method at . Table 3 summarizes the relationship between the PDI estimates and two covariates, age and gender. In terms of age, we find that the PDI interval is wider for younger patients and narrower for older patients. This result is consistent with previous medical studies which have established a negative correlation between age and the therapeutic dose range of warfarin (Khoury and Sheikh-Taha, 2014; Shendre et al., 2018). In terms of gender, we find that the PDI interval is wider for female patients, but the relationship between gender and the therapeutic warfarin dose range remains uncertain based on previous medical studies. Some studies did not find significant differences in the therapeutic warfarin dose range between male and female patients (Poli et al., 2009; Miao et al., 2007), while others suggested that gender is an important factor, especially for Asian patients (Choi et al., 2011; Liew et al., 2013). Our analysis, which only included Asian patients, seems to support the latter view. However, more research is needed to establish the role of gender in determining the therapeutic dose of warfarin.
7 Concluding Remarks
In this paper, we present a method for estimating the personalized two-sided PDI. To directly estimate the PDI from an ERM, we develop a loss function that is robust to misspecification of the nuisance components (the propensity score and the dose-probability curve), and design a computationally tractable surrogate loss function for obtaining PDI estimators over the product RKHS. We show that the excess risk bound of our estimated PDI converges to the true optimal PDI at a rate that depends on the estimation error of the nuisance parameters and the approximation error of the PDI using functions within the product RKHS. The simulation results show that our direct method can yield PDIs that are more likely to be valid and more accurate in terms of bias and classification performance measures when compared to indirect methods.
The proposed methodology has a limitation that it only considers a moderate number of covariates. The method can be extended to accommodate high-dimensional covariates in order to broaden its applicability and better suit diverse medical applications. We plan to pursue this direction in future research. We also note that our method relies on the hormesis, i.e., Assumption 4. This assumption restricts the shape of the dose-probability curve, but can be empirically verified in certain cases, such as the warfarin data example we provide. However, in cases where this assumption is difficult to verify due to insufficient observations or heterogeneous data sources, an investigator should be aware of this assumption and consider focusing on revealing the mechanism between the treatment and outcome through clinical trials.
Supplementary Material
Section A contains details of the main paper. Section B provides the regularity conditions assumed in Section 4 of the main paper. Section C contains the proofs of the lemmas and theorems in the main paper.
Appendix A Details of the Main Paper
A.1 Existence of the Optimal PDI
We discuss the existence of the optimal PDI, , under Assumptions 4 and 5. It is trivial that in 4 belongs to the level set \big{\{}a\,|\,\mu^{*}(a,{X})\geq\alpha\big{\}}, which is uniquely determined based on Assumption 5. We now show that \big{\{}a\,|\,\mu^{*}(a,{X})\geq\alpha\big{\}} is not a point, but an interval. Let be the Lipschitz constant of , i.e., \big{|}\mu^{*}(a_{1},{X})-\mu^{*}(a_{2},{X})\big{|}\leq L_{\mu}\big{|}a_{1}-a_{2}\big{|}. Since , we find and , indicating that [a_{X}-c_{\mu}/L_{\mu},a_{X}+c_{\mu}/L_{\mu}]\subset\big{\{}a\,|\,\mu^{*}(a,{X})\geq\alpha\big{\}}. As a result, we find
[TABLE]
indicating .
A.2 Details of the AIPW Loss Function
Recall that
[TABLE]
We first introduce the surrogate loss function of .
[TABLE]
Here \mathbbm{1}\big{\{}A\in[f_{L}({X}),f_{U}({X})]\big{\}} is replaced with \Psi_{\epsilon}\big{(}f_{L}({X}),A,f_{U}({X})\big{)}.
We find the two IPW terms in are
[TABLE]
where . The two outcome-integrated terms are
[TABLE]
where . Therefore, does not depend on .
A.3 Details of the Convex Decomposition of the Surrogate Loss Function and the DC Algorithm
We consider the convex decomposition of the surrogate loss function. First, we consider the convex decompositions of and :
[TABLE]
Note that and ; see Figure 1 for graphical illustrations of and and surrogates of these functions.
We introduce new functions. Let and be non-negative, non-decreasing, Lipschitz continuous functions in satisfying . Since is Lipschitz continuous, we can easily find such and . For , we additionally define and as follows:
[TABLE]
Using these functions, we define and as follows:
[TABLE]
and
[TABLE]
where is a constant that ensures the convexity; we recommend choosing a value for such that , e.g., . When is large enough, and are convex in . Consequently, we find the following functions and are convex decomposition of in (3.3), i.e., and are convex and satisfy where
[TABLE]
Using these functions, the DC algorithm can be implemented by following Algorithm 1.
A.4 Practical Considerations for Implementing the ERM
We discuss practical considerations for implementing the optimization problem (3.3). First, we need to choose the hyperparameters, namely the kernel bandwidth parameter , and the regularization parameter . We recommend using cross-validation to select the optimal hyperparameters, as outlined in Algorithm 2 in the Supplementary Material. Additionally, the solution to the DC algorithm depends on the initial number of coefficients. Consequently, greedy algorithms are not appropriate for searching the initial points when the dataset has a moderate to large number of observations. Therefore, we propose an alternative strategy as follows. Given the indirect rules for , we consider the following internal division points with the internal division ratio parameter :
[TABLE]
In other words, the points are homogeneous and are close to the average of the indirect rules at small . When , the internal division points for the lower bound are chosen as the average of the indirect rules for all study units. On the contrary, when , the internal division points are chosen as the indirect rules themselves. For the given internal division points and , and the gram matrix , we find the coefficient vectors satisfying
[TABLE]
That is, we find the coefficients that are associated with the internal division points. Similar to the other hyperparameters, we choose the most optimal internal division ratio parameters via cross-validation.
Lastly, we consider some remedies to address the invalid intervals; notably, the PDI estimate can be invalid when (i) the bound estimate does not lie over a proper dose range or (ii) the upper bound estimate is not larger than the lower bound estimate, i.e., non-monotonic bounds. For the first violation, we simply winsorize the estimates escaping the dose range; we remark that the winsorization does not affect its statistical property; see 4.2 for details. To avoid the second violation, in the ERM, we may introduce an additional regularization term penalizing when violations happen. For instance, in the optimization problem in (3.3), we may use the following having an additional regularization term:
[TABLE]
The new regularization term assigns a larger loss function value if the lower bound coefficient is larger than the upper bound coefficient. This helps to ensure that the PDI estimate stays within the specified dose range. Again, the regularization parameter can be determined through cross-validation. To further penalize any potential violations, we may also include additional regularization terms. Despite these measures, there may still be cases where violations occur in the test data. In such instances, we resort to the pointwise dose rule by taking the average of and . However, our simulations and data analysis show that the direct method with the coefficient-regularization term results in very few violations; see the simulation in Section 5 for empirical evidence.
A.5 Details of Cross-validation
We summarize the cross-validation procedure in Algorithm 2.
A.6 Details of Cross-fitting
Let \mathcal{D}=\big{\{}1,\ldots,N\big{\}} be the collection of indices of the entire data. Suppose we use the entire data (i) to estimate the nuisance functions, and (ii) to estimate the PDI by solving the ERM where we use the estimated nuisance function in (i). Then, each observation is used twice in (i) and (ii), and this makes the characterization of the theoretical properties of PDI estimators difficult, and may yield inconsistent PDI estimates unless the nuisance functions are estimated within function spaces of which complexity is well-controlled. To resolve these potential problems, we may use cross-fitting (Schick, 1986; Chernozhukov et al., 2018). Specifically, we split the entire data into non-overlapping split samples, use a subset of data to estimate the nuisance functions, and use the unused split sample to solve the ERM for estimating the PDI. Algorithm 3 below summarizes the cross-fitting procedure.
A.7 Details of the Simulation
First, we provide the details of contingency table counts below:
[TABLE]
Using these contingency table counts, accuracy, F1 score, and the Matthews correlation coefficient (MCC) (Matthews, 1975), recall, precision, and Cohen’s kappa (Cohen, 1960) are defined as follows:
[TABLE]
A.8 Details on the Data Analysis
We first present a graphical summary of the estimated dose-probability curve. Figure 2 shows the estimated dose-probability curve which has an inverted-U shape for almost all observations. Additionally, visual inspection guarantees the existence of the PDI for .
Next, we validate the propensity score model. Figure 3 shows the empirical distribution of the observed dose level and its log-transformed value , and the QQ plots of the residuals obtained from the parametric propensity score model where or is regressed on the covariates , respectively. The visual diagnosis suggests that the propensity score model using the log-transformed dose is better than that using the raw dose.
Lastly, we compare the desired ranges for Asians and those for other races. Table 4 shows the frequencies of for Asians and non-Asians who have a complete set of covariates. We find that the desired ranges for Asians are very different from those for non-Asians. Therefore, we only focus on the Asian subgroup to make homogeneous.
Appendix B Regularity Conditions
In this section, we provide the regularity conditions assumed in Section 4 of the main paper.
- (R1)
The estimated propensity score is lower bounded by a constant , i.e., for all . Additionally, the estimated dose curve is bounded in the unit interval, i.e., .
- (R2)
The density of , , is bounded above by a constant.
- (R3)
The optimal PDI bounds and belong to a Besov space with smoothness parameter , which is where is the modulus of continuity of order .
- (R4)
Let the surrogate loss function at the true and estimated nuisance components, i.e., or . Then, it satisfies the following conditions:
- (C1)
For all , there exists a constant satisfying .
- (C2)
is locally Lipschitz continuous with respect to for .
- (C3)
For all , we have where \ell^{W}=\ell\cdot\mathbbm{1}\big{\{}|\ell|\leq c_{0}\big{\}}+\text{sign}(\ell)c_{0}\cdot\mathbbm{1}\big{\{}c_{0}<|\ell|\big{\}}.
- (C4)
{\rm E}\big{[}\big{\{}L\big{(}{O},\widehat{f}_{L}^{W},\widehat{f}_{U}^{W}\big{)}-L\big{(}{O},f_{L}^{*},f_{U}^{*}\big{)}\big{\}}^{2}\big{]}\leq V\cdot\big{[}{\rm E}\big{\{}L\big{(}{O},{f}_{L}^{W},{f}_{U}^{W}\big{)}-L\big{(}{O},f_{L}^{*},f_{U}^{*}\big{)}\big{\}}\big{]}^{v} is satisfied for constant , , and for all .
- (R5)
For any sequence , there exists a finite non-negative number and satisfying {\rm E}^{(-s)}\big{[}\mathbbm{1}\big{\{}\widehat{f}_{L}^{(s)}({X})-\widehat{f}_{U}^{(s)}({X})\in(0,a_{N})\big{\}}\big{]}\leq a_{N}b_{0}N^{b}.
Assumption (R1) states that the estimated propensity score and dose-probability curve are well-defined. Assumption (R2) states that the covariate does not concentrate to a certain point. Assumption (R3) states that the optimal PDI is smooth with parameter . Assumption (R4) provides additional conditions on the surrogate loss function. With some algebra, we find that the surrogate loss function satisfies these conditions by (i) adding a sufficiently large constant to make non-negative; (ii) adding a very large constant for PDI candidates outside of the proper dose range . Then, we find each condition is satisfied as follows:
- (C1)
Trivial after adding a baseline constant.
- (C2)
The surrogate loss function is Lipschitz continuous for given due to its construction.
- (C3)
Trivial after adding a penalizing constant over .
- (C4)
This holds with and .
Lastly, Assumption (R5) is the same as the condition in Lemma 4.1, and we discussed the plausibility of the assumption there.
Appendix C Proof
C.1 Proof of Lemma 3.1
Let be the minimzer of the risk function . Based on the form of the loss function , the minimizer should satisfy for all . For a PDI candidate satisfying , we find
[TABLE]
If or , we have
[TABLE]
Therefore, the minimizer must have a form
[TABLE]
which is equivalent to the definition of the PDI, i.e., . This completes the proof.
C.2 Proof of Lemma 4.1
The proof is a part of the proof Theorem 4.2; see the derivation of (15) for details.
C.3 Proof of Theorem 4.2
We first decompose the difference of the two risk functions as follows.
[TABLE]
We bound , , and , respectively, in the following bullet points. As a consequence, we get the following result:
[TABLE]
In the rest of the proof, we obtain the rate of each term (i.e., (14), (15), and (16)):
**Upper bound of
**
For any measurable functions , the difference between the loss function at and is
[TABLE]
Note that for all , we have
[TABLE]
and
[TABLE]
As a consequence, is upper bounded as follows.
[TABLE]
The two inequalities are obtained by applying the Hölder’s inequality. In particular, we used the assumptions that the (estimated) propensity scores are away from zero. Similarly, (A2) is upper bounded by the same cross-product, i.e., (A2)\precsim\big{\|}\widehat{e}^{(-s)}-e^{*}\big{\|}_{P,2}\big{\|}\widehat{\mu}^{(-s)}-\mu^{*}\big{\|}_{P,2}. Consequently, we have a constant satisfying
[TABLE]
**Upper bound of
**
For any measurable functions , we consider the following three cases where (i) , (ii) , and (iii) .
— Case 1:
We find \Phi_{\epsilon}\big{(}f_{L}({X}),f_{U}({X})\big{)}=\mathbbm{1}\big{\{}f_{L}({X})\leq f_{U}({X})\big{\}}. Thus,
[TABLE]
Since is uniformly bounded under Assumption (R1), we have
[TABLE]
Given , we find
[TABLE]
which implies
[TABLE]
The inequality is from the positivity assumption 3, and the last equality is trivial. Combining the result, we get
[TABLE]
— Case 2:
We have since .
— Case 3:
From the range of the loss functions, we have
[TABLE]
Consequently,
[TABLE]
Combining three cases, we find
[TABLE]
At the true PDI, we have {\rm E}^{(-s)}\big{[}\mathbbm{1}\big{\{}f_{L}^{*}({X})-f_{U}^{*}({X})\in[0,\epsilon]\big{\}}\big{]}=0. At the estimated PDI, this term is assumed to be upper bounded by a term having a rate . As a consequence, we get an upper bound of for some constant :
[TABLE]
Note that this provides the proof of Lemma 4.1. Consequently, by choosing , we get
[TABLE]
**Upper bound of
**
The outline of the proof is similar to that of Theorem 2 of Chen et al. (2016). We remark that lemmas and theorems presented in Steinwart and Christmann (2008) and Eberts and Steinwart (2013) are essential for the proof. For notational brevity, let , R(\ell,u)={\rm E}^{(-s)}\big{\{}\mathcal{L}_{\text{sur}}({O},\ell,u;\widehat{\mu}^{(-s)},\widehat{e}^{(-s)})\big{\}}, and the expectation involving is calculated conditioning on the estimation split data, i.e., {\rm E}\big{\{}g(L)\big{\}}:={\rm E}^{(-s)}\big{\{}g(\mathcal{L}_{\text{sur}}(\cdot;\widehat{\mu}^{(-s)},\widehat{e}^{(-s)})\big{\}}. Let be the number of observations in the ERM dataset , which is proportional to .
Our ultimate goal is to show the following inequality holds
[TABLE]
with probability not less than where does not depend on . Then, taking and , we get the following asymptotic rate with probability not less than :
[TABLE]
Since the estimated nuisance functions are random (depending on ), we use -notation instead of -notation.
To show the result, we first present a formal result by extending Theorem 7.23 of Steinwart and Christmann (2008) to our setting.
Theorem C.1** **(Extension of Theorem 7.23. of Steinwart and
Christmann (2008)).
Let be a loss function having non-negative value. Also, let be a Gaussian RKHS with bandwidth parameter over . Consider the ERM of the form
\displaystyle\big{(}\widehat{f}_{L},\widehat{f}_{U}\big{)}=\operatorname*{arg\,min}_{(f_{L},f_{U})\in\mathcal{H}^{\otimes 2}}\bigg{[}\frac{1}{N}\sum_{i=1}^{N}L\big{(}{O}_{i},f_{L},f_{U}\big{)}+\lambda\Big{\{}\big{\|}f_{L}\big{\|}_{\mathcal{H}}^{2}+\lambda\big{\|}f_{U}\big{\|}_{\mathcal{H}}^{2}\Big{\}}\bigg{]}\ .
(17)
Let and be distributions of and , respectively. Furthermore, suppose regularity conditions (R2), (R3), and , (R5) hold. Assume that for fixed , there exist constant , and such that
\displaystyle{\rm E}_{\mathcal{O}\sim P_{{O}}^{|\mathcal{D}_{s}|}}\Big{[}e_{n}\big{(}\text{identity map}:\mathcal{H}^{\otimes 2}\rightarrow L_{2}^{\otimes 2}(\mu)\big{)}\Big{]}\leq\mathfrak{a}\cdot i^{-\frac{1}{2p}}\ ,\quad i\geq 1\ .
(18)
Finally, fix and a constant such that \big{\|}L(O,f_{L,0},f_{U,0})\big{\|}_{\infty}\leq B_{0}. Then, for all fixed , and , the ERM (17) using and satisfies
\displaystyle\lambda\big{\|}\widehat{f}_{L}\big{\|}_{\mathcal{H}}^{2}+\lambda\big{\|}\widehat{f}_{U}\big{\|}_{\mathcal{H}}^{2}+R\big{(}\widehat{f}_{L}^{W},\widehat{f}_{U}^{W}\big{)}-R\big{(}f_{L}^{*},f_{U}^{*}\big{)}
\displaystyle\leq 9\big{\{}\lambda\big{\|}f_{L,0}\big{\|}_{\mathcal{H}}^{2}+\lambda\big{\|}f_{U,0}\big{\|}_{\mathcal{H}}^{2}+R\big{(}f_{L,0},f_{U,0}\big{)}-R\big{(}f_{L}^{*},f_{U}^{*}\big{)}\big{\}}
\displaystyle\hskip 28.45274pt+K_{0}\bigg{\{}\frac{\mathfrak{a}^{2p}}{\lambda^{p}n}\bigg{\}}^{\frac{1}{2-p-v+vp}}+3\bigg{(}\frac{72V\tau}{n}\bigg{)}^{\frac{1}{2-v}}+\frac{15B_{0}\tau}{n}\ .
with probability not less than , where where is a constant only depending on , , , , and .
We establish Theorem C.1 following the proof of Theorem 7.23 of Steinwart and Christmann (2008). When , we find
[TABLE]
where is the empirical risk . Therefore, the claim holds whenever .
Next, we consider . Let be
[TABLE]
and we further define the following sets for :
[TABLE]
Let be the closed unit ball of . For , we have
[TABLE]
This implies also belong to .
To proceed, we adopt the approach used in the proof of Lemma 7.17 of Steinwart and Christmann (2008) to our setting. Given that , we have the relationship for (dyadic) entropy number :
[TABLE]
Now let be an -net of with respect to where
[TABLE]
See Section A.5.6 of Steinwart and Christmann (2008) for details on the dyadic entropy number. This means that, for , there exists an such that . Thus,
[TABLE]
Therefore, is an -net of . As a consequence,
[TABLE]
Taking expectation (only over the data used in ERM which is ), we have
[TABLE]
Here, .
The quantity (Term 1) can be further upper bounded using Theorem 7.34 of Steinwart and Christmann (2008), which we state below for completeness:
**Theorem 7.34. **(Entropy Numbers for Gaussian Kernels; Steinwart and Christmann (2008)) Let be a distribution on having tail exponent , and be the RKHS endowed with the Gaussian kernel with bandwidth . Then, for all and , there exists a constant such that
for all and .
Since is induces by the Gaussian kernel, we have
[TABLE]
From Lemma 6.21 and Excercise 6.8 of Steinwart and Christmann (2008), we have
[TABLE]
where is the -covering number of a space with respect to a metric ; see Definition 6.19 of Steinwart and Christmann (2008) for details. Therefore, to derive the upper bound of the right hand side (C.3), it suffices to characterize an upper bound of . In order to do so, let be the all continuous functions from with bounded -th derivative. We further introduce Theorem 2.7.1 of van der Vaart and Wellner (1996) and an extension of Lemma 7 of Haussler (1992) which are provided below for completeness:
Theorem 2.7.1. (van der Vaart and Wellner (1996)) Let be a bounded, convex subset of with nonempty interior. There exists a constant depending only on and such that
(21)
Extension of Lemma 7 (Haussler (1992))
(22)
We briefly show (22). Let be an -net of having elements. Let having by construction. It suffices to show that is an -net of . We arbitrarily take any function . Then, we can find such that , which implies
[TABLE]
Therefore, .
Combining equations (C.3)-(22), we establish that , which is essentially the same as the result of Theorems 6.26, 6.27, 7.33 of Steinwart and Christmann (2008). Therefore, the remainder of the proof of Theorem 7.34 of Steinwart and Christmann (2008) applies to our setting. Therefore, from Theorem 7.34 of Steinwart and Christmann (2008), we get
[TABLE]
for all and . which satisfies condition (18) with . This further implies the following result:
[TABLE]
The rest of the proof of Theorem C.1 is the same as that in Steinwart and Christmann (2008), which results in
[TABLE]
Since the above result holds for any , we can further bound by carefully choosing in a way presented in Eberts and Steinwart (2013). We first define a function as
[TABLE]
for and . If , we can define by convolving with as follows (Eberts and Steinwart, 2013).
[TABLE]
Next, we introduce Theorems 2.2 and 2.3 of Eberts and Steinwart (2013).
**Theorem 2.2. ** Let us fix some . Furthermore, assume that is a distribution on that has a Lebesgue density for some . Let be such that . Then, for , , and with , we have
\displaystyle\big{\|}Q_{r,\gamma}*f-f\big{\|}_{L_{q}(P_{X})}^{q}\leq C_{r,q}\cdot\big{\|}p_{X}\big{\|}_{L_{p}(\mathbbm{R}^{d})}\cdot\omega_{r,L_{qs}(\mathbbm{R}^{d})}^{q}(f,\gamma/2)
where is a constant only depending on and .
**Theorem 2.3. ** Let , be the RKHS of the Gaussian kernel with parameter , and be defined by (24) for a fixed . Then we have with
\displaystyle\big{\|}Q_{r,\gamma}*f\big{\|}_{\mathcal{H}}\leq\big{(}\gamma\sqrt{\pi}\big{)}^{-\frac{d}{2}}\big{(}2^{r}-1\big{)}\big{\|}f\big{\|}_{L_{2}(\mathbbm{R}^{d})}\ .
Moreover, if , we have |Q_{r,\gamma}*f|\leq(2^{r}-1)\big{\|}f\big{\|}_{L_{\infty}(\mathbbm{R}^{d})}.
As a result, we obtain
[TABLE]
The first equality is from the construction of . The first inequality is from Theorem 2.3 of Eberts and Steinwart (2013). The second inequality is from Lipschitz continuity of . The third inequality is from Theorem 2.2 of Eberts and Steinwart (2013) with and . The last inequality holds for some constant since implies from the definition of a Besov space.
Combining the results in (23) and (25), we have the following result.
[TABLE]
The latter inequality is from the fact that is proportional to . This completes the proof.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1American Association of Clinical Endocrinologists (2019) American Association of Clinical Endocrinologists (2019). Management of common comorbidities of diabetes.
- 2An and Tao (1997) An, L. T. H. and P. D. Tao (1997). Solving a class of linearly constrained indefinite quadratic problems by dc algorithms. Journal of Global Optimization 11 (3), 253–285.
- 3Anderson et al. (2007) Anderson, J. L., B. D. Horne, S. M. Stevens, A. S. Grove, S. Barton, Z. P. Nicholas, S. F. Kahn, H. T. May, K. M. Samuelson, J. B. Muhlestein, and J. F. Carlquist (2007). Randomized trial of genotype-guided versus standard warfarin dosing in patients initiating oral anticoagulation. Circulation 116 (22), 2563–2570.
- 4Bang and Robins (2005) Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61 (4), 962–973.
- 5Belloni and Chernozhukov (2011) Belloni, A. and V. Chernozhukov (2011). ℓ 1 subscript ℓ 1 \ell_{1} -penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39 (1), 82 – 130.
- 6Belloni and Chernozhukov (2013) Belloni, A. and V. Chernozhukov (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19 (2), 521 – 547.
- 7Blann et al. (2003) Blann, A. D., D. A. Fitzmaurice, and G. Y. H. Lip (2003). Anticoagulation in hospitals and general practice. BMJ 326 (7381), 153–156.
- 8Breiman (2001) Breiman, L. (2001). Random forests. Machine Learning 45 (1), 5–32.
