Calibrating multiplex serology for Helicobacter pylori
Emmanuelle A. Dankwa, Martyn Plummer, Daniel Chapman, Rima Jeske, Julia Butt, Michael Hill, Tim Waterboer, Iona Y. Millwood, Ling Yang, Christiana Kartsonaki

TL;DR
This study explores using classification algorithms to improve the accuracy of H. pylori infection detection through multiplex serology tests.
Contribution
A novel approach is proposed to calibrate multiplex serology tests using classification algorithms for H. pylori infection detection.
Findings
All tested models showed high discriminative ability with at least 95% AUC estimates.
There was no substantial difference in model performance between training and test data.
The study provides insights into the relative strengths of antigens in detecting H. pylori infection.
Abstract
Helicobacter pylori (H. pylori) is a bacterium that colonizes the stomach and is a major risk factor for gastric cancer, with an estimated 89% of non-cardia gastric cancer cases worldwide attributable to H. pylori. Prospective studies provide reliable evidence for quantifying the association between gastric cancer and H. pylori, as they circumvent the risk of a false negative due to possible reduction in antibody levels before cancer development. In a large-scale prospective study within the China Kadoorie Biobank, H. pylori infection is being analysed as a risk factor for gastric cancer. The presence of infection is typically determined by serological tests. The immunoblot test, although well established, is more labour intensive and uses a larger amount of plasma than the alternative high-throughput multiplex serology test. Immunoblot outputs a binary positive/negative serostatus…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2- —http://dx.doi.org/10.13039/501100024811Nuffield Department of Population Health, University of Oxford
- —http://dx.doi.org/10.13039/501100000289Cancer Research UK
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHelicobacter pylori-related gastroenterology studies · Veterinary medicine and infectious diseases · Viral gastroenteritis research and epidemiology
Introduction
Helicobacter pylori (H. pylori) is a bacterium that colonizes the stomach. It causes gastric cancer and gastric mucosa-associated lymphoid tissue (MALT) lymphoma [16] and may be associated with colon cancer [33]. Globally, H. pylori causes 810,000 cancer cases annually [22].
Accurate assessment of H. pylori infection status has been a major methodological challenge in epidemiological studies of gastric cancer, where infection status is typically evaluated with serologic assays that measure IgG antibodies to H. pylori antigens. Advanced atrophic gastritis, a precursor of gastric cancer, can cause a reduction in antibody levels [18] which may lead to a negative test result in a cancer case despite a history of infection with H. pylori. Such false-negative findings attenuate the apparent association between H. pylori and gastric cancer. For this reason, the most reliable evidence for quantifying the relationship between H. pylori and gastric cancer comes from prospective studies with blood collection many years before diagnosis [16]. One such study is the China Kadoorie Biobank.
The China Kadoorie Biobank (CKB) [7] is a prospective study of half a million Chinese adults aged 30–79 years at the time of recruitment (2004–2008) which aims to analyse risk factors associated with many chronic diseases in the Chinese population. Infectious agents in the CKB population are measured by a custom multiplex serology panel that simultaneously measures 43 antigens for 19 pathogens [37] including 12 antigens for H. pylori [24].
In the present study, we consider the calibration of multiplex serology results for H. pylori in the CKB. The multiplex assay produces median fluorescence intensity (MFI) measurements on 12 H. pylori antigens. These 12 continuous measurements must be translated to a single binary positive/negative classification. In previous applications, this has been done by choosing cut-offs for each antigen and then classifying an individual as positive if four or more antigens had MFIs above their respective cut-offs [24]. This is not an optimal use of the test results. The practice of dichotomizing continuous predictors in biomedical research has been criticized on the grounds that it throws away useful information [1, 21, 32]. In addition, the decision rule that classifies an individual as positive if four or more antigens are above their respective cut-offs puts all antigens on equal footing. It is more likely that some antigens provide stronger evidence of H. pylori infection than others. In the present study, we investigate an alternative strategy using machine learning methods to optimize the binary classification.
A case-cohort study of H. pylori infection and gastric cancer was conducted in the CKB [36]. All participants in the case-cohort study were tested with multiplex serology, and a subsample was additionally tested with immunoblot (Helicoblot 2.1, Genelabs Diagnostics) (Genelabs [11]). This created an opportunity to calibrate multiplex serology against immunoblot, which has shown promising results in previous studies. Immunoblot has shown stronger associations between H. pylori and gastric cancer compared to ELISA [27]. Additionally, immunoblot does not discriminate between past and current H. pylori infections [34]. This may be an advantage in epidemiological studies where past infections may still constitute an important risk factor for gastric cancer. However, immunoblot is not scalable to large biobank studies as it is slow, labour intensive and requires a relatively large amount of plasma or serum compared to multiplex serology.
Statistically, the calibration problem may be considered a supervised learning problem for which a number of different classification algorithms are available (e.g. logistic regression, random forest, elastic net, Bayesian additive regressive trees (BART) and multidimensional monotone BART). An appealing characteristic of classification algorithms for use in this context is their ability to account for the differences in strength of evidence of infection provided by the various antigens. By design, the output of a classification model is based on predictor (here, antigen) weights which explain the relative influence of a predictor on the outcome (here, H. pylori serostatus). Also, classification models output probabilities, which allow for pragmatic cut-off points to be later decided, if necessary. We however provide predictions as probabilities due to the downsides of dichotomization, earlier referenced [1, 21, 32].
The work is structured as follows. In the “Methods” section, we introduce key concepts, the data sets considered, the algorithms compared, the performance metrics employed in evaluating prediction performance and the methods used in evaluating antigen importance. The “Methods” section concludes with an outline of the analysis workflow. In the “Results” section, we present study results and conclude in the “Conclusions” section with a discussion of key results.
Methods
In this section, we discuss five main themes: (1) the data sets utilized, (2) the classification algorithms compared, (3) antigen importance, (4) performance metrics used in assessing the models’ predictive performance and (5) the overall analysis workflow. All analyses were performed in R (version 4.0.5) [28].
Data
The data used in this study are quantities from the immunoblot and multiplex serology tests on China Kadoorie Biobank participants. All data were anonymized.
Serologic tests
Immunoblot
The assay and criteria for determining H. pylori seropositivity in the China Kadoorie Biobank, based on immunoblot, have been described in detail elsewhere [36]. The test detects antibodies to seven H. pylori antigens, including cytotoxin-associated gene A (CagA), vacuolating cytotoxin A (VacA) and Urease A (UreA) and yields one of three outcomes: (1) H. pylori seropositive, (2) H. pylori seronegative and (3) ambiguous, suggesting an inconclusive result. For the purposes of this study, we considered ambiguous results to be negative. The accuracy of the immunoblot test used in this study (Helicoblot 2.1) has been evaluated by Monteiro et al. [25], Veijola et al. [34] and Biranjia-Hurdoyal and Seetulsingh-Goorah [3]. Monteiro et al. [25] found that immunoblot correctly classified 51/52 (98%) H. pylori-positive patients and 55/64 (86%) H. pylori-negative patients using PCR in gastric biopsies as a gold standard. Veijola et al. [34] found correct classification in 27/27 (100%) patients with current H. pylori infection and 19/20 (95%) patients with no known infection using histologic stains in gastric biopsies as a gold standard. In addition, they found a positive immunoblot result for 77/84 (92%) patients with past H. pylori infection, leading them to conclude that immunoblot does not distinguish past and current infection. Biranjia-Hurdoyal and Seetulsingh-Goorah [3] reported sensitivity of 98.9% and specificity of 70.8% against stool samples with HpStaR as a gold standard.
H. pylori multiplex serology
Multiplex serology quantifies the level of antibody response to 13 H. pylori antigens. The test procedure is described elsewhere [24]. Briefly, recombinantly expressed H. pylori proteins were loaded separately onto fluorescence-labelled beads. After the antigen-loaded beads had been mixed, they were incubated with 1:1000 diluted sera. A Luminex 200 analyser then identified the bead type and consequently the bound antigen and simultaneously quantified the fluorescent signal corresponding to the amount of antibodies bound to the antigen-loaded beads. The output is given as the median fluorescence intensity (MFI) of at least 100 beads.
The proteins considered in the multiplex assay used in the China Kadoorie Biobank are as follows: CagA, VacA, chaperonin GroEL (GroEL), HP1564 (Omp), neuraminyllactose-binding hemagglutinin homolog (HpaA), hydantoin utilization protein (HyuA-C), cinnamyl alcohol dehydrogenase ELI3-2 (Cad), urease alpha subunit (UreA), neutrophil-activating protein (NapA), conserved hypothetical secreted protein (HcpC), catalase, and a hypothetical protein (HP0305). These antigens were selected [24] based on known immunogenicity (UreA, GroEl, NapA, CagA, VacA, HpaA) or suggested serological association with gastric disease (Cad, HyuA, Omp, HcpC, HP0305). Michel et al. [24] reported a sensitivity of 89% and a specificity of 82% against commercial assays (screening ELISA with Western blot confirmation).
We note that the two assays measure different antigens, with some overlap. This raises the question of whether one assay can be used to calibrate the other. We proceed on the basis that the underlying true state we are trying to assess is binary presence or absence of H. pylori infection. The individual antigens are indirect indicators of this underlying dichotomy.
Training and case data
The case-cohort study in the China Kadoorie Biobank excluded subjects who satisfied any of the following criteria: (i) Unavailable blood samples, (ii) cancer diagnosis prior to baseline, (iii) cancer diagnosis within the first 2 years of follow-up, (iv) death or loss to follow-up within the first 2 years of follow-up and (v) inconsistencies in data [36]. Here, we refer to the cohort of subjects which remained after the application of this exclusion criteria as the eligible cohort. In keeping with the design of a case-cohort study [20], a random subset of 2000 participants, stratified by age and sex, was selected from among the eligible cohort to form the subcohort.
Both immunoblot and multiplex serology were conducted on all cases of gastric cancer in the eligible cohort as well as on a fraction of subjects in the subcohort. The data from the serologic tests on this fraction of the subcohort (N = 498) was used for model training. Immunoblot testing was only feasible on a small number of participants due to its high cost. We considered the subcohort data too small to split into training and validation sets and therefore relied on cross-validation to assess the prediction error in the subcohort data as described below. The predictive accuracy was also tested on the gastric cancer cases (N = 922). Note however that the cases are expected to have a much higher prevalence of H. pylori than the subcohort data due to the known association of H. pylori with gastric cancer risk.
Antibody reactivities, by multiplex serology, were used as predictors (input variables) for H. pylori seropositivity by immunoblot (outcome variable). Thus, each data point used in model training and evaluation is of the form of Eq. (1):
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left({y}_{j},{x}_{1j},\dots {x}_{kj}\right), j = 1,\dots N,$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${y}_{j}$$\end{document} is the H. pylori infection status of subject j, taking on the value 0 if subject j is H. pylori seronegative and the value 1 if participant j is H. pylori seropositive. The value \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${x}_{ij}$$\end{document} represents antibody reactivity corresponding to antigen \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i (i = 1,...,12)$$\end{document} for participant j measured by multiplex serology. Prior to analysis, all predictor variables were converted to the log scale to control for the large variation observed within variables and standardized to have sample mean 0 and variance 1 across individuals.
Algorithms
Here, we give brief theoretical overviews of all algorithms considered. Our main focus is on algorithms based on decision trees as these are the natural generalization of the classification algorithm of Michel et al. [24]. Recall that Michel et al. [24] classified a multiplex test result as positive if four or more antigens were above their respective cutoff levels. A decision tree is an algorithm based on recursively splitting the data into two groups based on one of the predictor variables \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${x}_{i}$$\end{document} according to whether it is above or below a given threshold. This can be represented as a tree in which nodes represent splitting rules and branches represent different paths taken by different values of the predictor variables \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{x}}$$\end{document} . Given predictor variables \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${{\varvec{x}}}_{{\varvec{i}}}$$\end{document} for individual \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i$$\end{document} , we follow a path from the root node to one of the terminal nodes, which determines the prediction for the outcome variable \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${{\varvec{Y}}}_{i}$$\end{document} .
An individual decision tree is highly sensitive to changes in the training data and has large generalization error. Here, we use classifiers based on ensembles of trees, which are combined to produce a strong learner. We consider the random forest, Bayesian additive regression trees (BART) and multidimensional monotone BART (mBART) algorithms. We also consider logistic regression, with and without regularization, as this is one of the most commonly used algorithms in biostatistics. We distinguish between an algorithm and a model: an algorithm is a distinct procedure or method, whereas the model is the result of implementing the algorithm using data. Details of all algorithms are in the supplementary material.
Antigen importance
We examined how insights on the influence of antigens on H. pylori serostatus may be gained from the models developed here. We were interested in identifying the important antigens for H. pylori serostatus prediction according to each model and studying the extent to which different models agree on their importance.
For logistic regression, the influence of a predictor on an outcome is measured by the statistical significance and size of the coefficient estimate of the predictor variable [26]. In particular, predictors with large and statistically significant coefficients are considered highly influential, whereas predictors with small coefficients are considered less important for explaining the outcome.
In a random forest, the importance of a predictor may be estimated as a measure of the decrease in prediction accuracy when that predictor is excluded from the model [5]. This is based on the principle that the exclusion of a predictor which highly influences the outcome is expected to result in a substantial decrease in prediction accuracy. Details are in the supplementary material.
For BART, we assessed the importance of a predictor \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${{\varvec{x}}}_{i}$$\end{document} by estimating the proportion of times it is used in splitting a node; this is termed the inclusion proportion of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${{\varvec{x}}}_{i}$$\end{document} [17]. The principle underlying this method is the fact that influential variables (i.e., antigens) are more likely to be employed in node splitting due to their high relatedness with the outcome. Hence, predictors with higher inclusion proportions are considered more important for explaining the outcome.
Using BART antigen importance results as the baseline, we studied the extent to which the three models agree on variable importance, using the following steps.
- For each model as follows:
- Compute antigen importance scores and rank predictors in order of decreasing score.
- To each predictor, assign a rank value based on the predictor’s position in the ordering obtained at 1a. Assign ranks 1, 2 and 3 to the top four, middle four, and last four predictors, respectively.
- For each predictor, calculate the following:
- The absolute difference between its rank according to the logistic model and its rank according to BART
- The absolute difference between its rank according to random forest and its rank according to BART
For a given predictor, the possible values for the absolute difference in ranking between any two of the models being compared are 0, 1 and 2, which we interpret as “perfect agreement”, “moderate agreement” and “poor agreement”, respectively. The lower the absolute rank difference, the higher the agreement on antigen importance between the two models.
The elastic net and mBART models were not considered in the antigen importance analysis. The coefficients from the elastic net model are not interpretable as measures of antigen importance in the presence of interaction terms. For mBART, as far as we know, the implementation available from the authors does not allow for the estimation of variable importance. At the time of writing, we are not aware of any implementations of mBART which permit this analysis.
Performance metrics
All of our models create probabilistic predictions. Instead of producing a binary prediction \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Y}_{i}=1$$\end{document} or \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Y}_{i}=0$$\end{document} for the immunoblot serostatus, each model produces a probability estimate \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\widehat{p}}_{i}$$\end{document} that \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Y}_{i}=1$$\end{document} based on the model fitted to the training data. To compare the performance of the five models, we employed three performance metrics: the Brier score, the logarithmic score and the area under the receiver-operator characteristic curve. As each of these performance metrics has its downsides, we chose to use all three in order to obtain a balanced assessment of the methods considered.
For N observations in the testing set, let \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{p}}\boldsymbol{ }= \{{\widehat{p}}_{1},{\widehat{p}}_{2},...,{\widehat{p}}_{N}\}$$\end{document} be a vector of probabilistic predictions by a classification model and let \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\varvec{Y}}\boldsymbol{ }= \{{Y}_{1},{Y}_{2},...,{Y}_{N}\}$$\end{document} be the binary outcome representing H. pylori serosatus according to immunoblot.
Brier score
The Brier score [6] is the average of the squared prediction errors and is given by the following:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{1}{N}\sum\limits_{i=1}^{N}{\left({\widehat{p}}_{i}-{Y}_{i}\right)}^{2} .$$\end{document}The Brier score is negatively oriented (lower scores are better) and assumes values in the interval [0*,*1]. It is a strictly proper scoring rule [35], i.e., it is maximized if and only if the predicted probability distribution exactly matches the true (observed) distribution [12]. Here, we use the Brier score as the primary evaluation metric. That is, for a given problem, a prediction with the largest (resp. smallest) probability for an individual who is indeed H. pylori seropositive will be rewarded (resp. penalized) the most. A downside of the Brier score is its low sensitivity to small changes in extreme predictions [2].
Logarithmic score
The logarithmic score [13] (also “ignorance” score or log loss) is a measure of the cross-entropy (Kullback–Leibler loss) between the prediction distribution and the true distribution of events [2, 29]. The logarithmic score is given by the following:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-\frac{1}{N}\sum\limits_{i=1}^{N}\left[{Y}_{i}\text{log}{\widehat{p}}_{i}+\left(1-{Y}_{i}\right)\text{log}\left(1-{\widehat{p}}_{i}\right)\right] .$$\end{document}The logarithmic score takes on values in the interval \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$[0, \infty ]$$\end{document} . Like the Brier score, it is negatively oriented and is a proper scoring rule. The logarithmic score has been criticized for its hypersensitivity to negligible differences between small predicted probabilities and for its low sensitivity, under some conditions, to changes in the relationship between the observed and predicted distributions [31].
Area under the curve
The area under the curve (AUC) refers to the area under the receiver operating characteristic (ROC) curve. The ROC curve is a two-dimensional plot of the true-positive rate (sensitivity) against the false-positive rate (specificity) for a given classifier [23]. The AUC summarizes the discriminative ability of a classifier [9, 15]. In the current context, the AUC is equivalent to the probability that the classifier ranks a random H. pylori seropositive observation higher than a random H. pylori seronegative observation. The AUC takes values in the interval [0,1] for a ROC plot with unit area. The closer the AUC is to 1, the better the discriminative ability of the classifier. A classifier with AUC equal to 0.5 possesses the same discriminative ability as a random classifier.
The AUC is unaffected by arbitrary classification threshold choices as it is an average across a number of thresholds [9]. However, the AUC is incoherent in its treatment of misclassification costs across different classifiers and can be misleading when ROC curves cross [14] or when there are severe class imbalances [30].
Analysis workflow
To compare the performance of the five classification methods and to gain insights into the influence of individual antigens on H. pylori status, we used an analysis workflow with three stages summarized as follows: (1) Compare predictive performance of models on training data, (2) compare predictive performance of models on case data and (3) study antigen importance.
Stage 1: Compare predictive performance of models on training data
We trained all models via tenfold cross-validation (CV) and computed performance scores at each CV iteration. The training data (N = 498) was divided into 10 approximately equally sized parts (folds) such that all folds had approximately the same distribution of H. pylori positive and negative individuals as the training data. This was to ensure that models were trained on samples that were representative of the entire training data. The same folds were used for all models to ensure fairness.
Except for logistic regression which has no hyperparameters, all models used fivefold nested CV for selection of the best hyperparameter values at each iteration of the tenfold CV. This ensured better prediction accuracy and greater flexibility than would have achieved in a no-hyperparameter-tuning case, since models could adapt to the different training folds. In the BART and elastic net models, hyperparameters were selected by minimizing the root-mean-square error [17] and the misclassification error, respectively [10]. For random forest and mBART, hyperparameters were chosen to maximize the classification accuracy [19].
For each model, this analysis yielded 10 estimates each of the Brier score, logarithmic score and AUC. The mean and standard deviations of these estimates across folds provided a measure of comparison of predictive performance.
Stage 2: Compare predictive performance of algorithms on case data
The predictive performance of all algorithms was tested on the case data (N = 922). The algorithms were first trained via tenfold CV on the training data to allow the selection of the best hyperparameter values. Unlike at stage 1, the folds used in training at this stage were not fixed, as comparison on the training data was not the aim here. For each model, we used the best hyperparameter values from the tenfold CV to predict H. pylori serostatus in the case data. Given the predictions, we computed performance estimates for each model.
Stage 3: Study antigen importance
Using the best logistic regression, random forest and BART models on the training data, we computed antigen importance measures for each antigen and compared results across models.
For logistic regression, the absolute standardized regression coefficients and corresponding standard errors were considered as antigen importance estimates. The level of statistical significance for all coefficient estimates was α = 0*.*05. For random forest, the decrease in accuracy for each predictor in the random forest model was computed and averaged over 500 trees using the steps outlined in the “Antigen importance” section. For BART, we computed the inclusion proportions for all predictors in 100 replications of the model with m = 20 trees in each replication. The small number of trees was to ensure competition among predictors for inclusion in splitting rules, such that highly influential predictors (which yield greater accuracy when used for mode splitting) will tend to be preferred and hence have higher inclusion proportions than unimportant predictors [4, 8]. For each predictor, we computed the average inclusion proportion and the standard deviation of inclusion proportions across the 100 model replications.
To determine the extent of agreement between models on antigen importance, we computed the absolute differences in antigen rankings between the BART (baseline), random forest and logistic regression models, using the ranking method described in the “Antigen importance” section.
Results
The seroprevalence of H. pylori infection in the training and case data sets was 75.7% and 93.5%, respectively. We report training, validation and antigen importance results according to the analysis workflow.
Stage 1: Performance of algorithms on training data
Estimates of the predictive performance of the final calibrated models — that is, the models yielding the best performance on the training data — are shown in Table 1. We present the mean CV scores across the 10 folds \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pm$$\end{document} one standard deviation to represent the variation in scores across folds. The hyperparameter values of the final calibrated models are reported in Tables S1, S2 and S3 and Figure S1 of the supplementary material, along with the predictive accuracy scores (with standard deviations) of all other hyperparameter values considered in the calibration process. Table 1. Predictive performance of the logistic regression, elastic net, random forest, Bayesian additive regression trees (BART) and multidimensional monotone BART (mBART) models on the training data (N = 498). The table displays mean \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pm$$\end{document} one standard deviation of tenfold CV scores. Performance metrics considered were the Brier score, logarithmic score and the area under the receiver operating characteristic curve (AUC). The best score for each metric is in boldPerformance metricAlgorithmBrier scoreLogarithmic score****AUCLogistic regression0.0801±0.020.2535±0.070.9513±0.03Elastic net0.0765±0.020.2441±0.040.9523±0.02Random forest0.0767±0.030.2560±0.080.9581±0.03BART0.0746±0.020.2454±0.060.9583±0.03mBART0.0774±0.010.2552±0.040.9559±0.02
Overall, BART performed best according to the Brier score and AUC metrics, while elastic net had the best logarithmic score. Using the Brier score, the ranking of models in order of decreasing classification performance on the training data is as follows: BART, elastic net, random forest, logistic regression and mBART.
Stage 2: Performance of algorithms on case data
Estimates of the predictive performance of the final calibrated models on the case data are provided in Table 2. BART had the best performance according to the Brier score (0.0745) and the logarithmic score (0.2313), while elastic net had the highest AUC (0.9716). Logistic regression had the lowest performance on both the Brier and logarithmic scores, although on the AUC it performed better than random forest and BART. There are no error estimates associated with the performance estimates on the case data because, unlike the training data, performance estimates were not computed on different data folds. Table 2. Performance scores of the logistic regression, elastic net, random forest, Bayesian additive regression trees (BART) and multidimensional monotone BART (mBART) models on the case (test) data (N = 922). For all algorithms except logistic regression, the predictive model was selected via tenfold cross-validation on the training data. Performance metrics considered were the Brier score, logarithmic score and the area under the receiver operating characteristic curve (AUC). The best score for each metric is in boldPerformance metricAlgorithmBrier scoreLogarithmic score****AUCLogistic regression0.08760.29910.9695Elastic net0.08040.26930.9716Random forest0.07760.25220.9499BART0.0745****0.23130.9558mBART0.07520.23720.9573
Stage 3: Antigen importance
Antigen importance scores with corresponding variance estimates for all antigens are shown for BART (Fig. 1A), random forest (Fig. 1B) and logistic regression (Fig. 1C) trained on the complete training data.Fig. 1. Antigen importance measures based on the Bayesian additive regression trees (BART), random forest and logistic regression models trained on the complete analysis cohort data. A Mean inclusion proportions for predictors, averaged across 100 replications of a BART model, each with 20 trees. Error bars represent standard deviations of estimates. B Mean decrease in accuracy corresponding to each predictor, averaged over 500 trees in the random forest. Error bars represent standard deviations of estimates. C Absolute standardized regression coefficients for the logistic regression model. Error bars represent the standard errors of coefficient estimates
The absolute differences in antigen importance rankings between BART (baseline), random forest and logistic regression are shown in Fig. 2. There was perfect agreement in the ranking of the four most influential antigens, as all models ranked CagA, HcpC, GroEL and Omp in the top 4. Also, the three models agreed on the ranking of two of the least influential antigens: UreA and HyuA.C. The BART rankings agreed with both random forest and logistic regression rankings for 8 of 12 antigens ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\approx$$\end{document} 67%), namely CagA, HcpC, GroEL, Omp, HP0305, NapA, Cad and HyuA.C. The poorest ranking agreement between BART and the random forest was observed with UreA, Catalase, VacA and HpaA. In all three models, CagA had the highest influence on H. pylori serostatus.Fig. 2. Absolute differences in antigen importance ranking by logistic regression and random forest, relative to Bayesian additive regression trees (BART) rankings. Lower absolute differences indicate higher rank agreement between methods: a rank of “0” (purple) indicates no difference with BART rankings, and a rank of “1” (yellow) indicates a difference in one rank level between the model's ranking and BART's rankings
Conclusions
We have compared the performance of five classification algorithms — logistic regression, elastic net, random forest, BART and mBART — on H. pylori serologic data in the China Kadoorie Biobank. All trained models showed high discriminative ability, as all AUC estimates on the training and case data were at least 95% (Tables 1 and 2). The best discriminative ability (AUC \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$=9$$\end{document} 7%) was exhibited by the mBART model on the case data. We did not observe a substantial difference between the performance of models on the training data and their performance on the case data.
By the Brier score, BART had the best predictive performance on both the training and case data sets, although the differences between scores of the different models were small (Tables 1 and 2). The superior performance of the BART, mBART and random forest models in comparison to logistic regression is likely due to their ability to automatically account for interactions between predictors. Although logistic regression models can include interaction terms, we considered that without regularization, this would introduce instability in the model. Interaction effects were included in the elastic net model which showed a slightly better performance than logistic regression but a generally lower performance compared to the tree-based methods (BART, mBART and random forest). The automatic incorporation of predictor interactions is an appealing property of tree-based methods. Although not utilized here, the Bayesian methods (BART and mBART) also have the advantage of adapting to changes in prior information on the relationships between predictors [8].
The BART, random forest and logistic regression models showed high levels of agreement in antigen importance ranking, with an agreement rate of 67% on the 12 antigens considered (Fig. 2). These models agreed on CagA as the most influential antigen for the determination of H. pylori serostatus and on Cad and HyuA.C as among the least influential. The result for CagA is consistent with findings from previous studies. An earlier study on data from the China Kadoorie Biobank showed CagA as having the strongest associations with the risk of gastric cancer [36]. Additionally, Michel et al. [24] found CagA as the most immunogenic antigen among 17 H. pylori proteins and observed a high agreement for CagA (kappa: 0.77; 95% confidence interval (CI): 0.69–0.86) between immunoblot and multiplex serology results on 227 reference sera. Both assays used in this study test for antibodies to CagA. In contrast, the VacA and UreA proteins that are also common to both assays were not highly ranked by antigen importance.
Contrary to our expectations, the imposition of monotonicity constraints on the BART framework did not lead to improvements in predictive ability. It is not yet clear why mBART did not outperform BART, although the current prediction problem satisfies the monotonicity condition that higher antibody reactivity always implies increased probability of infection.
The study could be improved in a number of ways. First, statistical tests could be performed to determine which differences in performance scores between algorithms were due to random chance. Given the small differences observed between scores, such insights are needed to further elucidate the differences in predictive ability between these algorithms. Second, scores on the case data could be reported separately by H. pylori serostatus to help determine whether some algorithms are particularly suited for prediction in specific subgroups. Finally, the study could include an analysis of algorithm runtimes vis-à-vis their prediction accuracy to further our understanding of accuracy–speed trade-offs involved in the application of these algorithms, especially on large data sets as those held in biobanks. As this study is a calibration of a diagnostic assay, we did not consider other potential predictors that are not considered in the diagnostic procedure, for example demographic variables.
While the calibrated models developed here could be used to classify serostatus for all participants in the China Kadoorie Biobank, they might fail to generalize outside of the study population. Obtaining more generalizable models would require a repeat of the calibration exercise in a suitably chosen random subsample of the new study population, as we would expect the serological profile of the population to be different. Two key features of the China Kadoorie Biobank population are relevant to the generalizability of these results. Firstly, this is a high-prevalence population with 75.7% H. pylori infection in the controls. Under these circumstances, the accuracy of any serological test depends on its ability to identify uninfected individuals, i.e. its specificity. In a low-prevalence population, the balance between sensitivity and specificity in terms of diagnostic accuracy will be different. Secondly, a high proportion of infections are CagA positive. Antibodies to CagA have been shown to persist after H. pylori eradication and are therefore considered a useful marker of lifetime exposure [34]. Both immunoblot and multiplex assays detect antibodies to CagA, which was the most important variable in the calibration of multiplex serology to immunoblot. It is not clear whether equally good calibration is attainable in a population infected by CagA-negative strains. In conclusion, we have demonstrated how classification algorithms can be used to calibrate the H. pylori multiplex serology test to the immunoblot test in the China Kadoorie Biobank. This study furthers our understanding of the applicability of classification algorithms to the context of serologic tests.
Supplementary Information
Additional file 1: Table S1: Average out-of-bagroot mean square error for different hyperparameter combinations for the BART algorithm on the training data. Scores are displayed for all hyperparameter combinations in order of increasing RMSE. Averages are over validation folds within a tenfold cross-validation scheme. The hyperparameters \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$m$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document} represent the number of trees and the amount of prior information used in the model, respectively. The best hyperparameter combination, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{k=2, m=200\}$$\end{document} , was used for prediction among cases. Table S2: Average accuracy scores and standard deviations (SDs) for different hyperparameter combinations for the mBART algorithm on the training data**.** Averages and SDs are over validation folds within a tenfold cross-validation scheme. Scores are displayed for all hyperparameter combinations in order of decreasing accuracy. The hyperparameters \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$m$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document} represent the number of trees and the amount of prior information used in the model, respectively. The best hyperparameter combination (that which yielded the optimal score across the 10 folds) was used for prediction in the case data. Due to rounding, the accuracy scores of the combinations \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left\{k=2, m=200\right\}$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{k=3, m=200\}$$\end{document} appear equal; however, the second set has a higher average score and a lower SD. Table S3: Average accuracy scores and standard deviations (SDs) for different hyperparameter combinations for the random forest algorithm on the training data. Averages and SDs are over validation folds within a tenfold cross-validation scheme. Scores are displayed for all possible hyperparameter values in order of decreasing accuracy. The hyperparameter \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${m}_{try}$$\end{document} represents the number of variables used for node splitting. The best hyperparameter combination (that which yielded the optimal score across the 10 folds) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\{{m}_{try}=6\}$$\end{document} , was used for prediction in the case data. Figure S1: Misclassification error for different values of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document} in the elastic net (EN) algorithm on the training data within a tenfold cross-validation (CV) scheme. Red dots and grey bars represent the average misclassification error and standard deviation of misclassification errors, respectively, across the 10 CV folds for each value of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document} . The dashed vertical lines are plotted at the value of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document} with the minimum average CV error (first dashed line from the left) and at the largest value of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document} such that the error is within one standard error of the minimum (second dashed line from the left). The numbers displayed at the top of the plot represent the number of non-zero coefficient estimates at the corresponding values of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document} . The value of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda$$\end{document} with the minimum average CV error \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(log(\lambda)= -3.93)$$\end{document} was used within the EN model for prediction on the case data. The value of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha$$\end{document} (EN mixing parameter) was fixed at 0.90 to allow more weight towards a lasso penalty ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha$$\end{document} = 1) while still allowing for a ridge penalty ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha=0$$\end{document} ) to control multicollinearity effects.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Altman, DG Royston P. ‘Statistics notes: the cost of dichotomising continuous variables’, BMJ : British Medical J. 2006;332(7549):1080. Available at: 10.1136/bmj.332.7549.1080.10.1136/bmj.332.7549.1080 PMC 145857316675816 · doi ↗ · pubmed ↗
- 2Biranjia-Hurdoyal SD, Seetulsingh-Goorah SP. ‘Performances of four Helicobacter pylori serological detection kits using stool antigen test as gold standard’. PLOS ONE. 2016;11(10): e 0163834. Available at: 10.1371/journal.pone.0163834.10.1371/journal.pone.0163834 PMC 506328827736910 · doi ↗ · pubmed ↗
- 3Kokkola A et al. ‘Spontaneous disappearance of Helicobacter pylori antibodies in patients with advanced atrophic corpus gastritis’. APMIS. 2003;111(6):619–624. Available at: 10.1034/j.1600-0463.2003.1110604.x.10.1034/j.1600-0463.2003.1110604.x 12969017 · doi ↗ · pubmed ↗
- 4de Martel C, et al. ‘Global burden of cancer attributable to infections in 2018: a worldwide incidence analysis’. Lancet Global Health. 2020;8:e 180–e 190. Available at: 10.1016/S 2214-109X(19)30488-7.10.1016/S 2214-109X(19)30488-731862245 · doi ↗ · pubmed ↗
- 5Metz CE. ‘Basic principles of ROC analysis’. Seminars in Nuclear Medicine. 1978;8(4):283–298. Available at: 10.1016/s 0001-2998(78)80014-2.10.1016/s 0001-2998(78)80014-2112681 · doi ↗ · pubmed ↗
- 6Monteiro L, et al. ‘Evaluation of the performance of the Helico blot 2.1 as a tool to investigate the virulence properties of Helicobacter pylori’. Clinical Microbiology and Infection. 2002;8(10):676–679. Available at: 10.1046/j.1469-0691.2002.00438.x.10.1046/j.1469-0691.2002.00438.x 12390288 · doi ↗ · pubmed ↗
- 7Nick TG, Campbell KM. ‘Logistic regression’. Topics in Biostatistics. 2007;404:273–301.10.1007/978-1-59745-530-5_1418450055 · doi ↗ · pubmed ↗
- 8R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. 2021. Available at: http://www.R-project.org/.
