HIV-phyloTSI: subtype-independent estimation of time since HIV-1 infection for cross-sectional measures of population incidence using deep sequence data

Tanya Golubchik; Lucie Abeler-Dörner; Matthew Hall; Chris Wymant; David Bonsall; George Macintyre-Cockett; Laura Thomson; Jared M. Baeten; Connie L. Celum; Ronald M. Galiwango; Barry Kosloff; Mohammed Limbada; Andrew Mujugira; Nelly R. Mugo; Astrid Gall; François Blanquart; Margreet Bakker; Daniela Bezemer; Swee Hoe Ong; Jan Albert; Norbert Bannert; Jacques Fellay; Barbara Gunsenheimer-Bartmeyer; Huldrych F. Günthard; Pia Kivelä; Roger D. Kouyos; Laurence Meyer; Kholoud Porter; Ard van Sighem; Mark van der Valk; Ben Berkhout; Paul Kellam; Marion Cornelissen; Peter Reiss; Helen Ayles; David N. Burns; Sarah Fidler; Mary Kate Grabowski; Richard Hayes; Joshua T. Herbeck; Joseph Kagaayi; Pontiano Kaleebu; Jairam R. Lingappa; Deogratius Ssemwanga; Susan H. Eshleman; Myron S. Cohen; Oliver Ratmann; Oliver Laeyendecker; Christophe Fraser

PMC · DOI:10.1186/s12859-025-06189-y·August 14, 2025

HIV-phyloTSI: subtype-independent estimation of time since HIV-1 infection for cross-sectional measures of population incidence using deep sequence data

Tanya Golubchik, Lucie Abeler-Dörner, Matthew Hall, Chris Wymant, David Bonsall, George Macintyre-Cockett, Laura Thomson, Jared M. Baeten, Connie L. Celum, Ronald M. Galiwango, Barry Kosloff, Mohammed Limbada, Andrew Mujugira, Nelly R. Mugo, Astrid Gall, François Blanquart

PDF

Open Access

TL;DR

This paper introduces HIV-phyloTSI, a method to estimate the time since HIV infection using deep sequencing data, enabling more accurate population-level tracking of the HIV epidemic.

Contribution

HIV-phyloTSI provides a continuous, subtype-independent time-since-infection estimation using within-host diversity and divergence from deep sequencing data.

Findings

01

HIV-phyloTSI estimates TSI up to 9 years with a mean absolute error of less than 12 months.

02

The method performs equally well across all major HIV subtypes in African and European cohorts.

03

It achieves less than 5 months error for infections within the first year.

Abstract

Estimating the time since HIV infection (TSI) at population level is essential for tracking changes in the global HIV epidemic. Most methods for determining TSI give a binary classification of infections as recent or non-recent within a window of several months, and cannot assess the cumulative impact of an intervention. We developed a Random Forest Regression model, HIV-phyloTSI, which combines measures of within-host diversity and divergence to generate continuous TSI estimates directly from viral deep-sequencing data, with no need for additional variables. HIV-phyloTSI provides a continuous measure of TSI up to 9 years, with a mean absolute error of less than 12 months overall and less than 5 months for infections with a TSI of up to a year. It performs equally well for all major HIV subtypes based on data from African and European cohorts. We demonstrate how HIV-phyloTSI can be…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Human immunodeficiency virus 1(no rank)

Diseases2

HIV infections

Figures8

Click any figure to enlarge with its caption.

Distribution of time since infection (TSI) and seroconversion intervals in the training datasets. Vertical lines show duration of the seroconversion interval, from three weeks prior to the last HIV-negative test to the date of the date of the first HIV-positive test. TSI was defined as the time in years between the sampling date and the midpoint of the seroconversion interval (circles)

Genetic divergence (LRTT) and diversity (MAF3c, MAF12c) values along the HIV genome in samples grouped by time since infection. Lines show mean root-to-tip distance of the largest subgraph (LRTT), and mean minor allele frequencies at the third codon position (MAF3c) and first and second positions (MAF12c). MAF3c and MAF12c measures shown as mean per 250 bp window. The shaded area indicates the width of the 95% bootstrap confidence interval around the mean

CD4 cell count, viral load and age can be informative. A Most recent CD4 cell count obtained closest to sequencing sampling date. B Viral load (log_10_ copies per mL) obtained closest to sequencing sampling date. C Distribution of age at sampling in different cohorts in the training data set

Model r^2^ and accuracy of identifying recent infections. Markers show mean and lines show 95% bootstrap CI over 20-fold cross-validation. A Model r^2^ score on test data. B Accuracy, defined as the proportion of samples having TSI correctly estimated as being below or above a cut-off of either 12 months (circles, blue) or 18 months (triangles, orange)

HIV-phyloTSI is more accurate on population level. A TSI estimates compared with known midpoint TSIs, with point estimates indicated as crosses and seroconversion intervals as lines. Regression line shown in black with confidence interval as shaded grey area. The green line (overlapping with the black line) indicates equalityFig. 6HIV-phyloTSI can predict recency at different cut-offs in a simulated population. Number of recent infections in a simulated population of 1000 individuals where average time to treatment is 3 years, with the recency cut-off varied between 3 and 26 months. The blue l

HIV-phyloTSI is unbiased with respect to subtype. A Model bias (difference between real and estimated TSI value) by square-root transformed time since infection, coloured by subtype. Circles indicate mean for groups of 0–1, 2–4, 5–9 and 10–16 years since infection, and vertical bars indicate the 95% bootstrap confidence interval in each group, for all subtypes. Lines connect group means to aid visualisation. B Residuals from the linear model used to adjust for TSI of samples (Bias ~ TSI), shown by subtype, for subtypes with at least 5 samples in the dataset. Boxes extend to first quartile from

Funding4

—http://dx.doi.org/10.13039/100000865Bill and Melinda Gates Foundation
—http://dx.doi.org/10.13039/100000060National Institute of Allergy and Infectious Diseases
—http://dx.doi.org/10.13039/501100000925National Health and Medical Research Council
—http://dx.doi.org/10.13039/501100000781European Research Council

Keywords

HIVNext-generation sequencingRandom forestRecency of infectionTime since infection

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsHIV Research and Treatment · HIV/AIDS Research and Interventions · Animal Disease Management and Epidemiology

Full text

Introduction

Accurate estimates of HIV incidence are critical for surveillance of HIV epidemics and to determine the effectiveness of prevention efforts [1]. HIV incidence, the rate at which new infections arise from the susceptible part of the population, tracks the leading edge of the epidemic, measuring transmission at a given moment in time. The traditional longitudinal cohort approaches for measuring incidence are costly, and incidence derived from a cohort with many HIV prevention interventions is likely to give biased estimates [2–4].

A cross-sectional study design would be preferable for measuring incidence, and recent methodological advances have brought this possibility closer. Cross-sectional HIV incidence estimation is currently based on algorithms that use observable changes in the biomarker profile of individuals to class infections as recent or non-recent [4]. Identification of individuals with recent HIV infection allows one to estimate incidence by counting the number of such individuals within the population/cohort and adjusting for the time period associated with that recent state. Using the quality of the antibody response to HIV infection as a host biomarker, these methods have been used to estimate HIV incidence at a country level [5], within specific risk groups [6], as the primary outcome of intervention trials, and have been considered for counterfactual incidence estimation to determine the effectiveness of Pre-Exposure Prophylaxis (PrEP).

An alternative to host response biomarkers is measurement of the genetic diversity of HIV sequences found in individual infections. It has long been recognised that in the absence of treatment, HIV viral genetic diversity increases with duration of infection in a single host [7, 8], and several studies have used measures of HIV genetic diversity to estimate HIV incidence. These studies use several sequence-based and non-sequence-based parameters: nucleotide ambiguity [9, 10], Hamming distance—Q10 [11], generalized entropy [12], pairwise distances [13], time to most recent common ancestor (MRCA) [14] and high-resolution melting diversity assays [15]. However, no single approach has demonstrated sufficient precision for HIV incidence estimation.

In recent years, next-generation sequencing (NGS) data has shown promise for identifying recent infections from genetic sequence data alone. Puller et al. showed that a new NGS-based measure (3rd base ambiguity in the pol gene) can be used to estimate time since HIV-1 infection (TSI) many years after the infection, in contrast to most alternative biomarkers. The study was based on data from 42 patients in [16] and is yet to be validated in a larger population. TSI has also been estimated using average pairwise diversity [17], and a combination of pairwise sequence diversity and divergence in sub-sequences within pol and env in 30 individuals [18]. These studies also show that sequence methods have the potential to estimate TSI for the entire duration of infection, rather than classifying individuals as ‘recent’ or ‘non-recent’. A seminal study from 1999 demonstrated that “DNA Distance”, or longest root to tip distance (LRTT), increased almost linearly with time since infection (TSI) [7, 8]. However, LRTT relies on multiple sequences from a single time point, which historically has been costly to compute and therefore seen as impractical for population studies. This is changing with the widespread adoption of next-generation sequencing.

In this study, we present a new method for estimating TSI of all major HIV-1 subtypes from deep sequence data, using both LRTT and diversity: HIV-phyloTSI. HIV-phyloTSI employs a machine-learning algorithm, using data from four different cohorts representing 480 individuals with known approximate TSI as a training dataset. Performance was evaluated in two ways: first, using a simulated dataset, and second, using a dataset from the HPTN-071 (PopART) Phylogenetics Ancillary study [19, 20] The output of HIV-phyloTSI are continuous estimates of TSI, rather than a binary classification of samples as recent and non-recent, and the method is sufficiently accurate to estimate HIV incidence in a cross-sectional cohort without use of additional demographic variables.

Results

Participants and datasets

HIV genome sequences were made available by the PANGEA and BEEHIVE consortia following consortium-specific guidelines for data access. PANGEA [21] is a network of collaborators from Africa, Europe and the United States (US) who have generated a large number of HIV NGS sequences from Eastern and Southern Africa. BEEHIVE [30] is a network of HIV cohorts that collected HIV samples from HIV seroconverters across Western Europe (Finland, France, Germany, the Netherlands, Switzerland, United Kingdom) and in Uganda. The samples used for this study were collected in Uganda, Kenya and South Africa by MRC Uganda/UVRI [22–24] (MRC), the University of Washington International Clinical Research Centre programs [25–27] (UWP) and the Rakai Community Cohort Study [28, 29] (RAK), and the UK cohort from BEEHIVE (BEE). Participants in all studies were treatment-naive and viraemic at the time of sampling. Participants in all studies gave written consent and ethics approvals were granted to the institutions that generated the data. Participant characteristics for the different datasets are listed in Table 1.Table 1. Participant characteristics in training datasetUW clinical research centre programsRakai community cohort studyBEEHIVEMRC-UVRI Uganda(UWP)(RAK)(BEE)(MRC)Participants114152113101Samples selected from cohort160152113102Age, median (range)34.7 (21.9–55.5)32.0 (17.0–50.0)33.1 (20.4–62.8)30.2 (17.0–75.4)Proportion female0.420.510.030.46Years since infection, mean (range)0.4 (0.0–1.4)2.5 (0.6–17.9)0.7 (0.1–2.1)0.6 (0.1–21.0)Proportion of infections < = 12 months0.840.240.60.59 < = 24 months1.000.340.950.64Predominant subtypesA1, D, CD, A1, CB, C, F1D, A1, Blog10 viral load mean (range)4.9 (3.4–6.8)3.1 (0.0–6.4)4.6 (3.1–5.6)4.3 (2.9–5.2)CD4 cell count mean (range)503 (33–1664)442 (15–1626)501 (140–1062)623 (197–1597)* CD4 cell count data and viral load data obtained closest to the date of the sequencing sample were available for 84% of samples (443/527) and 58% of samples (308/527)

Datasets were chosen to include participants with known seroconversion intervals (UWP and BEE) and participants with recent as well as participants with non-recent infections (RAK and MRC) (Fig. 1). Participants with long seroconversion intervals, which were almost entirely long-term infections, were retained to avoid model bias towards recent infections (class imbalance). No other pre-selection was made, but as only sequenced genomes were included, this of necessity implies detectable viral load (above 10^2^ copies per ml). Cohorts from different geographical locations were included to ensure a representation of subtypes A, B, C, and D. In total, the training dataset consisted of 527 sequences from 480 participants (Table 1). Samples from the same individual were retained, as there was no significant difference in their pairwise similarity across predictor metrics, compared with other samples from the same dataset and TSI range (0–1.15 years, median sampling interval 0.7 years; median Spearman correlation 0.79 and 0.78 for samples from the same person vs matched unique samples).Fig. 1. Distribution of time since infection (TSI) and seroconversion intervals in the training datasets. Vertical lines show duration of the seroconversion interval, from three weeks prior to the last HIV-negative test to the date of the date of the first HIV-positive test. TSI was defined as the time in years between the sampling date and the midpoint of the seroconversion interval (circles)

HIV diversity and divergence increase with the duration of infection

Minor allele frequency (MAF) has been shown to increase over the course of an HIV infection and has previously been used to identify recent infections [9, 10, 16, 17, 31]. In this study, we split MAF into two variables, designated MAF12c for the first two codon positions and MAF3c for the third codon position, to allow for their different evolutionary rates. Both MAF12c and MAF3c increased with TSI (Fig. 2), at rates that varied by HIV gene. The most rapid increase was observed for MAF3c in the gp120 and gp41 regions; this is consistent with the high substitution rate in the env gene relative to the more constrained pol and gag genes. MAF3c was informative across the genome, particularly in gag, pol and gp120 (Figure S1), while MAF12c remained highly constrained in gag and pol, but was markedly more informative in gp120 (Figure S1). We hypothesized that estimates could be improved by using more complex statistics derived from phylogenetic trees. We therefore constructed intra-host diversity trees in sliding windows along the genome using *phyloscanner *[32] and tested if pairwise patristic distance, overall root-to-tip length or longest root-to-tip distance (LRTT) were predictors of TSI (Figure S2). LRTT performed best, and was included alongside MAF12c and MAF3c as a predictor in this study. LRTT was particularly informative in gp120 and for non-recent infections (Fig. 2). The performance of all three predictors was not influenced by the sequencing method used (amplicon-based or probe-based) (Figure S3).Fig. 2. Genetic divergence (LRTT) and diversity (MAF3c, MAF12c) values along the HIV genome in samples grouped by time since infection. Lines show mean root-to-tip distance of the largest subgraph (LRTT), and mean minor allele frequencies at the third codon position (MAF3c) and first and second positions (MAF12c). MAF3c and MAF12c measures shown as mean per 250 bp window. The shaded area indicates the width of the 95% bootstrap confidence interval around the mean

To determine whether some genomic locations were particularly informative, we assessed the correlation of MAF12c, MAF3c, and LRTT values across the HIV genome with the known TSI (Figure S4). Informativeness was more variable for the MAF predictors than for LRTT. MAF3c was most informative in the 3’ region of gag, the 5’ region of pol, and the 5’ region of gp120; MAF12c was most informative in the 5’ region of gp120; and LRTT was informative in all of gag, pol and gp120. The most informative windows, defined as having r^2^ over 0.3 for any predictor, were included as one of the feature combinations during model selection (LRTT_MAF3c_topwin). MAF3c alone was highly informative, but performance was improved by including other parameters.

In conclusion, LRTT adds a measure of TSI that is additive to MAF. Both predictors perform best in the gp120 region of the env gene, however, all regions of the genome are informative. A model should ideally aggregate information across all genomic regions to become more powerful.

CD4 cell counts and age are informative but are not essential for estimating TSI

Limited clinical data were available for the sequences in the training dataset. CD4 cell counts and viral load measurements obtained close to the date of the sequencing sample were available for 84% (443/527) and 58% (308/527) of the samples, respectively. Mean CD4 cell count and mean log10 viral load were significantly different between recent and non-recent infections (CD4 count, 543 v 451 cells/mm3; log10 viral load, 4.9 v 4.3; p < 0.001, Welch's t-test, with recency cut-off of 12 months) (Fig. 3A, B), but with high variance and substantial overlap between the distributions. In datasets where CD4 cell count data are available from the sample collection date, recency estimates may be further improved by incorporating this variable. In the present dataset, gains were marginal (< 1 percentage point in r^2^). Age was informative in the two population cohorts (RAK and MRC), but not in the two cohorts enriched for seroconverters (UWP and BEE) (Fig. 3C). We opted against including age as a variable in the present study so as to keep the model as generally applicable as possible.Fig. 3CD4 cell count, viral load and age can be informative. A Most recent CD4 cell count obtained closest to sequencing sampling date. B Viral load (log_10_ copies per mL) obtained closest to sequencing sampling date. C Distribution of age at sampling in different cohorts in the training data set

The estimates were not influenced by the presence of drug resistance mutations: in a conservative test, removal of all genomic windows containing positions of known drug resistance mutations slightly decreased model r^2^ (< 2 percentage points), consistent with a limited contribution of these windows to the overall model. We also tested the effect of taking only the last sample for individuals with > 1 sample, and found no difference to the model r^2^. Given the dates of sampling, none of the participants was infected while using PreP.

Models based on diversity and divergence yield good accuracy and false recency rates

The chosen predictor variables—MAF12c, MAF3c and LRTT—were calculated for each window of 250 base pairs and then averaged separately across gag, pol, env gp120 and env gp41, as well as across the whole genome. In five samples, missing data meant that gene-averaged values could not be derived for one or more genes; these were filled by imputation. The tested feature combinations are included as a supplementary file.

To select a regression method, we tested performance of three commonly used methods (ordinary least squares (OLS), gradient boosted regression, and random forest regression), using a subset of feature combinations to estimate the square-root transformed TSI. The square-root transformation helped the model accommodate greater uncertainty around the actual TSI values of non-recent infections, for which the ground truth was generally less precisely known in our training data (Fig. 1). Cross-validated model scores (mean r^2^ values) were computed using tenfold cross-validation, in each iteration splitting the data randomly into 75% training and 25% test datasets and scoring model performance on the test data (Table S1).

OLS performed adequately for the simplest models, where a single feature was used in addition to the indicator variable for sequencing method, but was outperformed by gradient boosted and random forest regression for larger feature sets. The cross-validated r^2^ scores were similar for gradient boosted and random forest regression, with random forest slightly outperforming gradient boosted regression on all feature combinations assessed (Table S1). Scatterplots of known versus estimated square root-transformed TSI for each of the feature combinations are shown in Figure S5. Random forest regression was selected for subsequent analyses.

To assess model performance with different combinations of features, we compared several metrics using 20-fold cross validation: mean r^2^ on test data (strength of correlation with known TSI) (Fig. 4A), accuracy of predicting infections as having TSI below either 12 or 18 months (Fig. 4B), mean absolute error (MAE) (Figure S6A) and the false recency rate (FRR) (Figure S6B). Since our model predicted continuous TSI, we were able to obtain binary estimates (recent/non-recent) at any arbitrary recency cut-off after running the model. We compared this continuous approach to a ‘classification’ approach that uses a binary outcome throughout, based on a pre-decided threshold of recent versus non-recent infections. Specifically, we used the same feature sets to fit random forest classifier models, with recency defined as true TSI below either 12 or 18 months. For a TSI threshold of 12 months, regression with discretisation at 12 months outperformed classification for all feature sets (Figure S7). For a TSI threshold of 18 months, classification showed an improvement for the most complex feature sets that included individual genomic windows (Figure S7B), but without a corresponding improvement in overall accuracy (Figure S7A). We used the regression approach in subsequent analyses.Fig. 4. Model r^2^ and accuracy of identifying recent infections. Markers show mean and lines show 95% bootstrap CI over 20-fold cross-validation. A Model r^2^ score on test data. B Accuracy, defined as the proportion of samples having TSI correctly estimated as being below or above a cut-off of either 12 months (circles, blue) or 18 months (triangles, orange)

Feature sets either used information from individual genomic windows (‘FULL’ in feature set name), or aggregated information from all windows covering a given gene or the whole genome (‘MEAN’ in feature set name). Feature aggregation resulted in a gain of power: models built with aggregated feature sets had higher r^2^ scores and accuracy than those built with individual genomic windows (Fig. 4). Accuracy varied depending on how genomic features were aggregated (Fig. 4). This was mostly due to missing data. For the same feature combinations, the best-performing aggregate-feature model MEANS_feats_LRTT, which comprised ten aggregated features across gag, pol, gp120 and gp41 as listed in Table S2, had a cross-validated r^2^ score of 0.68 and accuracy of 0.89, while the best-performing individual-windows model FULL had an r^2^ score of 0.61 and accuracy of 0.84. The overall best performing models were generated by the feature set “MEANS_feats_LRTT”, with the greatest proportion of variance explained by MAF and LRTT in the gag region, followed by LRTT in gp120 and pol (Figure S8). The indicator variable for the type of sequencing, is_mrc, carried relatively little importance, suggesting that the results remain robust to the type of sequence data (amplicon-based or capture-based).

In conclusion, random forest regression was chosen over OLS and gradient boosted regression, and a continuous regression over a binary classification. Models with aggregated feature sets performed better than those using individual genomic windows. The best performing aggregated feature set, MEANS_feats_LRTT, was used to fit the final predictive model.

HIV-phyloTSI is suitable for population-level predictions

We evaluated the final model on the original dataset using a leave-one-out strategy on the original dataset (Fig. 5), on a simulated population (Fig. 6) and on an independent dataset (Fig. 7).Fig. 5HIV-phyloTSI is more accurate on population level. A TSI estimates compared with known midpoint TSIs, with point estimates indicated as crosses and seroconversion intervals as lines. Regression line shown in black with confidence interval as shaded grey area. The green line (overlapping with the black line) indicates equalityFig. 6HIV-phyloTSI can predict recency at different cut-offs in a simulated population. Number of recent infections in a simulated population of 1000 individuals where average time to treatment is 3 years, with the recency cut-off varied between 3 and 26 months. The blue line shows the true number; the orange line shows the number estimated by HIV-phyloTSI; the green line shows the subset of estimated recent infections that were falsely estimated to be recent (i.e., were in truth non-recent)Fig. 7HIV-phyloTSI is suitable for population-level predictions. Distribution of TSI estimates for members of the PopART Population Cohort (PC), by HIV status. ‘Baseline positive’ indicates individuals positive for HIV at enrolment (1041 samples); ‘seroconverter’ indicates individuals who acquired HIV during the study period (204 samples). Boxes show extent of first quartile from the median (line); whiskers extend to 1.5*IQR

Estimates of TSI were made for all 527 samples in a leave-one-out strategy, iterating over the full sample set, building a random forest model on the other 526 samples with the previously selected parameters, and obtaining a point estimate together with the full conditional distribution of estimates from the 1000 decision trees in the forest. The standard deviation of the conditional distribution was used to estimate 95% prediction intervals around the point estimates (Fig. 5). A separate random forest regressor was trained on the standard deviation of predictions to estimate prediction intervals for the final model.

Figure 5 shows the overall fit for the best performing model with feature set MEANS_feats_LRTT, comparing known seroconversion intervals against TSI estimates and prediction intervals from the model. The overall regression line is close to the line of equality, indicating that over the majority of TSIs, recency and non-recency are misclassified to the same extent, despite a tendency to slightly overestimate TSIs below 1 year, and underestimate the TSIs greater than 9 years (Figure S9). This implies that although precision is insufficient for an individualised clinical assay, population-level incidence estimates are likely to be accurate, at least in datasets with similar distribution of recency.

The suitability of HIV-phyloTSI for population analysis was tested by applying the model to a simulated population (Fig. 6, Table S2). One thousand people were drawn from a population with an average interval of three years from infection to treatment. The duration of infection of individuals in this population with TSI w was modelled as 1−e^−0.3w^. The predicted fraction of infections with estimated TSI up to w in this population, and the proportion of these that were incorrectly classified, were calculated using the false and true recency rates for the model at values of w between 3 and 36 months. Plotted is the number of recent infections out of the 1000 people depending on where the cut-off for recency is set. The model predictions (orange line) closely track the numbers expected from the simulated data (blue line).

Next, we applied the model to an independent dataset, namely the HPTN 071–02 (PopART) Phylogenetics ancillary study [20] (Fig. 7). The study generated HIV1 sequences from HPTN 071 trial participants in Zambia [33]. We sequenced samples from 204 participants who were HIV negative at enrolment and seroconverted both during the trial and less than one year after their last negative test, shown in orange, and 1041 participants who were positive at enrolment, shown in blue. Most, but not all of these participants were likely infected for more than a year before enrolment. HIV-PhyloTSI results reveal a highly significant difference between the distribution of TSI for samples in the two groups (p < < 0.001, Welch’s t-test). The median estimated TSI for baseline positives was 2.28 years (interquartile range 1.01–4.11) and for seroconverters 0.74 years (IQR 0.47–0.94). Of the 204 seroconverters with known seroconversion intervals, 7 (3.4%) were incorrectly predicted based on a lower TSI limit of over 12 months.

In conclusion, HIV-phyloTSI is able to predict recency with sufficient accuracy in population-level analyses.

HIV-phyloTSI is able to predict recency of infection for all subtypes

Recency assays based on MAAs have shown variable performance for different HIV-1 subtypes [34]. We therefore tested whether the model performance showed bias in predictions for any subtype. Our dataset included at least 100 samples for each of subtypes A-D (A1, 138; B, 101; C, 232; D, 147) as well as small numbers of other subtypes and circulating recombinant forms, which we grouped for this analysis (“Other”, 113). Adjustment for TSI was required since model error increases with TSI regardless of subtype (Fig. 8A and S9), and the over-representation of subtypes A1 and D among non-recent samples in our dataset would otherwise inflate the error range for these subtypes. After adjusting for TSI, there was no difference in bias for any of the subtypes (Fig. 8B and Table S3, p > 0.05 for every pairwise comparisons, Tukey’s range test), indicating that model performance was independent of subtype.Fig. 8HIV-phyloTSI is unbiased with respect to subtype. A Model bias (difference between real and estimated TSI value) by square-root transformed time since infection, coloured by subtype. Circles indicate mean for groups of 0–1, 2–4, 5–9 and 10–16 years since infection, and vertical bars indicate the 95% bootstrap confidence interval in each group, for all subtypes. Lines connect group means to aid visualisation. B Residuals from the linear model used to adjust for TSI of samples (Bias ~ TSI), shown by subtype, for subtypes with at least 5 samples in the dataset. Boxes extend to first quartile from the median (line) and whiskers to 1.5*IQR. Model bias increases with TSI for all subtypes, but is not significantly different between subtypes (p > 0.05 for pairwise comparisons, Tukey range test)

Discussion

In this study, we present a method that uses viral genetic diversity (MAF) and divergence (LRTT) to estimate a continuous measure of time since infection (TSI) from a single HIV-1 NGS sequencing sample per patient. Any value of TSI can be chosen as a threshold to give a binary recency assay, with an accuracy of up to 89%. Additional metadata can be used if available for a specific setting, but is not required to make predictions. The method works equally well on HIV-1 subtypes A-D and is accurate enough for population-level estimates of incidence.

HIV-phyloTSI is a generic method derived using samples from a range of populations in Western Europe and Eastern and Southern Africa. It was developed for population-level analyses performed in the PANGEA-HIV consortium [21]. When used as a standalone tool, it enables the inference of epidemiological information without requiring extra participant data. The model can be further extended to incorporate moderately informative variables such as CD4 cell count and viral load. For sequences obtained with veSeq-HIV, viral load estimates can be obtained directly from counts of uniquely mapped sequencing reads, offering an additional sequence-derived parameter without the need for additional testing [35, 36]. Although CD4 cell count and viral load provided no substantial gain in accuracy with our training dataset, these could offer higher gains for studies where these data are available from the same date as the sequencing samples. Lundgren et al. recently explored the use of biomarkers to augment phylogenetic TSI estimates [37].

There was a slight (< 1 percentage point in r^2^) improvement when the model was extended with the addition of age at sampling, but there are good reasons to avoid including demographic markers in a generic model, as this may bias the model towards the type of population used for training. In our case, the training data contained two large seroconverter cohorts (BEE, UWP) and the association with age was only evident in the population cohorts (RAK, MRC). Likewise, we did not include subtype in the model, to avoid spurious association with subtypes preferentially represented in the seroconverter cohorts and to maintain the generalisability of the model.

HIV-phyloTSI performs best when used for population-level predictions, where its accuracy is sufficiently high that the total number of infections identified with any given TSI corresponds closely with the expected number of infections at that TSI. The close correspondence between expected and observed population-wide estimates of recency is particularly useful, as it allows the identification of recent infections not only in growing epidemics in which recent infections are common, but also in declining epidemics where recent infections are substantially outnumbered by non-recent infections. Although HIV-phyloTSI can be used to give an estimate of the duration of infection on an individual level, the uncertainty on individual level is likely to be too high for clinical applications. The level of intrahost diversity as a proxy for duration of infection however is informative on an individual level and might become relevant for clinical management and potential HIV cure once treatment goes beyond classical ART. The size of the viral reservoir is lower in recently infected individuals compared to those in the chronic stage of infection, and thus easier to eradicate the virus [38]. The utility of knowing the duration of infection at an individual level is useful for HIV cure protocols [39], and staging of individuals with recent infection is currently being used in clinical HIV cure trials [40].

The method has limitations. Any machine-learning model depends on how well classified the training data set is, and the large uncertainties in the seroconversion dates of non-recent samples in our training datasets necessarily limit the accuracy of the model. The training data was selected from different cohorts in sub-Saharan Africa, but validation is still needed to establish that the results are further generalisable. The method also requires sufficient virus from an infection to generate not only a consensus genome but also accurately represent intra-host diversity. Viral loads in this study (2–7 log_10_ copies per mL) are representative of the majority of viraemic individuals. However, in a real population in the era of universal testing and treatment, a substantial number of individuals will be virally suppressed, and a substantial proportion of these may be incident infections. At present, incidence surveys such as PHIA [41] assume that all virally suppressed infections are non-recent, which tends to underestimate incidence in settings where ART may be initiated early in infection. Ideally, recency estimates from viraemic individuals would be combined with multi-assay algorithms (MAA) that generate recency estimates of individuals with viral suppression, to yield more robust population estimates of incidence, as pioneered by Ragonnet-Cronin et al. [31]. We and others are currently working on methods to obtain full-length or near-full-length HIV-1 genomes from samples with a low viral load. Finally, the requirement for deep-sequence data carries a substantial laboratory and computational cost. However, sequencing is rapidly becoming more available and affordable in low-income countries, and the cost of computational power is declining.

We did not test HIV-phyloTSI on sequences obtained by proviral sequencing from individuals with natural or ART-induced viral suppression. The effects of ART on within-host HIV-1 diversification and divergence are not well studied, but it is to be expected that both will be profoundly reduced. A similar effect has been described for PreP in two small-scale studies, one in macaques and one in humans [42, 43]. Both studies find reduced diversity in the HIV viral population four and ten months after infection while taking PreP, respectively. More research is needed to adjust sequence-based methods for determining TSI in virally suppressed persons and for estimating HIV incidence in populations with high levels of viral suppression.

It will be important to test the performance of HIV-phyloTSI in different settings with different infection dynamics. If performance remains comparable to results obtained for the PopART Phylogenetics samples, the method could be used to measure HIV-1 incidence in a new population using a relatively small-scale phylogenetic cross-sectional survey. By sampling at least 1000 participants in an area with medium-level prevalence and incidence, it should soon be possible to estimate incidence using a mathematical model taking into account the fraction of HIV-positive participants, the fraction of viraemic participants and the TSI for viraemic participants. This approach for measuring incidence would require minimal data collection from participants and would be much less costly than measuring HIV incidence through repeated follow-up data collection in longitudinal cohorts.

In summary, HIV-phyloTSI is a powerful tool for estimating TSI, obtaining population-level recency estimates and estimating population-level incidence. Further work is required, but if performance is maintained in different settings, an improved HIV-phyloTSI tool has the potential to become a stepping stone in transforming HIV epidemiology in areas with generalised epidemics. This work would not have been possible without data contribution from multiple cohorts and consortia and highlights the importance of collaboration and data sharing in the area of HIV research.

Methods

Sample collection and sequencing

Samples were collected from venous blood of HIV-1 viraemic individuals, and 0.5 ml of plasma was used for sequencing. All samples were sequenced using veSEQ-HIV [35] except samples from MRC Uganda/UVRI which were generated by pooling four overlapping PCR amplicons as described in Gall 2012 [45].

Bioinformatic processing

Sequence reads were filtered to remove human and bacterial sequences using Kraken [46], and assembled into contigs using SPAdes v3.10.1 (setting–meta) [47]. The resulting contigs were aligned to a curated alignment of 165 representative HIV genomes from the LANL HIV database [48] to identify HIV contigs and generate a consensus sequence, followed by mapping the HIV reads onto the consensus. HIV genomes were assembled with shiver v1.3 [49], which generated a custom consensus sequence for each HIV sample. Shiver was run with Picard deduplication enabled [50], to eliminate duplicate reads that arise during the sequencing process. The final BAM files and base frequencies were coordinate-translated by shiver to bring them into alignment with the standard HXB2 HIV reference genome, RefSeq accession NC_001802. The reference alignment and primer sequences are available from the shiver GitHub repository [51].

Estimation of within-host diversity (MAF)

Within-host diversity was estimated as the cumulative minor allele frequency (MAF) at each genomic position:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\text{MAF }} = \, \left( {{\text{depth }}{-}{\text{ number of reads supporting the majority base}}} \right) \, /{\text{ depth}} $$\end{document}

at each position, where depth was the number of unique (deduplicated) reads observed at that position. MAF values at first and second codon position were termed ‘MAF12c’, and MAF at third codon position was termed ‘MAF3c’.

Estimation of within-host divergence (LRTT)

To estimate the extent of within-host divergence, each individual’s HIV sub-population was examined in a series of overlapping windows positioned every 10 bp along the entire length of the HIV genome, excluding the terminal repeat regions. For each 250 bp window, a maximum likelihood phylogeny was estimated using IQ-TREE [52] with the GTR + F + R6 model (generalised time reversible model with FreeRate with six categories of rate variation). Trees were processed in phyloscanner [32] with the settings -sks -ow -rda -swt 0.5 -amt -sat 0.33 -rcm -blr -pbk 15 -rtt 0.005 -rwt 3 -m 1E-5, and all statistics relating to the depth of the tree were extracted from the output of that package. Statistics related to tree depth and branch lengths were assessed for strength of correlation with the known time since infection (TSI) of the training data. These included LRTT, reported by phyloscanner as the field ‘normalised.largest.rtt’ within its patStats.csv output file; overall root-to-tip distance (‘normalised.overall.rtt’) and pairwise distance (‘subgraph.mean.patristic.distance’). The LRTT variable corresponds to the maximum evolutionary distance from the most recent common ancestor of the virus within the host, as observable in that window, normalised by genetic divergence at this genomic locus in the global HIV phylogeny. LRTT values at each window centre were collected for each sample. In addition, two other statistics that relate to the phylogenetic estimates were collected: ‘solo.dual.count’, representing the probability that the sample has come from a dual infection, and ‘tips’, representing the number of tips in the given window.

Generation of aggregate statistics (feature engineering)

MAF3c, MAF12c and LRTT were calculated for the centre of each 250 bp genomic window, excluding windows with known drug resistance mutations, and averaged for each major gene (gag, pol, gp120 and gp41) and across the whole genome. The mean number of tips was likewise calculated per gene and for the whole genome. A single genome-wide measure of number of windows supporting dual infection was generated by taking the mean of phyloscanner variable ‘solo.dual.count’ over all genomic windows, for each sample.

Imputation of missing data

Where data were missing, for example because there were insufficient reads to determine the tree for that window, we tested two imputation strategies: zero-filling (on the assumption that lack of variation resulted in absence of an LRTT value), versus a more complex strategy of using the K-nearest neighbours as implemented in TensorFlow, with the python package fancyimpute [53]. In order to enable robust calculation of Euclidean distances by fancyimpute, the features were first standardised to a mean of zero and standard deviation of 1. To guard against over-imputing, which can cause model over-fitting, we excluded windows where over 40% of samples had missing data, and excluded samples over 40% of windows had missing data. In addition, windows that included the position of the amplicon HIV-1 primers used for sequencing in the MRC cohort were excluded, as variation at these positions would be expected to be uninformative in the amplicon data. This resulted in a dataset containing 527 samples, with data at 820 genomic windows. We found that both imputation strategies had performed similarly on these samples, and chose KNN as the more robust method for the final model.

Regression method selection

To select a regression method for predicting the midpoint time since infection (TSI), three different strategies were tested: ordinary least squares regression, gradient boosted regression, and random forest regression, as implemented in the scikit-learn package [54]. The midpoint is commonly used for estimating incidence, an alternative measure would have been a random-point estimate [55]. Models were fitted using combinations of the MAF3c, MAF12c and LRTT aggregated predictors (Table S1). For each combination of predictors, the dataset was split into 75% training: 25% test data, and performance of each regression method was assessed as the r^2^ on the same fold of test data, for each combination of features. This procedure was repeated 10 times for cross-validation. Maximum depth of decision trees was constrained to 7 to prevent overfitting, and the random forest contained 1000 decision trees.

Identification of most informative windows

The LRTT, MAF3c and MAF12c values within all genomic windows were individually assessed in univariate OLS regression models for prediction of TSI. The LRTT and MAF3c values at the most informative windows, defined as r^2^ of 0.3 or above, were used as one of the feature combinations for model selection (‘LRTT_MAF3c_topwin’).

Feature selection

A table of possible feature combinations was generated for testing the predictive power of genome-wide values of MAF12c, MAF3c and LRTT, with or without the additional phylogenetic statistics of dual count and tip count. Feature combinations that used all or a subset of individual genomic windows were labelled as “FULL”, while aggregated features (averaged across each major HIV gene一gag, pol, gp120 and gp41一and for the whole genome) were labelled “MEANS”. A binary indicator variable for the sequencing method (amplicons or veSeq) was always included. For each feature combination, predictive power was assessed with k-fold cross-validation. Random forest regression models, set up as described in ‘Regression method selection’, were trained on 75% of the data, leaving 25% as the test dataset; this procedure was repeated 20 times for cross-validation, each time assessing performance on the test dataset using the model r^2^ on test data, precision (width of 95% CI of the distribution of estimates returned from the 1000 decision trees within the forest), and mean absolute error (absolute difference between the expected and estimated TSI). Additionally, the expected accuracy and false recency rates were calculated, respectively, as the proportion of decision trees in the model that correctly returned estimates that fell below/above 1 year, and estimates below 1 year when the known midpoint TSI was above 1 year. Feature sets were preferred if they had a higher r^2^, accuracy and precision; where values were similar, the feature set with the lower expected false recency rate was chosen. The final feature set (designated ‘MEANS_feats_LRTT’) comprised: mean LRTT in each of gag, pol and gp120, MAF3c in gag and gp41, MAF12c in gp41, and the mean number of tips in each of gag, gp41, and gp120.

TSI estimates for all samples

A leave-one-out strategy was used to obtain estimates of TSI for all samples, iterating over the full sample set, each time dropping the sample of interest, building a new random forest regression model on the remaining samples using the MEANS_feats_LRTT feature set, and obtaining a point estimate together with the full distribution of estimates from each of the 1000 decision trees in the forest. A separate random forest regression model was trained on the mean absolute errors from the base model, to generate 95% prediction intervals around the point estimates.

Performance assessment by subtype

To check for evidence of subtype-dependent bias in model performance, we compared the mean absolute error (MAE) for all samples aggregated by subtype, after adjusting for TSI to account for the increase in MAE with increasing TSI. We fitted a linear model of MAE by TSI (MAE ~ TSI) and compared the residuals by subtype, using Tukey’s range test for multiple comparison of means as implemented in the Python statsmodels library (statsmodels.stats.multicomp.pairwise_tukeyhsd), with alpha set to 0.05.

Model performance in a simulated population

We simulated a population sample of 1000 HIV-infected individuals under the assumptions that the mean time to viral suppression in this population was 3 years. The duration of infection of individuals in this population with TSI w was modelled as 1−e^−0.3w^. The predicted fraction of infections with estimated TSI up to w in this population, and the proportion of these that were incorrectly classified, were calculated using the false and true recency rates for the model at values of w between 3 and 36 months.

Application to an independent dataset: PopART population cohort

The HPTN 071–02 Phylogenetics ancillary study to the HPTN 071 (PopART) trial collected samples from HIV-positive study participants in nine communities in Zambia between 2014 and 2019 [20]. Unused samples from vials collected to assess the main trial outcome were sequenced using veSeq-HIV. Sequences were assembled using shiver, and MAF values generated in the same way as for the training data. Sequences were batched in randomly allocated groups of 100 to build trees and obtain LRTT values, which were then used as inputs for the HIV-phyloTSI predictive model. Model outputs were TSI and estimated prediction interval. For the subset of 204 samples for which a last negative test date was available, midpoint predictions from HIV-phyloTSI were compared with midpoints of the known seroconversion interval, as was done for the training data.

Supplementary Information

Supplementary Material 1.

Bibliography7

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1HPTN 071-02 Study Protocol (2017), (available at https://www.hptn.org/sites/default/files/inline-files/HPTN%20071-2%2C%20Version%202.0%20%2807-14-2017%29.pdf).
2E. Lundgren, E. Romero-Severson, J. Albert, T. Leitner, Combining biomarker and virus phylogenetic models improves epidemiological source identification. bio Rxiv (2021), p. 2021.12.13.472340.10.1371/journal.pcbi.1009741 PMC 945587936026480 · doi ↗ · pubmed ↗
3HIV-phylo TSI: Estimate time since infection (TSI) from HIV deep-sequencing data (Github; https://github.com/BDI-pathogens/HIV-phylo TSI).
4Picard, (available at https://broadinstitute.github.io/picard/).
5Shiver: Sequences from HIV Easily Reconstructed (Github; https://github.com/Chris HIV/shiver).
6A. Rubinsteyn, fancyimpute: Multivariate imputation and matrix completion algorithms implemented in Python (Github; https://github.com/iskandr/fancyimpute).
7Scikit-learn: machine learning in Python—scikit-learn 0.16.1 documentation, (available at http://scikit-learn.org/).