Newly designed amplicons-based method for near-full-length genome (NFLG) sequencing of HIV-1 group M recombinant forms
Roy Moscona, Tali Wagner, Miranda Geva, Efrat Bucris, Oran Erster, Neta S. Zuckerman, Orna Mor

TL;DR
This paper introduces a new method for sequencing nearly the full HIV-1 genome from blood samples, enabling detection of recombination events across the virus.
Contribution
A novel amplicon-based method for near-full-length HIV-1 genome sequencing is introduced, optimized for clinical samples and recombination analysis.
Findings
The method uses 32 primer pairs in two pools to generate overlapping fragments covering most of the HIV-1 genome.
The method achieved better genome coverage compared to PCR-free metagenomic sequencing.
The approach successfully identified recombination events in clinical samples from Israel.
Abstract
Over the years the spread of HIV-1 across the globe resulted in the creation of multiple subtypes and new recombinant forms (RFs). While the pol gene region of the HIV-1 genome is used for resistance mutations analysis and initial detection of RFs, whole genome sequencing analysis is required to determine recombination events across the viral genome. Here, we present a newly designed robust near-full length genome (NFLG) sequencing approach for the sequencing of HIV-1 genomes, out of clinical whole blood samples. This method has been successfully tested for various HIV-1 subtypes and RFs. The method is based on an in-house developed set of 32 pan-genotypic primer pairs, divided into two pools, each containing 16 primer pairs covering the entire HIV-1 genome. Two parallel multiplex PCR reactions were used to generate 32 overlapping DNA fragments spanning the HIV-1 genome. Nextera XT…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3- —Tel Aviv University
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHIV Research and Treatment · HIV/AIDS drug development and treatment · Cytomegalovirus and herpesvirus research
Introduction
The global HIV-1 pandemic can be attributed to HIV-1 group M (major). Over the years, as the HIV pandemic spread globally within the human population, an increase in viral genetic diversity has been generated, creating a split of group M into several different viral subtypes (A–D, F–H, J, K, L). The leading factors for viral diversity are the high mutation rate of the virus and recombination events, including intra and inter subtypes, which lead to the creation of new recombinant forms (RFs) [1, 2]. Circulating recombinant forms (CRFs), are RFs identified in at least three epidemiologically nonrelated individuals, while unique recombinant forms (URFs), are patient specific.
Since sequencing of the pol gene of the HIV-1 genome (mainly the protease {PR}, reverse transcriptase {RT}, and integrase {INT} genes) is still highly recommended before starting anti-retroviral treatment (ART) treatment, many studies utilize these sequences to define RFs [3, 4]. However, regions other than the pol gene are not routinely included when analyzing viral sequences, obtained from whole blood samples. For the characterization of recombination events across the viral genome, full or near full-length genome (NFLG) sequencing methodology are utilized. Commonly, PCR amplified DNA fragments, varying in size and length are produced to cover the entire viral genome. These amplicons serve as input for sequencing, either Sanger or Next-Generation Sequencing (NGS). As we have shown before PCR products of HIV-1 viral genes can be sequenced in either platform [5].Alternatively, metagenomics, which do not require amplification of the viral genome, could be used. However, the latter demands high resources per sample and is not feasible for large scale clinical settings. Previously, several different methods for HIV-1 NFLG sequencing were developed, some also capture RFs [6]–[9].
Here, we present a newly designed robust, relatively simple approach, for a pan HIV-1 NFLG sequencing protocol based on multiple short PCR products. This method was developed to capture the clinically relevant HIV-1 group M pure subtypes and CRFs circulating in Israel and in other developed countries. To validate our newly designed method, we compared our NFLG results to the sequencing results of a PCR-free metagenomic method.
Materials and methods
Sample selection
In a previous study by the research team of the National HIV Reference Laboratory (NHRL), HIV-1 CRFs and URFs in Israel were defined, based on HIV-1 PR and RT partial sequences for people living with HIV-1 (PLWH) diagnosed between 2010 and 2018 [3]. Here, plasma samples extracted from whole-blood, for33 patients suspected of representing different CRFs and URFs (and another 2 samples of known HIV-1 pure subtypes served as control, totaling in 35 samples), were used for the assessment of a new NFLG sequencing method.
Selection of primers for multiplex PCR
The LANL database (https://www.hiv.lanl.gov/components/sequence/HIV/search/search.comp) was searched for full HIV-1 sequences (sequence length > = 9000 bp) that were sampled between 2000–2019 aiming to generate a diverse input dataset for primer selection. To narrow down the search results, only HIV-1 subtypes and RFs known to be circulating in Israel were selected. These included different HIV-1 group M pure subtypes (B, C, A1, D, G, A6, F1, H, U) and RFs (AE, AG, CRF30_0206, BC, A6B, A5U) and complex forms (cpx) (06, 45, 27) (Supp. table 2). The query resulted in 291 HIV-1 genome sequences, of which 118 representative sequences for each HIV-1 pure subtype, RF, cpx (above-mentioned) were selected. To facilitate the task of designing new primers, 291 sequences were further reduced to 118 sequences. The 118 sequences were randomly selected out of the 291 HIV-1 whole genome sequences, to represent the genetic diversity of HIV-1, yet the number of sequences from each viral clade was proportionally selected to match its population prevalence among PLWH in Israel. This dataset was further divided into two sets of sequences: a training set and a testing set. The training set contained newer isolates, collected between the years 2003–2019 (n = 76, Supp. table 2.1), while the testing data set contained older sequences isolated between the years 2000–2002 (n = 42, Supp. table 2.2). The training dataset served as a template for designing new primers, while the testing dataset was utilized virtually, for in silico PCR analysis, to ensure the designed primers were suitable, before synthesizing the primers for in vitro multiplex-PCR. See supp Table 2 for the accession numbers of FASTA sequences used. The generated dataset of HIV-1 sequences was aligned to the HXB2 reference genome by Geneious Prime (Dotmatics, Boston, MA, version 2022.2.2, Build 2022–08-18) default mapper using default settings. The primal-scheme [10] tool (https://primalscheme.com/) was used to identify conserved regions across the HIV-1 genome in a sliding windows of ~ 400 bp (Fig. 1). Primers were constructed to enable amplification of the whole viral genome in two separate PCR reactions divided into two pools of overlapping fragments, varying in length with an average fragment size of 368 bp. The primal-scheme primers design was further edited using Geneious Prime, to incorporate ambiguous nucleotides and adjust the genomic location of each primer, whilst keeping optimal biochemical features and maintaining an appropriate distance between PCR product fragments yet keeping some overlap between them. The two pools of primers and their locations in HXB2 are presented in Supp. Table 1.Fig. 1. Schematic mapping of the primers covering HIV-1 HXB2 genome. The general outline of the primers across the HIV-1 reference HXB2 genome designed using primal-scheme [9] is shown. The primer positions are colored in bright red; the amplicons are shown in darker shaded red. The pink line under the genomic location represents the GC content of the corresponding genomic region
HIV-1 amplification and DNA library preparation for NGS
For the newly designed amplicons-based HIV-1 NFLG, first HIV-1 RNA was extracted from plasma of whole blood samples using Nuclisense protocol (BioMerieux), complementary DNA (cDNA) was synthesized using SuperScript III first strand kit (Thermo Fisher Scientific, Waltham, MA). Two multiplex-PCR reactions were performed in parallel using Q5® High-Fidelity DNA Polymerase (New England Biolabs, MA), in a ready-to-use master mix (only the primers, designed in this study {Supp. table 1}, were added) in two test tubes, each containing a different primer pool (pool 1 or pool 2(.The PCR reaction total volume was 22.5µL with the following reagents: (1) Q5 Hot Start High-Fidelity 2X Master Mix {12.5µL}, (2) Primers Pool in a concentration of 10 µM (Reaction #1 with pool 1, Reaction #2 with pool 2) {4 µL}, (3) cDNA {5µL}, (4) Nuclease-free water {1 µL}. The reaction thermal profile was as follows: (1) 98 °C (30 s) {1-cycle}, (2) 98 °C (30 s), 65 °C (5 min) {30 cycles}, (3) 4 °C {hold}. These unique Multiplex PCR conditions have been used before for the full length genome sequencing of the Zika virus [11] and more recently for SARS-Cov-2 (COVIDSeq) [12], here we have found these condition to be optimum for HIV-1 NFLG sequencing, considering different primers melting temperatures (Tm), DNA polymerase functionality and amplicons length. Next, the two pools were combined in equal molarity and diluted. This pool of the two multiplex PCR reactions was used as input for the NGS DNA library preparation protocol the Nextera XT protocol (Illumina, San Diego, CA).
For the metagenomics (virome sequencing) protocol four samples were selected and HIV-1 RNA was extracted as described before. However, in this reaction the SMARTer Stranded RNA-Seq Kit (Takara Bio, San Jose, CA) was used to generate cDNA that served as input for the same NGS DNA library preparation protocol.
For both methods (amplicons and metagenomics), the resultant libraries were purified with AMPureXP beads (Beckman Coulter, Brea, CA). Library concentration was measured by Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, MA). Library validation and mean fragment size (~ 300 bp) was measured by TapeStation4200 (Agilent, Santa Clara, CA) with DNA High-Sensitivity D1000 (Agilent, Santa Clara, CA). Following dilution to 4 nM equal molar quantities of the libraries were pooled, denatured and diluted to final concentration of 15 pM. The final DNA library was subjected to paired-end sequencing (2X100bp) either on NovaSeq (Illumina, San Diego, CA), for metagenomics, or MiSeq (Illumina, San Diego, CA), for the new amplicons-based methodology.
Bioinformatics analysis
Briefly, the processing of HIV-1 sequences as raw FastQ files was performed using an in-house bioinformatics pipeline which utilizes the following steps: FastQ files underwent quality control assessment using FastQC and MultiQC [13] and low-quality sequences were filtered using trimmomatic [14]. Sequences were mapped to the HIV-1 reference genome (HXB2, GenBank: K03455.1) using BWA-mem [15]. BAM files were sorted and indexed and coverage and depth of sequencing was calculated from sorted BAM files using the SAMtools suite [16]. Consensus FastA sequences were assembled for each sample using iVar [17], with consensus set to 60% and positions with < 5 nucleotides determined as not significant and were not covered. Multiple alignment of all the HIV FastA sequences and the reference sequence was performed with MAFFT (V7.526) [18]. Full details of the used pipeline can be found in https://github.com/NetaZuckerman/UPv/blob/main/pipelines/generalPipeline.py.
Analyses of HIV-1 recombination forms with phylogeny, HIV-BLAST, RIP, SimPlot + + and jpHMM
Phylogenetic trees were constructed with the neighbor-joining (NJ) algorithm, while implementing a 1000- bootstrap repetition, used for branch assessments. Using NJ allowed us to keep as much of the NFLG sequence, while avoiding possible loss of recombination signal during the alignment curation step prior to tree plotting, as the algorithm de-gaps the alignment before drawing the tree, yet we cannot rule out a possible reduction of the phylogenetic signal in some poorly aligned genomic regions. Four different phylogenetic analyses were performed. First, a phylogenetic tree based on NFLG sequences was built. This phylogeny approach enabled the characterization of the dominant viral clade of recombination (major parent). Next, an additional three independent phylogenetic trees were constructed, one for each viral gene: gag, pol, env. This approach was chosen to identify specific recombination events in the main three viral genes, aiming to identify minor parents. All phylogenetic trees were constructed with known pure subtypes and RFs as reference sequences. This enabled the classification of each sample to an appropriate HIV-1 viral clade. HIV-BLAST was used to assess the similarity to known public repository sequences. Furthermore, Simplot + + was used to identify genome wide recombination events [19]. Simplot + + is utilizing similarity plots to determine the probability of the recombination event and its breaking points across the HIV-1 genome. Simplot + + was used in BootScan mode with a 100-bootstrap repetition, using a 400 bp window with 20 bp step size with the TN93 model for substitutions. Additionally, jumping profile hidden Markov model (jpHMM) [20] and recombinant identification program (RIP available at www.hiv.lanl.gov/content/sequence/RIP/RIP.html) methods were used with default parameters. Final characterization of each CRF or URF was based on a combinatory analysis of the results obtained by all analytic tools used here, while considering each tool’s advantages and drawbacks in resolving recombination events across the viral genome.
Results
Comparison between metagenomics and the newly designed amplicons method for HIV-1 NFLG sequencing
To compare the sequencing results obtained by the newly developed amplicons method to un-amplified metagenomics (virome sequencing), four samples were selected. While the average total number of single-end reads per sample obtained by the metagenomics approach was 66,178,474 (range: 57,842,929–71,825,644) reads, only an average of 59,560 reads (0.09% of total reads) were mapped to the HIV-1 HXB2 reference genome. Yet, the relatively low number of reads mapped to the HIV-1 genome may be because this approach did not target the HIV-1 genomes specifically, but rather all RNA viruses present in a sample. One may expect this metagenomic approach to have low coverage of the HIV-1 genome, due to distribution of reads across all present viral genomes.On the other hand, for the newly designed amplicons NFLG method, only 2,588,962 (range: 1,254,560–3,099,939) total reads were obtained per sample, however on average 2,191,779 reads (84.6%, of total reads) were mapped to the HIV-1 HXB2 reference genome. Comparison of the coverage (> 5 reads per position in both methods) across HIV-1 HXB2 reference genome between the metagenomics and the NFLG amplicons methodology, revealed that the amplicons displayed a higher total coverage when directly compared. The two sequencing methods produced identical results, while surprisingly the amplicons managed to reach greater coverage and depth, sequencing new positions across the HXB2 genome, which were not reported by metagenomics (Fig. 2). Generally, among the different HIV-1 subtypes, metagenomics obtained an average coverage of 85.4% while the newly developed amplicons approach covered 90.6% of the HXB2 reference genome bases. The difference in coverage observed between samples could be attributed to diversity in viral clades. The coverage of the control sample of HIV-1 pure subtype B was 97.27% vs 95.38%, amplicons vs metagenomics, respectively, (sample 99,666, Fig. 2A), while the coverage of the other control sample of HIV-1 pure subtype C (sample 100,545, Fig. 2B) was lower (88.73% vs 81.94%). For the other two non-pure HIV-1 subtype samples, the same range was observed; sample 84,808 (Fig. 2C, 89.3% vs 74.6%) and sample 79,455 (87.65% vs 73.77% Fig. 2D), amplicons vs metagenomics, respectively. When aligning the HIV-1 NFLG consensus sequences generated by the two methods (amplicons and metagenomics), a full match was observed in common positions in all samples.Fig. 2. Comparison of HIV-1 HXB2 genome coverage between the newly designed amplicons method and metagenomics. Values (Y-axis) represent normalized number of reads mapped to the corresponding base position in the HXB2 genome (X-axis). Y-axis directionality ( +) for Amplicons and (-) for Metagenomics has no mathematical meaning and was plotted for comparison of each genomic position between the two methods. In the middle of each plot the HXB2 genome map is depicted with genomic regions/genes as follows: 5’LTR, gag, pol, vif, vpr, vpu, env, nef, 3’ LTR
In silico down sampling scaling to resolve the minimal number of reads required for maximal coverage by the newly designed amplicons method
The coverage of bases of the HIV-1 reference genome HXB2 varied between different HIV-1 subtypes. Thus, to enable a unify criteria for a pan-HIV-1 settings, a calculation was made to determine the optimal number of reads required for maximal coverage of the 9,719 bases of the HXB2 reference genome, while sequencing the maximal number of samples in parallel on a single run of a Miseq V3 kit. For the down-sampling analysis only 10 samples were sequenced for parallel sequencing, and each sample received an even number of reads. Each base-pair position was defined as covered if it had >5 reads allocated to it. Out of the 10 tested samples, HIV-1 pure Subtype B achieved the highest coverage of HXB2 bases, while other viral Subtypes and RFs (AE, AG, C, A) displayed lower full genome coverage (Fig. 3).We concluded that in the tradeoff between the number of reads assigned for each sample and coverage of HXB2 bases, the required number of paired-end reads is between 5 × 10^5^ to 7.5 × 10^5^ reads, as there was little increase in coverage beyond 7.5 x 10^5^ reads (Fig. 3). Beyond 7.5 × 10^5^ reads, additional reads did not increase the coverage significantly. Hence, we concluded that one Miseq V3 kit was sufficient to run 20 samples in parallel, while keeping adequate coverage and maximal usage of samples sequenced per kit. Additionally, all samples sequenced in this study, with the newly designed method for HIV-1 NFLG, were mapped to the reference genome HXB2, for a by base-pair comparison of HXB2 coverage (Supp. 3).Fig. 3. Reads down-sampling analysis for establishing the optimal number of reads needed to allocate for each sample. The blue lines correspond to the Y-axis on the left, representing HXB2 genome precent coverage. The red lines correspond to the Y-axis on the right corresponding to the mean sequence depth (in number of reads). The X-axis are the number of reads (K = thousand, M = million) randomly sampled. For each one of the 10 samples analyzed, a shape was assigned as shown on the grey legend, also in the legend in parenthesis next to sample number is the assigned HIV-1 subtype. The top blue line with diamond shape corresponds to sample 88,192 of HIV-1 subtype B, while bottom blue line with triangle shape corresponds to sample 69,734 of HIV-1 subtype C and A1. In the plot X-axis, the TOTAL average number of mapped paired-end reads was 1,095,890 and varied by samples and subtypes
Identification of recombination events among the NFLG sequences
To resolve the recombination events across HIV-1 NFLG sequences for each sample, several analytic tools were practiced. The tools applied here to identify RFs were phylogenic analyses, HIV-BLAST, SimPlot++ , LANL RIP, jpHMM [3, 9, 19, 20]. The subtyping tools displayed a general strong agreement between themselves, identifying a similar if not identical HIV-1-subtype for every sample. To increase the confidence level that each recombination event reported by each tool is not an artifact, we sought agreement between at least two different tools, to increase the confidence level in the resolved recombination event. SimPlot++ plots for all study samples are available in (Supp. 4).
Resolution of recombination events across HIV-1 NFLG sequences
For each sample a combinatory analysis was used to reach a final conclusion that would determine which of the HIV-1 subtypes underwent recombination (Table 1). Samples 57030 and 64288 were classified in all four tools as RF AE and resolved to be CRF01_AE. Both samples had subtype B classification in only one tool: sample 57,030 in jpHMM while sample 64,288 in the phylogenetic tree, since these observations were not consistent with at least two different tools, we concluded those as artifacts and resolved these samples as RF AE. Sample 79,109 had displayed a minor sequence of RF AE, in addition to subtypes B and F1 recombination. This finding was reported by two different tools (jpHHM and SimPlot++), hence it was concluded as true positive and this sample was assigned a subtype F1 and B and CRF01_AE recombination. Samples 65699 was resolved as subtype B and C as at least two tools reported the presence of these subtypes. Similarly, sample 89,194 had subtypes B and C finding in addition to subtype D reported by three tools including the phylogenetic analyses of the gag and pol genes, thus it was resolved as subtypes B and CD or 41_CD based on NFLG phylogeny.Table 1. Combinatory analyses for the resolution of HIV-1 subtypes recombination across the study samplesSample ID/AnalysisPhylogenetic tree (NJ) (gag,pol,env)RIPjpHMMSimPlot + + HIV BLAST Hit{HXB2bases/similarity}Phylogenetic tree (NJ)Final decision57030AE,AE,AEAEAE, BAE01_AE: AB220946, {5371/5673 (95%)}AECRF01_AE59068AG,AG,AGAGG, A1,AGSubtype 02_AG: AB485634, { 5885/6612 (89%)}AGCRF02_AG64288AE,AE,AEAEAE,AESubtype 01_AE: AB220946, {5681/6298 (90%)}01B/AECRF01_AE65699C,C,CB,CB,CC,BSubtype 07_BC: KF250372, {5641/5777 (98%)}BCBC69734(U,G),C,CC,A1,BC, G, A1C,A1Subtype C: AB485645, { 5509/6331 (87%)}CC, A170264AG,(A7,A4),(A4,A1,A)AG,A1,A3,BA2, A1AG,A3,A4,A7Subtype 02A3: AY521633, {3726/4040 (92%)}36_cpx/22_01A1AG, A70787A6,A6,A6A6,BA1, B, GA6,BSubtype A6: EF589044, { 5763/6383 (90%)}A6A6, B72786U,(U,G),GJ,G,BC, G, A1, J, BAG,J,U,GSubtype 06_cpx: AJ288982, {5279/5676 (93%)}U/32_06A6/06_cpxU, G, J74004AG,(A7,A4),(A4,A1,A)AG,A1A1, CAG,A4,A7Subtype CRF_30_0206 (AG derived) or A1:AJ508595, OP039481, {4973/5505 (90%)}36_cpx/22_01A1AG79109F1,F1,F1F1,BF1, B, AEF1,B,AESubtype B: KT427797, 5298/6078 (87%)B/F1F1, B, CRF01_AE79455AG,AG,AGAGAE, A1, G, BAG,BSubtype 02_AG: AB485635, {6250/7114 (88%)}95_02B OR 19/37/56 cpxAG80405B,B,BBBBSubtype B: AY835781, { 5754/6164 (93%)}BB80743AG,AG,AGAG,BF2AG,B,CSubtype 02B: ON421476, { 4126/4605 (90%)}95_02B OR 19/37/56 cpxAG, B80907AG,(A7,A4),(A4,A1,A)AG,A3,BA1, BAG,A4,A7,A1Subtype 02A or 02A3: KX228819 or AY521633, {2913/3377 (86%), 2318/2523 (92%)}36_cpx/22_01A1AG, A81234AG,(A7,A4),(A4,A1,A)AG,A3A1, GAG,A7,A1Subtype 02A3: AY521633, { 4036/4370 (92%)}36_cpx/22_01A1AG, A81239C,C,CC,A1C, A1C,ASubtype BC or C: HM100716 or DQ396370, {5107/5612 (91%)}CC, A182664G,(U,G),(L,J)G,H,J,A7,A1G, C, A1, JG,H,AE,J,L,USubtype G or 25_cpx: MH705134 or KY392772, {5023/5757 (87%)}27_cpx/13_cpxG, H, J82699F1,F1,F1F1,BF1, BF1,B,AESubtype 12_BF: AF385934 {5847/6255 (93%)}BF1BF183456A7,AG,A7AGA2G,AG,A3,A4,A7Subtype 02_AG: MF109709, {4516/5152 (88%)}09_cpx/A7AG, A83493AG,B,BB,AGB, A1B,AG,GSubtype B:MF109558, { 7459/8541 (87%)}03_A6B/94_cpx/132_cpxAG, B84808C,C,CA1, CC, A1, GC,B,A,AESubtype C: MF109413, {6955/8229 (85%)}CC,A185576AG,(A7,A4),(A4,A1,A)A3,AG,A1,BA1AG,A3,AE,A7Subtype 01_AE or 02A: U51189 or KX228819, {4951/5534 (89%)}36_cpx/22_01A1AG, A85779U,(U,G),GG,K,AGA2AG,J,G,USubtype 06_cpx or A1DK: FJ183725 or X04415, 4640/5309 (87%) or 4597/5299 (87%)A6/06_cpxU, G, AG86845AG,AG,AGAG,C,BA1,C, G, BAG,C,B,AE,A,GSubtype CRF01_AE: ON421476 {5023/5923 (85%}, VI26F2885 {4795/5418 (89%)}95_02B/56 cpxA, B, C, AG88067AG,(A7,A4),(A4,A1,A)AG,A1A1, BA3,AGSubtype A1 or 02_AG: OP039479 or AB286862, 4942/5564 (89%) or 4936/5555 (89%)36_cpx/22_01A1AG, A88192B,B,BBBBSubtype B: AY835754, 8769/9728 (90%)BB88864G,(D,B),GB,GG, A2, B, F1G,B,D,USubtype 20_BG: AY586544,2421/2547 (95%), 2554/2874 (89%)BGB, G89194D,D,CD,CD, C,GC,BSubtype D: AJ320484, 6064/6623 (92%)41_CD/DD, C, B89514AG,(A7,A4),AGAG,A3A1B,AG,A,A7Subtype 02_AG or A1: AB485634 or OP039481, 5005/5538 (90%)36_cpx/22_01A1AG, A95741A6,A6,A6AGA2A6,B,ASubtype A6: EF589044, 4873/5275 (92%) and 1510/1805 (84%)A6A696338B,B,BBB,A2,F1BSubtype B:KJ704791, 2588/2793 (93%) and 2488/2721 (91%)BF1/01BB99478G,(D,B),AGB,GG, A2, BG,AG,BSubtype 20_BG OR 24_BG: AY586545 or AY900574, 2383/2512 (95%) or 2432/2745 (89%)BGG, AG, B99666B,B,BBBBSubtype B JA806687/HH978650 {8697/9746 (89%)}BB100545C,C,CCC, K,BCSubtype C: AB254141, 02ZM108, {5812/6587 (88%)}CC102880(K,F1,F2),B,F1B,F1B, F1, A1F1,B,ASubtype BF1: MN485983, 5562/6067 (92%) and 1383/1737 (80%)DF/F1/BFBF1All analyzed samples in the study HIV-1 subtyping results of all computational tools used for classifying each sample to an HIV-1 subtype. On the second column of the phylogeny analysis, the HIV-1 genome was broken to the three main viral genes: gag, pol, env. Each HIV-1 subtype separated by a comma (,) in each row corresponds to each gene, respectively. For the columns of RIP, jpHMM, SimPlot + + the major parent HIV-1 subtype is written first, followed by minor parent/s. On the final decision column (shaded in red), where there are two different HIV-1 subtype written; the first indicates the major parent, while the second indicates the minor parent of the recombination event
Samples 69734, 81239, 84808 were resolved as recombination of subtype C and A1, as at least two tools displayed that identification, however the NFLG phylogeny and HIV-BLAST favored subtype C, since it was the major parent in those samples, thus we concluded a subtypes C and A1 recombination. Samples 59068, 74004, 79455 were resolved as either RF AG or CRF02_AG, as all methods (except jpHMM) agreed with that classification. Samples 70264, 80907, 81234, 83456, 85576, 88067, 89514 were resolved as RF AG and subtype A, however all (besides one) were suspected to be cpx 36 or 22_01A1, based on NFLG phylogeny. Only sample 83456 showed a different NFLG phylogeny result with resemblance to cpx09 or subtype A7. The HIV-BLAST results for these samples did not cover enough HXB2 bases (most only covered ~ 2000 bp) and the classification to subtype A in all of them seemed to match the minor parent, yet sample 88067 had a ~ 5000 bp matching for an HIV-BLAST search with subtype A1 and RF AG. Since in most tools RF AG was the dominant RF, it was concluded as the major parent, while subtype A was the minor parent.
The samples that displayed additional recombination to RF AG were 85779, 86845 with subtypes U and G and RF AG or subtypes A, B and C and RF AG recombination, whether these forms are 06_cpx or CRF01_AE (respectively), based on HIV-BLAST results or URF warrens further investigation. Sample 96338 displayed a smaller coverage across the HXB2 genome compared to the rest of the samples, thus we could not conclude if this is a true recombinant form of subtypes B and F1, so it was resolved as subtype B. On the other hand, samples 82699 and 102880 were resolved as subtypes B and F1 recombinants. Interestingly, the gag gene of sample 102880 was classified as an out group for a cluster containing subtypes K, F2, F1, displaying the advantages of a gene-by-gene phylogeny analysis and indicating the recombination event happened in the gag gene’s location. Sample 70787 was resolved as subtype B and A6 recombination, even though the phylogeny clustered only with subtype A6, RIP and SimPlot + + indicated a presence of subtype B. Therefore, the major parent was resolved as subtype A6, while the minor parent was subtype B. Sample 95741 was resolved as subtype A6, based on phylogeny and HIV BLAST results as the rest of the tools showed discrepancies and were not conclusive. Samples 80743 and 83493 were resolved as a recombination of subtypes B and RF AG, three tools (phylogeny, RIP, SimPlott + +) identified parts of subtype B and RF AG in the sequence, although these samples may be a form of cpx (19/37/56) as they clustered among them on NFLG phylogenetic tree. Additionally, sample 88864 was classified as a recombination of subtypes B and G, while sample 99478 showed subtypes B and G and RF AG, recombination. Sample 85576 displayed a complex pattern of different subtypes A recombination among all tools and was suspected to be either 36_cpx or a URF. Similarly, samples 72786 and 82664 displayed a complex pattern of subtypes U, G, J, B, G and RF AG (for sample 72786) and subtypes G, L, U, A1 (for sample 82664), however they may be a form of cpx (06 or 11/13) based on NFLG phylogeny. Samples 80405 and 88192 were resolved as pure subtype B, with no indication of recombination events by any of the tools. Samples 99666 and 100545 served as positive control for pure subtypes B and C, respectively and were resolved as such.
Discussion
Here we have shown the successful implementation of a newly designed method for HIV-1 NFLG sequencing. This method was tested with clinical samples of whole blood plasma and shown to be relatively easy to perform in a clinical laboratory setting, while simple and less laborious than previously reported methods for HIV-1 NFLG.
Previous studies have utilized HIV-1 NFLG protocols successfully, some of the previously developed methods include Gall et al. 2012 [7], which developed a pan HIV-1 primer set, which generates 4 overlapping PCR fragments to cover the viral genome, followed by Roche/454 (Life Sciences, Branford, CT) sequencing. This method showed high sensitivity and reproducibility in detecting drug resistance and minor variants, and was able to identify CRF01_AE and CRF14_BG. Moreover, Grossman et al. 2015 [21, 22] have developed a two PCR fragments method that covers the entire HIV-1 genome, however this was tested only on subtypes B and C and RFs CRF01_AE and CRF02_AG. Others, such as Hebberecht et al. 2019 [23] developed a multi-step PCR (RT & Nested) based protocol, generating two amplicons followed by sanger sequencing. Their method managed to identify successfully CRF01_AE and CRF02_AG. Additionally, Lunar et al. 2020 [8] have utilized existing methods for HIV-1 NFLG sequencing to identified 6 new URFs and 3 more potential CRFs, in a country with limited HIV diversity. With regards to HIV-1 genome coverage, our newly designed method does not fall from previous protocols in obtaining NFGL sequence of a variety of viral clades including RFs.
Our newly developed method aims to simplify HIV-1 NFLG sequencing procedure, while capturing pan-HIV-1 genotypic variants, that are circulating globally. The approach described here has enabled the identification of HIV-1 RFs including CFRs and potential URFs. However, the method introduced here is lacking full coverage of highly variable regions of the HIV-1 genome like env and repetitive regions like the LTR. For comparability purposes all samples were aligned to HXB2 reference genome, potentially, aligning each sample to a reference genome that matches its viral clade may yield higher genome coverage. The amplicons developed method introduced in this pilot study, in optimal conditions, allocating maximal number of reads per sample, surpassed metagenomics in the percentage of HXB2 bases covered, it achieved higher coverage in all 4 compared samples of different HIV-1 pure subtypes and RFs (Fig. 2). Most of the metagenomics reads were mapped to the human genome (99.9%). As ribosomal RNA is the most abundant RNA in eukaryote cells, using ribosomal RNA-depleted RNA could potentially increase the coverage of viral genomes in a sample.
With regards to the env gene classification in all used computational tools, the confidence level of that region classification is low, due to lack of full coverage. The main reason for lack of full coverage in the highly variable env region and other regions of the viral genome is the genetic variability of the samples. The newly designed primers could not capture those regions, most likely due to mismatch of primers binding sites. Another possibility for the lack of full coverage could be a high proportion of incomplete viral genomes fragments representing defective virions circling in PLWH blood. Consequentially, its classification was difficult to analyze e.g. if a sample had a different subtype classification present only for a relative short number of bases in the env gene it was ignored, especially if it did not fit with the rest of the genome classification. Additionally, some analytic tools like jpHMM had a bias towards subtype B assignment for the env region. With further regards to jpHMM, this tool does not have RF AG as reference, therefore it classified RF AG as a combination of subtype A (mostly A1) and subtype G, consequently, its classification was ignored with regards to RF AG samples only. On the other hand, HIV-BLAST results should be compared with other methods, as the parts of the HIV genome covered by HIV-BLAST results may have a bias towards the major parent in a RF and parts of the unmatched genome may contain different subtype classification, that could potentially contain the minor parent of the recombination. The analytic approach chosen for recombination determination includes HIV-1 NFLG phylogeny where the alignment of sequences was not trimmed for poorly aligned regions, since the phylogeny included a genetically diverse population of HIV-1 NFLG comprised of RFs and pure subtypes, we decided to keep the alignment intact to maintain as much of NFLG as possible. Therefore, we have used NJ algorithm that does not tolerate any gaps and de-gaps the alignment prior to tree drawing. This may have a drawback that some poorly aligned regions may reduce the phylogenetic signal and skew the results due to poorly aligned regions. However, this method is faster and requires less computational power and does not require alignment curation prior to tree plotting, thus not losing potential recombination events in the alignment curation process. Furthermore, to complement the NFLG phylogeny, individual shorter phylogenetic trees (gag, pol, env) were plotted with alignment being cut to the specific genomic regions.
Conclusions
Here, we have successfully simplified the procedure for HIV-1 NFLG sequencing, still future maintenance of the primers designed here will be needed, as new HIV-1 NFLG CRFs and URFs sequences will accumulate over the years in publicly available databases. The primers presented here (Supp. table 1) should be tested on newly sequenced HIV-1 genomes and updated accordingly, to insure maximal capture of ongoing HIV-1 viral diversity with this approach. Overall, our newly developed amplicons approach has proven itself as reliable on this pilot study, yet it should be compared on a sample-by-sample basis with other HIV-1 NFLG methods to test its robustness and on a larger scale cohort and by other laboratories to be fully corroborated as a suitable approach for HIV-1 NFLG sequencing.
Supplementary Information
Below is the link to the electronic supplementary material.Supplementary file1 (DOCX 6784 KB)
