Haplotype Inference Using Long-Read Nanopore Sequencing: Application to GSTA1 Promoter
Vid Mlakar, Isabelle Dupanloup, Yvonne Gloor, Marc Ansari

TL;DR
This study shows that long-read nanopore sequencing can accurately determine GSTA1 promoter haplotypes without inference, improving clinical haplotype recovery.
Contribution
Demonstrates the efficacy of Oxford nanopore sequencing for accurate haplotype phasing in the GSTA1 promoter region.
Findings
Nanopore sequencing achieved >90% correct haplotype recovery for SNPs within 200 bp.
Sequencing accuracy dropped to 58% for SNPs 1089 bp apart, showing distance dependence.
Hybrid haplotypes were influenced by the number of PCR cycles, not extension or annealing time.
Abstract
Recovering true haplotypes can have important clinical consequences. The laboratory process is difficult and is, therefore, most often done through inference. In this paper, we show that when using the Oxford nanopore sequencing technology, we could recover the true haplotypes of the GSTA1 promoter region. Eight LCL cell lines with potentially ambiguous haplotypes were used to characterize the efficacy of Oxford nanopore sequencing to phase the correct GSTA1 promoter haplotypes. The results were compared to Sanger sequencing and inferred haplotypes in the 1000 genomes project. The average read length was 813 bp out of a total PCR length of 1336 bp. The best coverage of sequencing was in the middle of the PCR product and decreased to 50% at the PCR ends. SNPs separated by less than 200 bp showed > 90% of correct haplotypes, while at the distance of 1089 bp, this proportion still exceeded…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3- —Fondation CANSEARCH
- —University of Geneva
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEpigenetics and DNA Methylation · Digestive system and related health · Cancer Genomics and Diagnostics
Introduction
One single nucleotide polymorphism (SNPs) can have a considerable influence on gene expression and/or protein activity. However, several polymorphisms can affect the same target and their combined influence may determine the abundance and functionality of the resulting gene product. Such SNP combinations define haplotypes that tend to be inherited together. Pharmaco-genes encode enzymes which are important for the metabolism of exogenous compounds, including many drugs on the market. Several clinically important haplotypes affecting drug metabolic activity have already been identified and the establishment of corresponding drug dose adjustment guidelines has been proven beneficial to the patients [1–3].
Glutathione S-transferases are enzymes involved in the clearing of carcinogens, therapeutic drugs, environmental toxins, and toxic endogenous products by conjugation with glutathione [4, 5]. Although there are many classes of GSTs, the most abundant enzyme of this family found in the liver is GSTA1 [6, 7]. GSTA1 is involved in the clearance of important chemotherapeutic drugs including thiotepa [8, 9], doxorubicin [10], cyclophosphamide [11], and busulfan [12]. We previously demonstrated that six polymorphisms in the promoter region of GSTA1 are associated with busulfan clearance [13, 14]. These polymorphisms are found in different haplotype and diplotype combinations which modulate the expression of GSTA1 and influence the metabolic capacity of the cells [15]. Individuals can be stratified into three different metabolizer groups, slow, normal, and rapid metabolizers, based on their genotypes, to help personalize busulfan dose administration [15].
Up to now, haplotypes in the GSTA1 promoter region are mostly determined from non-allele discriminating double-stranded DNA sequences by inference based on linkage disequilibrium between different polymorphisms reported in knowledge databases such as the 1000 genomes project [15]. However, this approach can lead to ambiguous results if two possible haplotype combinations result in the same diplotype or even erroneous conclusions in the case of previously non-recognized combinations [15–17]. Therefore, more accurate methods for haplotype phasing and allele discrimination have been developed to improve clinical diagnostics. However, the techniques available so far involve either time-consuming molecular cloning or the sequencing of several close relatives, if available [18]. These inherent difficulties considerably increase the workload and time-lapse to obtain the results which makes true phasing often impractical in routine medical practice. With the advent of next-generation technologies for DNA analysis, sequencing of single DNA molecules and phasing has become more accessible. However, standard NGS sequencing platforms, such as Illumina, produce short reads that still hamper haplotype phasing. Recently, Oxford Nanopore introduced a new platform supporting long reads single DNA molecule sequencing able to address this problem. The Nanopore technology, reviewed by Midha et al*.* [19], has already been successfully applied to the phasing of HLA and CYP2D6 alleles [20, 21]. Other long-read sequencing technologies, PacBio’s SMRT sequencing [22] and single-tube Long Fragment Read (stLFR) using DNA co-barcoding [23] could also be used for phasing because of the ability to read long reads on a level of a single DNA molecule. Despite the promising outlook of SMRT sequencing and stLFR, the Oxford Nanopore presents currently the best alternative considering the sequencing fidelity, cost, and labor to produce phase SNPs.
In this paper, we apply the Oxford nanopore sequencing technology to recover the true haplotypes of the GSTA1 promoter region using a straightforward methodology. We propose a strategy accessible to most laboratories that would be applicable to other genomic regions where haplotype phasing is needed.
Methods
Samples and DNA Extraction
We selected cell lines from Coriell Institute (USA) for their high heterozygosity at our SNPs of interest according to the 1000 Genomes phase 3 data [24] (Table 1). DNA was extracted from the LCL cell lines using a DNeasy Blood & Tissue Kit (QIAgen, Germany) according to the manufacturer’s recommendations. LCLs were cultured in Roswell Park Memorial Institute Medium (RPMI) 1640 medium (Gibco, Carlsbad, CA) supplemented with 10% fetal bovine serum (HyClone, South Logan, UT) and 1% penicillin–streptomycin (Gibco) and incubated at 37 °C, 5% CO_2_-humidified atmosphere according to the manufacturer’s recommendations. The cell pellets for DNA extraction were harvested from cells passaged less than ten times [25].Table 1LCL cell line genotypes with inferred haplotypes from the 1000G data, genotype and diplotype phasing from nanopore sequencing dataCell lineInferred haplotype from the 1000G datars3957356-52rs3957357-69rs11964968-513rs4715332-567rs4715333-631rs58912740-1142Haplotype determined from NanoporeNA06985A1B1aG|AC|TA|AT|GT|GC|GA1B1aNA07056A1B1aG|AC|TA|AT|GT|GC|GA1B1aNA07357A1B1bG|AC|TA|GT|GT|GC|GA1B1bNA11840A1B1aG|AC|TA|AT|GT|GC|GA1B1aNA12043A2B1aG|AC|TA|AT|GG|GC|GA2B1aNA12762A1B1aG|AC|TA|AT|GT|GC|GA1B1aNA12872A1B1aG|AC|TA|AT|GT|GC|GA1B1aNA12874A1B1aG|AC|TA|AT|GT|GC|GA1B1a
PCR for Amplification of GSTA1 Promoter Region, Sanger Sequencing, and Diplotype Determination
The GSTA1 promoter region encompassing the six SNPs defining metabolizer classification was amplified using Platinum SuperFi II PCR master mix (Thermo Scientific, USA) with GSTA1-1336-F 5′-TGGATCCCTCAGTTTTGTAAGG-3′ forward and GSTA1-1336-R 5′-TAAACGCTGTCACCGTCC-3′ reverse primers (Microsynth, Switzerland) at a final concentration of 0.8 μM and 20 ng of DNA in final reaction volume of 20 μL. Standard GSTA1 promoter PCRs were performed under the following cycling conditions: Initial denaturation at 95 °C for 3 min, 40 cycles of denaturation at 95 °C for 30 s, annealing at 64 °C for 30 s, extension at 72 °C for 45 s, and final extension at 72°C for 1 min.
For the testing of different PCR conditions on phasing, the duration of extension at 72 °C was decreased to 35 s, 25 s, or 15 s while maintaining the annealing at 30s. Next, we also decreased annealing time to 20 s and 15 s at 15 s extension time. Finally, we changed PCR cycles from 40 to 35, 30, 25, 20, and 15 at 20 and 120 ng of genomic DNA at each condition and at 30 s annealing and 45 s extension time. The PCR reactions were analyzed using Tapestation (Agilent, USA) with D5000 screen tape (Agilent, USA).
The genotype of all cell lines was confirmed with standard Sanger sequencing (Fasteris, Switzerland) using the PCR products produced with 20 ng of DNA and annealing and extension for 30 and 45 s, respectively, and the following primers GSTA1-1336-F, GSTA1-1336-R, and GSTA1_seq-52_F (5′-TGACGCAAAGAGGATAGCAT-3′).
The definition of the GSTA1 promoter haplotypes and their corresponding star allele nomenclature is described in a previous work [15]. Reference *A1/*A1 is assigned by default to sequences without mutations at any of the 6 SNPs tested.
Oxford Nanopore Sequencing and Analysis
Sequencing was performed using Nanopore sequencing (Microsynth, Switzerland). 25 ng of PCR products was used for each sequencing reaction.
Guppy version 6.4.6. was used for base calling, demultiplexing, and trimming of the barcodes (Oxford Nanopore Technologies Ltd., 2000). Quality control and summary reports for nanopore reads were generated using nanoq version 0.10.0 [26]. The quality of sequencing reads was evaluated using FastQC version 0.12.1 [27]. The reads were aligned to the reference sequence of chromosome 6 in the human genome (GRCh38 assembly) using MiniMap2 [28], with our GSTA1 promoter region PCR extending from genomic coordinate 52803791 to 52805125 (GRCh38.p14, Chr:6, GSTA1). Variant calling was performed with clair3 [29] and haplotype phasing with WhatsHap [30].
Estimating the Proportion of Reads Carrying True Haplotypes
We quantified the proportion of reads carrying the correct haplotypes for all pairs of heterozygous SNPs in the GSTA1 promoter region. Briefly, we first extracted the corresponding nucleotides from the multiple alignments generated with minimap2 and sorted them with samtools version 1.17, using pysam [31]. We then estimated the proportion of reads that carried the correct haplotypes for pairs of variants using a series of steps in R version 4.3.2 [32]. Briefly, we loaded into R the reads ids with the carried nucleotide at each SNP location, merge all possible combinations and estimated the proportion of correct haplotypes out of all observed haplotypes. Before estimating this proportion, we discarded the reads carrying a nucleotide whose frequency was below 1/3 of the total number of reads covering that position.
Results
Nanopore Sequencing and Haplotype Phasing
On average, we sequenced 820 reads per sample, with an average length of 813 bp, which correspond to 60.9% of the length of the PCR product (i.e., 1336 bp) (Table 2). A large proportion of the reads (> 98%) were mapped to the genomic reference sequence of the GSTA1 promoter region (Table 2). We show in Fig. 1 an example of mapping for the sequencing reads corresponding to one of the samples. We see a random distribution of the sequencing errors (Figs. 1, 2A) in the genomic region of interest. Also, we see 100% sequence coverage in the middle of the PCR product with the coverage decreasing to roughly 50% at the 5′ and 3′ of the PCR product (Fig. 2B).Table 2. Oxford nanopore sequencing read mapping statisticsCell lineNumber of readsRead length: mean and range in bp% from total lengthNumber of reads mapped% of reads mappedNA069851039822.6 [37–9184]61.5103199.23NA07056505789.4 [136–9235]59.150299.41NA07357664813.0 [34–5462]60.965698.80NA11840693821.4 [43–6625]61.564893.51NA12043881816.9 [127–5266]61.186297.84NA127621108812.3 [40–9171]60.9109098.38NA12872776799.2 [83–4626]59.876698.71NA12874892835.5 [43–9417]62.588098.65Mean820813.860.980498.05Fig. 1Example of Nanopore sequencing result alignment. The figure shows the alignment to the GSTA1 promoter region in the GRCh38 reference genome of the reads obtained from Nanopore sequencing for one LCL cell line (NA07357). The view was obtained with the Integrative Genomics Viewer (IGV). The horizontal gray bars correspond to sequencing reads. The colored bars show differences between the nucleotide carried by reads and the corresponding base in the reference genomeFig. 2Control quality for Nanopore sequencing data. A Proportion of reads carrying the reference allele (NA07357). B Relative coverage over the genomic region (GRCh38.p14, Chr:6, 52803791 to 52805125) under investigation (NA07357)
The results obtained from the Nanopore sequencing were in complete agreement with the ones from the Sanger sequencing and 1000 genomes data (Table 1).
Estimating the Proportion of Reads Carrying True Haplotypes
We estimated the proportion of sequencing reads carrying the true combination of variants. We show that this proportion decreases as a function of the distance between two loci (Fig. 3A). For a distance below 200 bp, > 91% of the reads carry the correct SNPs combination. This proportion still exceeds 58% at a distance of 1′089 bp.Fig. 3. Phasing quality control. Proportions of reads with correct pairwise phasing for heterozygous SNPs in the GSTA1 promoter region. In A, we show how this proportion varies according to the distance between SNPs and genotypes: A1B1b are shown in red, A1B1a in blue, and A2B1b in green. In B, we show how this proportion varies according to the distance between SNPs and PCR conditions for the NA07357 cell line. We prepared a range of PCR products by gradually decreasing the annealing time (A) from 45 to 35 s, 25 s, or 15 s at a fixed extension (E) time of 30 s. We also decreased the extension time (E) from 30 to 20 s at the 15 s annealing time. In C, we show how DNA content and number of cycles influence the presence of hybrid haplotypes. We analyzed PCR products of 25, 30, and 35 cycles because they produced enough DNA for analysis. PCRs generated with 25 cycles produced fewer hybrid haplotypes than those with 30 or 35 cycles. Initial DNA content has little influence on the amount of hybrid haplotypes
Influence of PCR Quality on Phasing
We next sought to assess the impact of PCR reaction conditions on phasing using Oxford nanopore sequencing. Indeed, non-fully extended PCR products might act as pseudo primers and anneal to either allele in the next cycle, resulting in hybrids producing sequences containing artificial haplotypes. To evaluate the risk of error resulting from the generation of hybrids, we prepared a range of PCR products by altering the extension (E) and the annealing (A) times (Fig. 3B) and the altering number of cycles and content of DNA (Fig. 3C). The analysis by TapeStation showed no effect on the overall quality of the PCR products when changing reaction conditions. We observed the influence of the number of cycles and the initial quantity of DNA on the final PCR quantity. Fifteen and twenty cycles did not produce enough DNA for Nanopore sequencing. The analysis of Nanopore sequencing showed no effect of extension and annealing time on the production of hybrid haplotypes. On the other hand, the number of cycles had a significant impact on the presence of hybrid haplotypes. Taken together, for a robust PCR reaction, the duration of annealing and extension time did not affect the frequency of crossovers found in the PCR products (Fig. 3B); however, the number of cycles did (Fig. 3C).
Discussion
Here, we present a new strategy for phasing the GSTA1 promoter haplotypes. We show that when using the Nanopore long-range sequencing technology, we could accurately determine the haplotypes of all tested samples and demonstrate the robustness of the technique for this specific application. In the case of GSTA1 promoter genotyping, proper allelic discrimination is particularly important for the A1B1a/A3B2 haplotype pair that cannot be distinguished by standard Sanger sequencing. The results of Nanopore phasing confirmed that all six ambiguous samples carried the A1B1a genotype in agreement with the rare occurrence of the A3 haplotype in human populations. GSTA1A3 has been observed only in the Indian population of Gujarati at the frequency of 0.005. The haplotypes *A1 and *B1a have much higher frequencies in all other populations except the African population where *A3 has not been seen [15].
Previous studies on the role of GSTA1 promoter polymorphisms on the clearance and tolerability of busulfan treatments used mostly inferred haplotype/diplotype combinations [33], although we developed the molecular cloning strategies to experimentally determine phased haplotypes [15]. Determination of the diplotypes based on variant allele frequencies favors the A1B1a haplotype over the A3B2 one. According to our current results, most individuals are indeed likely to carry the A1B1a genotype, although the alternative should not be excluded. In terms of clinical impact, while the proper distinction between A1B1a and A3B2 might help refine the interpretation of the results and help understand the underlying transcriptional mechanisms, none of those diplotypes corresponds to an extreme metabolizer status associated with worse treatment outcomes [33].
The phasing was performed through direct sequencing of single DNA molecules corresponding to the GSTA1 promoter region amplified by PCR. The technique used here improves the process of phasing substantially compared to classical methods based on molecular cloning or the analysis of related individuals. Phasing is also important for other pharmaco-genes such as TPMT, where the 1/3A diplotype (normal metabolizer) or a 3B/3C (slow metabolizer) cannot be discriminated using non-allele specific techniques [16]. A proper TPMT-based stratification of patients is capital as the assignment of a patient to a wrong metabolizer group could potentially lead to severe consequences, due to improper drug dosage [17]. The Oxford Nanopore technology had been successfully used, in the past, to phase haplotypes in the HLA region genes or the CYP2D6 gene [20, 21].
We also tried to probe the potential limitations and pitfalls of the technique used in this paper. The single nucleotide sequencing errors appear to be distributed randomly. The quality of the DNA product obtained by PCR amplification is expected to be the main driver in this type of haplotype phasing. If the PCR product is not fully extended during the initial reaction cycles, the formation of hybrid crossovers might occur. The earlier such hybrid cross-over appears in the reaction, the higher the final proportion of those hybrids will be in the PCR product. Therefore, cycling conditions, particularly extension time during the PCR amplification, might be important. However, we showed that in our conditions, neither an extensive reduction of the extension nor annealing time influenced the formation of hybrids. Regarding GSTA1 promoter amplification, the haplotype rate remained stable at about 70% of true haplotypes even at distances above 600 bp. Moreover, reliable phasing being achievable at distances up to 600 bp, it is feasible to reconstruct correct haplotypes whenever the amplified region is sufficiently polymorphic. Optimal PCR primer design for the placement of phased SNPs in PCR products might also help as more central locations of PCR products have better sequencing coverage than the 3′ and 5′ ends. In addition to annealing and extension time, we analyzed the influence of the input DNA amount and PCR cycle number as other recent publications showed it can influence the hybrid haplotype formation [34–36]. Our results suggest that additional cycles after reaching the PCR plateau phase produce hybrid haplotypes and that additional cycles after the exponential phase of PCR should be avoided to minimize the generation of hybrid haplotypes. Initial DNA content did not influence the generation of hybrid haplotypes. Our current protocol, nevertheless, proved to be sufficiently robust to reliably phase GSTA1 promoter haplotypes. The techniques, sequencing results, and phasing quality in these publications are similar to ours [34–36].
In conclusion, we developed a simple and rapid strategy for allelic discrimination of GSTA1 promoter haplotype that can be easily implemented in most laboratories and is better suited for use in routine clinical practice than currently used technologies. The technique proved robust for the GSTA1 promoter and could be adapted to other haplotypes located in a genomic range that can be robustly amplified by PCR. Inter-SNP distances and frequencies are important determinants of correct phasing and can influence phasing strategy.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Andrews, S. (2010). Fast QC: A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- 2Team, R. C. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
- 3Ansari, M., Curtis, P. H., Uppugunduri, C. R. S., Rezgui, M. A., Nava, T., Mlakar, V., Lesne, L., Theoret, Y., Chalandon, Y., Dupuis, L. L., Schechter, T., Bartelink, I. H., Boelens, J. J., Bredius, R., Dalle, J. H., Azarnoush, S., Sedlacek, P., Lewis, V., Champagne, M., …, Krajinovic, M. (2017). GSTA 1 diplotypes affect busulfan clearance and toxicity in children undergoing allogeneic hematopoietic stem cell transplantation: A multicenter study. Oncotarget,8, 90852–90867.10.18632/oncotarget.2031 · doi ↗ · pubmed ↗
