A nanopore-based HIV-1 reference epitranscriptome
Michael S Bosmeny, Adrian A Pater, Li Zhang, Lydia L Larkai, Beverly E Sha, Zidi Lyu, Masad J Damha, João I Mamede, Keith T Gagnon

TL;DR
This paper creates a detailed map of RNA modifications in HIV-1 using nanopore sequencing, offering a benchmark for future studies.
Contribution
The novel contribution is the first reference epitranscriptome for HIV-1 using nanopore technology with nucleotide-resolution modification mapping.
Findings
N6-methyladenosine and other RNA modifications were mapped at nucleotide resolution in HIV-1.
Modification patterns remained consistent under antiretroviral therapy and in HIV-1 virions.
m6A levels were conserved across circulating HIV strains in people living with HIV.
Abstract
Post-transcriptional modifications to RNA, which comprise the epitranscriptome, play important roles in RNA metabolism, gene regulation, and disease pathogenesis. However, mapping modifications and characterizing their function is often challenged by a lack of consensus on their presence and significance. The availability of reference epitranscriptomes to benchmark data would significantly advance epitranscriptomic studies. Toward this goal, we established a reference epitranscriptome for human immunodeficiency virus 1 (HIV-1), an important human pathogen. We sequenced a model HIV-1 genome from infected T cells using the latest nanopore technology. A sense and novel preliminary antisense HIV-1 epitranscriptome were generated, where N6-methyladenosine, 5-methylcytidine, pseudouridine, inosine, and 2′-O-methyl modifications were mapped by multiplexed base calling at nucleotide resolution.…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4| Modification | modkit ID |
|---|---|
| m6A | a |
| m5C | m |
| Pseudouridine | 17802 |
| Inosine | 17596 |
|
| 69426 |
|
| 19227 |
|
| 19228 |
|
| 19229 |
- —Microbiology and Infectious Diseases
- —NIH10.13039/100000002
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRNA modifications and cancer · RNA and protein synthesis mechanisms · interferon and immune responses
Introduction
Epitranscriptomics is the study of post-transcriptional RNA chemical modifications and their functional effects on RNA in biology, disease, and beyond [1]. While epitranscriptomics is an established field of investigation, it is constantly evolving with the characterization of new modifications in unexpected places [2–4], methods to identify them [1, 5–7], and pathways they impact [1, 8, 9]. Nonetheless, experimental and bioinformatic tools like the m^6^A-Atlas [10, 11] or the MODOMICS database [12, 13], can provide useful reference points across studies. Viral RNAs, e.g. often possess their own unique epitranscriptomes that can either control viral pathogenesis, such as viral gene expression and replication, or modulate host pathways, like the innate immune response [8, 14–16].
The RNA modifications of human immunodeficiency virus 1 (HIV-1) represent one of the most well-studied viral epitranscriptomes [8, 17]. Most sequencing-based studies have focused on N^6^-methyladenosine (m^6^A) [16, 18–23]. Other modifications, including 5-methylcytidine (m^5^C) [24, 25], N^4^-acetylcytidine (ac^4^C) [26], pseudouridine (psi or Ψ) [27], inosine [28], and 2′-O-methylation (2′-O-Me or N_m_) [29], have also been investigated but are less understood. It is currently difficult to establish a clear consensus on the RNA modification landscape of HIV-1, especially beyond m^6^A, due to the diversity of experimental methods [30]. Several groups have mapped suspected regions for a single modification type at a time, often by an antibody-based method, but these do not have single-nucleotide resolution [19, 23, 31]. Other techniques that do have single-nucleotide resolution, such as bisulfite sequencing for m^5^C detection [25] or m^6^A-SAC-seq for m^6^A [20], can sometimes be challenging to execute and interpret.
Having a standardized methodology for identifying HIV-1 RNA modification that produces consistent results, can be utilized for multiple modifications, is straightforward to implement, and can be correlated across multiple studies as a benchmark would help advance the field [30]. Here, we establish a reference methodology and epitranscriptome for HIV-1 using a common model virus genome, NL4-3, and the Jurkat T cell line. In addition, two RNA library preparation methods, ribosomal RNA (rRNA) depletion and poly-A selection, are combined to provide more consistent coverage, followed by nanopore direct RNA sequencing (dRNA-seq) with the latest flow cell chemistry. The most up-to-date base calling algorithms from Oxford Nanopore Technologies (ONT) are applied for multiplex base-calling of m^6^A, m^5^C, psi, inosine, and 2′-O-methyl (2′-O-Me). We report miscalling errors with current ONT algorithms due to unique sequence and modification contexts and provide baseline subtraction and correction methods to address them. This resulted in a refined modification map of a sense and preliminary, less-explored, antisense epitranscriptome of HIV-1 that can serve as benchmark references. We validated m^6^A modifications using small-molecule inhibition of methyltransferase-like 3 (METTL3) and synthetic HIV-1 RNA fragments. We applied our approach to mapping modifications in Jurkat cells treated with physiological levels of current first-line combination antiretroviral therapy (cART) treatment, primary CD4+ T cells, virions from CD4+ T cell culture, and CD4+ T cells from people living with HIV-1 (PLWH), which indicated conservation of most modifications. We also identified differential modification patterns for spliced transcripts versus unspliced HIV-1 RNA using dRNA-seq data. Together, these results provide an initial reference HIV-1 epitranscriptome for multiple modifications and offer a roadmap for creating other reference epitranscriptomes using state-of-the-art ONT nanopore sequencing.
Methods
Viral production and quantification
All HIV-1 isolates were generated by transfecting HEK 293T cells using polyethylenimine (PEI; PolySciences). Briefly, for a 10 cm plate production of VSV-G pseudotypes with GFP reporter genes, 6 µg of HIV-GFP-ΔEnv and 4 µg of pCMV-VSV-G (gifts from Dr Thomas J. Hope) were combined as described previously [32], and for replicative HIV-1 pNL4-3 we used 6 µg pNL4-3 (a gift from the HIV repository) with 1 ml of opti-MEM (Life Technologies). Forty microliters of PEI was added to the mix, briefly vortexed, and allowed to incubate for 15 min before addition to HEK 293T cells. After overnight transfection, media was changed to fresh Dulbecco’s modified Eagle’s medium with 10% fetal bovine serum (FBS), 1× penicillin-streptomycin (P/S), and harvested at 48 h post-transfection through a 0.45 µm PVDF membrane filter. For replicative NL4-3, we concentrated the supernatants over a 20% sucrose cushion at 5600 × g overnight in 15 ml tubes. p24 ELISA was performed to quantify the viral preparations (R&D Systems, HIV-1 Gag p24 Quantikine ELISA Kit).
Jurkat cell growth and infection
Jurkat cells (HIV repository) were cultured in RPMI-1640 medium supplemented with 10% FBS, 1× P/S, and non-essential amino acids. Cells were split to 2 × 10^5^ cells/ml and split when they reached 1.5 × 10^6^ cells/ml. When infecting with replicative NL4-3 and pHIVdEnv-GFP, we used a high cell density of ∼2.5 × 10^6^ cells/ml with a working concentration of HIV p24 (determined by ELISA) of 500–600 ng/ml. Cells were harvested at 45–48 h post-infection, and TRIzol (Invitrogen) added at 1 ml per 1 × 10^7^ cells.
CD4+ T cell negative selection, activation, growth, and infection
In whole blood tubes from unidentified healthy donors or leukopaks (Red Cross Blood), we performed negative cell separation following the vendors protocol with EasySep Direct Human CD4+ T cell Isolation Kits or RosetteSep Human CD4+ T cell Enrichment Cocktail (Stem Cell Technologies) with Lymphoprep (Stem Cell Technologies) as density medium, respectively. Before infection, we activated T cells by adding anti-CD3 and anti-CD28 antibodies along with IL-2 following the methods described previously [32]. Activation was confirmed 3 days post-treatment by measuring CD4+/CD69+ double-positive cell percentage by flow cytometry. Briefly: Day 0, isolate CD4+ T cells from blood and activate T cells; Day 3, test activation % and infect CD4+ T cells with WT-HIV; Day 4, TRIUMEQ cART treatment when applicable; Day 6, harvest CD4+ T cells for TRIzol lysis, as well as quantify infection rate by flow cytometry with KC57 antibody (as described below).
Patient sample CD4+ T cell collection, preparation, and latency reversal
Approximately 10 tubes of whole blood were collected in heparin tubes, typically totaling 85 ml of whole blood. CD4+ T cells were negatively selected with RosetteSep Human CD4+ T cell Enrichment Cocktail (Stem Cell Technologies) and SepMate tubes with Lymphoprep (Stem Cell Technologies) as density medium, following the vendor’s protocol. Cells were activated with anti-CD3 and anti-CD28 antibodies as described above for 3 days. On day 3, cells were fixed for flow cytometry to quantify the percentage of positive p24 cells, as described below. The serum on top of the gradient media after centrifugation was collected. Viral particles were concentrated by ultracentrifugation with a 20% sucrose cushion in phosphate buffered saline for 4 h at ∼120 000 x g using an SW28 rotor (Beckman Coulter). After concentration, the supernatant was discarded and TRIzol was added to the viral pellets (2 ml total for the initial ∼85 ml of plasma). Patient sample collections were approved by the Institutional Review Board (IRB) of Rush University (FWA number: 00000482; ORA Number: 21072211-IRB01).
Drug treatments
To model current cART therapy, we added TRIUMEQ with the respective concentrations for each of the antiretrovirals based on reported plasma levels: ABC 140 ng/ml, 3TC 670 ng/ml, DTG 3.9 μg/ml [33–38]. For STM2457 (MedChemExpress) treatment, stocks were dissolved in dimethyl sulfoxide and the drug was added simultaneously to infection of NL4-3 at a final of 30 µM.
Flow cytometry
Flow cytometry was performed with BD CytoFix/Cytoperm per vendor protocol with the following antibodies: to measure activation, anti-CD69 (Fisher Scientific BDB560968) and anti-CD25; to measure Gag/p24 in the cytoplasm to report infection or viral production, KC57-RD1 (Beckman Coulter, Cat#6604667) antibody (or isotype). Samples were analyzed in a LSR Fortessa (BD Biosciences), gating for doublets with SSC-H/SSC-W and FSC-H/FSC-W gates.
RNA extraction and preparation
Total RNA was extracted from cell pellets using the manufacturer’s recommended TRIzol (Invitrogen) protocol. rRNA was removed by enzymatic rRNA depletion or poly-A RNA was selected by bead-based enrichment. For rRNA depletion, ~20 µg of total RNA was hybridized with DNA oligonucleotides (Supplementary Table S1) complementary to rRNA [39]. The solution was heated to 95°C and slow cooled to 65°C to enable oligonucleotide hybridization. RNase H (McLabs, HTRH-200) was added (200 U), and the solution incubated at 65°C for 5 min to degrade rRNA. DNA was then degraded using 150 U of Turbo DNase (Invitrogen, AM2239) at 37°C for 20 min. RNA was then purified using AMPure XP beads, typically yielding ~1000 ng of recovered RNA.
For poly-A selection, the NEBNext High Input Poly(A) messenger RNA (mRNA) Isolation Module (E3370S) was used following the manufacturer’s recommended protocol. Briefly, ~20 µg of total RNA was added to a solution containing poly-A-binding beads. After several wash steps, the bead-bound RNA is eluted by heating, typically yielding ~500 ng of recovered RNA.
Synthetic and in vitro transcribed HIV-1 RNA fragments
For small unmodified and m^6^A-modified RNA fragments, 60-nucleotide RNAs were prepared by solid-phase TBDMS chemistry, HPLC purification, and mass spectrometry confirmation following previously published methods [40, 41]. Commercially available TBDMS m^6^A phosphoramidite was used (Glen Research). For in vitro transcription, pNL4-3 HIV-1 sequence was used to create primers for overlapping amplicons that span the HIV-1 genome (Supplementary Table S1). Primers were created with or without a 5′ T7 promoter sequence. Polymerase chain reaction amplification resulted in T7 promoter-containing fragments of the genome of 2–3 kb size, which were then used as templates in T7 RNA polymerase transcription reactions to make either sense or antisense RNA HIV-1 fragments. Briefly, ~1 µg of template DNA was used in a 60 µl reaction following previously published conditions and enzyme preparations [42]. For transcripts containing modified nucleotides, native ribonucleotide triphosphates (NTPs) were fully replaced with modified variants from TriLink Biotechnologies, either N^6^-methyladenosine-5′-triphosphate (N-1013) for ATP, 5-methylcytidine-5′-triphosphate (N-1014) for CTP, or pseudouridine-5′-triphosphate (N-1019) for UTP. Reactions were incubated at 37°C for 90 min, then treated with DNase I for an additional 20 min at 37°C. RNA was isolated by phenol/chloroform extraction and ethanol precipitation. The resulting RNA fragments were combined and sequenced by nanopore in the same manner as cell-derived RNA.
Nanopore direct RNA sequencing
For rRNA-depleted samples, polyadenylation reactions were performed using Escherichia coli Poly(A) Polymerase (New England Biolabs, M0276L). Briefly, 1000 ng of rRNA-depleted RNA was treated with 0.375 U/µl E. coli Poly(A) Polymerase in 1× E. coli Poly(A) Polymerase Buffer and 1 mM ATP. The reaction was incubated at 37°C for 2 min and stopped by addition of ethylenediaminetetraacetic acid to a final concentration of 10 mM. The reaction was bead-cleaned using 2× RNAClean XP Beads (Beckman Coulter), washed twice with 200 µl of 75% ethanol, and eluted in 7 µl of nuclease-free water. Concentration was determined using the Qubit RNA HS Kit (Thermo Fisher Scientific).
Library preparation of the RNA samples was completed using the dRNA-seq kit (SQK-RNA004; ONT), following manufacturer’s recommended protocol. Briefly, RNA samples were quantified using Qubit RNA HS Kit (Thermo Fisher Scientific). One microliter SUPERase·In was added, as well as 1 µl of 130 µM pooled reverse transcription (RT) primers reported by Baek and colleagues [18] for sample 13B (patient 8 CD4+ T cell) and 14H (CD4+ virions), and the mixture heated at 65°C for 5 min, then cooled to 4°C. The RT Adapter was ligated by adding 3 μl of 5× Quick Ligation Buffer (New England Biolabs, NEB), 2 μl of T4 DNA ligase at 2000 U/μl (NEB), 1 μl SUPERase·In RNase Inhibitor (20 U/μl) (Thermo Fisher Scientific), and incubated at room temperature for 15 min. RT was performed using a master mix containing 1× First-Strand Buffer, 10 mM DTT, 0.5 mM dNTPs, and nuclease-free water. The master mix was added to the RT adapter-ligated RNA, followed by the addition of SuperScript III Reverse Transcriptase (Thermo Fisher Scientific). Reactions were incubated in a thermal cycler at 50°C for 60 min, and 70°C for 10 min, with a final hold at 4°C. The reaction was purified using 1.8× RNAClean XP beads (Beckman Coulter), washed twice with 75% ethanol and eluted in 20 µl nuclease-free water. RNA adapter (RMX) ligation was performed by mixing 20 µl of the reverse transcribed RNA, 1× NEBNext Quick Ligation Buffer, 6 µl RNA Adapter (RLA), 3 µl nuclease-free water, and 3 µl T4 DNA Ligase at 2000 U/μl (NEB). The mixture was incubated at room temperature for 15 min. The reaction was purified using 0.4× RNAClean XP beads (Beckman Coulter), washed twice with Wash Buffer (WSH), and eluted in 33 µl of RNA elution buffer. The final library was loaded on a PromethION RNA flow cell (FLO-PRO004RA) and sequenced on a PromethION P2 solo using MinKNOW (v. 24.06.10) with POD5 and live base-calling on. Coverage was monitored using RAMPART software (https://artic.network/rampart), with a goal of 50× coverage across the HIV-1 genome.
Modification calling
After the completion of sequencing, reads were called from the POD5 files with Dorado software (v1.0.0) with settings for RNA modifications (RNA basecalling model [email protected], https://github.com/nanoporetech/dorado). Generated reads were automatically aligned against a FASTA file containing human (hg38) and HIV-1 genomes (https://www.ncbi.nlm.nih.gov/nuccore/AF324493), generating a BAM file.
Dorado basecaller sup,m5C_2OmeC, inosine_m6A_2OmeA,pseU_2OmeU,2OmeG [pod5_folder_location] -b 408 –min-qscore 8 –device cuda:all –mm2-opts “-x splice” –reference [reference_fasta_location] > unsorted_reads.bam
After generation of the BAM file, samtools was used to sort and index the BAM file.
samtools sort unsorted_reads.bam > sorted_reads.bam
samtools index sorted_reads.bam
This sorted BAM file was then processed by modkit (https://github.com/nanoporetech/modkit) to generate a file containing modification sites on the HIV-1 genome. A filter-threshold of 0.7 was utilized to standardize results between runs.
modkit pileup [location_of_BAM] –ref [loc_of_reference] -t 18 –region NL43_AF324493.2_R2R –max-depth 10000000 –bedgraph ./ –filter-threshold .7 –motif DRACH 2 –prefix m6A
modkit pileup [location_of_BAM] –ref [loc_of_reference] -t 18 –region NL43_AF324493.2_R2R –max-depth 10000000 –bedgraph ./ –filter-threshold .7 –motif C 0 –prefix m5C
modkit pileup [location_of_BAM] –ref [loc_of_reference] -t 18 –region NL43_AF324493.2_R2R –max-depth 10000000 –bedgraph ./ –filter-threshold .7 –motif T 0 –prefix Psi
modkit pileup [location_of_BAM] –ref [loc_of_reference] -t 18 –region NL43_AF324493.2_R2R –max-depth 10000000 –bedgraph ./ –filter-threshold .7 –motif A 0 –prefix Ino
The four 2′-O-methylated bases can be similarly called with motifs “A 0,” “T 0,” “C 0,” and “G 0.” These modkit commands will generate multiple files per command, with the filename differentiating between the sense and antisense direction of the reference, and between the modification being identified, with each modification having an identification code:
**: **
The result is a bedgraph file containing the reference genome name, the position of the calculated modification, the percentage of reads that had the modification, and the total number of reads in the BAM file at that nucleotide position, e.g.:
NL43_AF324493.2_R2R 8988 8989 0.64912283 57
To remove noise from low-coverage nucleotides, a filter was applied to the bedgraph file, removing modification locations that had <10× coverage for that nucleotide position.
awk ‘$5 > 9′ m6A_original.bedgraph > m6A_10xFilter.bedgraph
The resulting bedgraph file can be viewed in IGV (https://igv.org/) or opened directly in text-editing software as a tab-separated value (TSV) file.
False discovery rate calculations
False discovery rate (FDR) and false positive rate (FPR) were calculated following a previous report [43]. Briefly, the per-site modification frequency is calculated for each Jurkat sample at all possible modification sites (e.g. m^6^A at all DRACH motif sites, m^5^C at all C sites, etc.). These frequencies were averaged together for each sample. This average for the in vitro transcribed, modification-free sample is called the FPR. The FPR divided by the averaged per-site modification frequency for the wild-type sample is the FDR. For our study, because we have multiple replicates of the wild-type sample, their averages were first averaged within themselves before being used to divide the FPR. For all calculations, see https://tinyurl.com/HIV-FDR.
Modification P-value calculations
P-values were calculated using Nanopore’s modkit software v0.6.1. First, each sample was processed using modkit pileup and set to output as a bgzip’d bedMethyl file.
modkit pileup –ref [loc_of_reference] -t 18 –region NL43_AF324493.2_R2R –filter-threshold .7 –-modified-bases m6A m5C inosine pseU 2OmeA 2OmeU 2OmeC 2OmeG –bgzf [location_of_BAM] [output_bedMethyl.gz]
Next, the file was indexed.
tabix -p bed [location_of_bedMethyl.gz]
After indexing, the bedMethyl file for the control sample, the T7 in vitro transcribed HIV-1 RNA lacking any modifications, is compared against the bedMethyl file for each sample, using the modkit dmr command:
modkit dmr pair -t 18 –reference [location_of_ref] –-header –min-valid-coverage 10 -m A -m T -m C -m G -a [invitro.bedMethyl.gz] -b [sample.bedMethyl.gz]
The resulting DMR file has a P-value for each nucleotide in the transcriptome where there is a modification. This P-value measures the difference in modification frequency observed between the control and experimental samples.
Baseline correction of modification calling
RNA fragments in vitro transcribed and sequenced as described earlier were modification called the same as other samples. RNA fragments in vitro transcribed without modified bases were used as a baseline for all other samples, since any modification calls would be considered false. For each RNA sample used in the study, there is a modification frequency for each individual modification at each individual nucleotide that this modification could be present at. To correct sequence pattern-based miscalls, the T7 in vitro transcribed values for a particular modification and nucleotide position were subtracted from the value found for a sample at each nucleotide position. If this subtraction brought the value below zero, zero was used instead of a negative value.
Separating splice isoforms
Full-Length Alternative Isoform analysis of RNA (FLAIR) software was used to identify the splice isoforms of HIV-1 reads and generate readID lists of these isoforms [44]. These lists were used to separate splice isoforms into separate BAM files that could be called for modifications individually.
Full scripts and reference files can be found at https://github.com/gagnonlab/HIV-dRNA-Seq. Briefly, the aligned BAM file generated previously from modification calling is processed by FLAIR to identify the splice isoform of as many reads as possible. If a read could fit multiple isoforms, it is excluded.
flair quantify -r manifest.tsv -i hiv_isoforms.fa –output quantify/_ –threads 18 –generate_map –stringent –isoform_bed hiv_isoforms.bed
manifest.tsv is a list of BAM files to be processed. hiv_isoforms.fa is a FASTA file of each of the splice isoforms, based on the NL4-3 HIV-1 genome with the added GFP tag. hiv_isoforms.bed is a BED file mapping the different splice isoforms. Both are generated from a GTF file containing the splice isoforms. This process generates a read.map.txt file that contains a list of each of the input splice isoforms, followed by a comma-separated listing of all readIDs that match that splice isoform.
The list of readIDs is processed using a bash script to separate out these listed readIDs from a BAM file into individual BAM files for each splice isoform. The full script can be found on the lab GitHub, https://github.com/gagnonlab/HIV-dRNA-Seq. Briefly, it converts the binary BAM file into a text-readable SAM file, copies the header of the SAM file into each individual splice file, uses the grep command to match reads from the full SAM file to readIDs from each splice isoform, and then copies these reads into the appropriate SAM output file. The output SAM files are converted back into BAM files and indexed. Afterwards, the resulting BAMs can be modification-quantified using modkit as above.
Results
Direct RNA sequencing provides multiplexed base modification calling on HIV-1 viral RNA
To create an accessible and reproducible methodology for a reference HIV-1 epitranscriptome, we selected a commonly used NL4-3 HIV-1 genome and Jurkat T cells. We also selected the ONT nanopore-based dRNA-seq platform. Nanopore sequencing is a cost-effective, commercial technology that is well-supported and can perform direct sequencing of RNA with minimal alteration and without amplification [45]. Importantly, dRNA-seq has been used previously to sequence HIV-1 viral RNAs [18, 46, 47], albeit with older flow cell technologies and limited baseline correction. However, ongoing improvements in flow cell chemistry, as well as bioinformatic base-calling tools that are continually improving, have made dRNA-seq an increasingly attractive off-the-shelf solution for multiplexed identification of RNA modifications. The combination of these model systems offers an opportunity to establish a benchmark methodology and data set that is robust, reproducible, and accessible to help advance HIV-1 epitranscriptomic studies.
Our approach was to infect Jurkat cells with non-replicative HIV-GFP-ΔEnv-VSV-G virus, which contains a partial replacement of the nef gene with GFP that also includes a frameshift in the env gene (hereafter referred to as NL4-3-GFP-VSV). Total RNA was collected and depleted of rRNA or poly-A selected, then directly ONT nanopore sequenced. We performed base-calling with ONT algorithms to calculate likely modification sites for m^6^A, m^5^C, psi, inosine, and 2′-O-Me. Initial experiments were performed in three biological replicates and demonstrated high reproducibility across the NL4-3-GFP-VSV HIV-1 genome, in particular for m^6^A (Supplementary Fig. S1). To validate modification calling, we treated Jurkat cells with STM2457 at the time of infection. STM2457 is an inhibitor of METTL3 catalysis [48], the primary methyltransferase that writes m^6^A onto RNA. Multiple doses of STM2457 were tested, with a 30 μM dose of STM2457 being sufficient to reduce detection of all m^6^A modification sites on NL4-3-GFP-VSV RNA to less than half their normal value (Fig. 1A). This result supported accurate m^6^A identification by the current ONT algorithm.
Probing and correction of HIV-1 base modifications called by nanopore dRNA-seq. (A) Nanopore modification-calling results for HIV-1 viral RNA extracted from Jurkat cell cultures treated with STM2457, a drug that inhibits METTL3 m6A modification activity. The position of each high frequency m6A nucleotide in the HIV-1 genome is indicated on the x-axis. (B) Comparison of modifications at DRACH motif sites where m6A was called at high frequency. The nucleotide position of the m6A modification within the DRACH motif is indicated on the x-axis. Upper plot: comparison of Jurkat cell samples infected with HIV-1 without (back row) or with (front row) 30 μM STM2457 treatment. Lower graph: comparison of Jurkat cell sample infected with HIV-1 before (back row) and after (front row) baseline correction. Comparison of modification calling between NL4-3 from Jurkat cells (C) and two synthetic HIV-1 RNA fragments, one unmodified (D) and one bearing m6A (E) at two DRACH motifs. The nucleotide position corresponding to the NL4-3 genome is indicated on the x-axis. All modifications called in panel (D) are considered incorrect while modifications called in panel (E), besides m6A at position 8975 and 8989, are considered incorrect.
Background subtraction and correction of HIV-1 RNA modification calls
In our initial data set, an unusual modification pattern was consistently observed at the 3′ end of the HIV-1 transcript, where the majority of m^6^A sites are predicted. Several of these m^6^A modification locations were reported by the base-calling algorithm to also possess significant levels of adjacent m^5^C and psi (Fig. 1B). METTL3 catalysis of m^6^A strongly prefers a DRACH motif (A/U/G, A/G, A, C, A/U/C). The m^6^A modification itself occurs at the central adenosine, which is always followed by cytidine and then often uridine. One possible explanation for such a modification pattern would be miscalling due to a sequence motif context or nearby modifications that influence the base-calling of others. Nanopore flow cells read nucleic acids in five-nucleotide “frames,” or k-mers, which shift one base at a time to generate current signals. This can lead to modification-calling artifacts by algorithms during computational interpretation of current signals, especially for uncommon sequence motifs or nearby unaccounted modifications. Unexpectedly, treatment with STM2457 not only reduced the m^6^A calling at these DRACH motifs, but also reduced the m^5^C call counts, suggesting a dependence on m^6^A presence (Fig. 1B). Conversely, the low frequency of psi calls in DRACH motifs did not respond to STM2457 treatment while a 2′-O-Me call did at one site, position 8570. Single-read analysis showed that m^5^C was present at these sites almost exclusively only when m^6^A was also present (Supplementary Fig. S2). Thus, we questioned the legitimacy of m^5^C and potentially other modifications in and immediately adjacent to DRACH motifs, as well as the overall fidelity of multiplexed modification base-calling. To better understand the dependence of m^5^C calls on adjacent m^6^A in DRACH motifs, we chemically synthesized RNA oligonucleotides bearing two adjacent DRACH motifs found in the NL4-3 genome, where m^6^A is called with high frequency, at positions 8975 and 8989. One oligonucleotide contained no modifications, while the other contained pure m^6^A at both positions. When these oligonucleotides were sequenced, we observed an increased frequency of m^5^C miscalling in the DRACH motifs when m^6^A was present (Fig. 1C–E). In addition, psi and 2′-O-Me were observed at low but detectable levels in unmodified RNA at certain nucleotide positions, as was m^5^C to a lesser extent, indicating background calling induced by sequence context (Fig. 1D and Supplementary Fig. S3A–C). To address the potential influence of sequence context on modification calling at a broader scale, we prepared overlapping multi-kilobase DNA templates from an NL4-3 plasmid (replicative variant without GFP) for in vitro T7 RNA polymerase transcription. Fragments were in vitro transcribed with standard nucleotide triphosphates and sequenced, revealing very minimal background miscalling for m^6^A, low miscalling for psi, inosine, and 2′-O-Me, and more substantial sequence-dependent miscalling for m^5^C (Supplementary Fig. S4). By using the modification-calling frequencies from in vitro transcribed NL4-3 fragments, we performed a baseline correction [18]. After correction, the original calling of m^6^A remained nearly unchanged, indicating that the base-calling algorithm is quite accurate (Fig. 2A, Supplementary Fig. S1, and Supplementary Table S2). In contrast, only a handful of high-confidence calls for m^5^C, inosine, and 2′-O-Me modifications were observed across the NL4-3-GFP-VSV viral RNAs after correction and using a 10% cutoff threshold (Fig. 2B, D, and E; Supplementary Tables S2–S5). Baseline correction resulted in no pseudouridine calling frequencies above 10%, removing all potentially significant pseudouridine modifications (Fig. 2C and Supplementary Table S6).
Nanopore-based modification calling of the most reliable, baseline-subtracted HIV-1 RNA modifications on viral RNA from Jurkat cells. The HIV-1 genome architecture is illustrated above. Modifications are shown if, after baseline correction, the average modification frequency was at least 10%. (A) m6A (blue), (B) m5C (green), (C) pseudouridine (psi) (yellow), (D) inosine (purple), and (E) 2′-O-methylation (orange). Inset in panel (A) is a close-up of the 3′ end of the NL4-3 HIV-1 genome where m6A is most densely called. Results are the average of three separate biological replicates. Error bars are standard deviation.
Statistical analyses of the modification calling algorithms produced different results for each modification. These analyses utilized the FPR and FDR methods from a recent study that evaluated the accuracy of older nanopore modification algorithms, with some small adjustments [43]. The method looks at both the per-read FDR, which considers at all modified and non-modified bases across all reads, and per-site FDR, which considers at all modified and non-modified bases for a single site, then averages all sites together. Because our reads had significantly higher coverage at the 3′ end of the HIV-1 transcript than the 5′ end, looking at the per-read statistics of our sequencing would artificially bias the behavior of the algorithm toward the sequences near the 3′ end of the transcript. We therefore chose to look only at the per-site FPR/FDR.
These calculations agree with our prior observations. m^6^A, at our 70% confidence threshold, has an FPR of 0.005 and an FDR of 0.130. Inosine has the second-lowest rates, with an FPR of 0.005 and an FDR of 0.852. Pseudouridine and 2′-O-methyl are more similar, with FPRs of 0.007 and 0.006 and FDRs of 1.012 and 1.035, respectively. m^5^C has the highest rates, with an FPR of 0.020 and an FDR of 1.198. These high FDR rates indicate that there are more modifications being called on the unmodified in vitro transcribed RNA than on the HIV-derived samples. This further reinforces that background subtraction, while creating the possibility for false negatives, is currently essential for avoiding numerous false positives.
Another method of analyzing the accuracy of these modifications involves using Nanopore’s modkit software to assign a P-value to each modification in each sample. These P-values can be used as cutoffs for deciding which modifications are significant, similar to the 10% frequency cutoff we utilized. A recent study utilized P-values < .00001 as their cutoff [49]. However, these P-values will ignore true-positive results that are low in frequency (Supplementary Fig. S5). There are multiple m^6^A modifications that occur at only 10%–15% frequency but are likely to be accurate due to their sharp reduction upon STM2457 treatment. These modifications are assigned P-values of .001 to .01 and would be ignored with a stringent P-value cutoff. For other modifications, where the baseline values are higher, a higher corrected value closer to 25%–30% frequency is required.
To consider how the presence of one modification influences the calling of another, the same overlapping multi-kilobase DNA templates were in vitro transcribed using modified bases to produce NL4-3 transcripts in which all adenosines had been replaced with m^6^A, all cytidines had been replaced with m^5^C, or all uridines had been replaced with pseudouridine. These heavily modified transcripts were also sequenced, and modification frequencies for the other modifications were called. Baseline correction was also applied to remove false modification calls that would be present without adjacent modifications. For example, the baseline of m^5^C from unmodified in vitro transcribed RNA was subtracted from the m^5^C modification calls from in vitro transcribed RNA with only m^6^A modifications to derive a list of incorrect m^5^C calls resulting from nearby true m^6^A modifications. Several trends in the current generation of Nanopore modification calling algorithms are exemplified in these analyses (Supplementary Fig. S6). m^6^A, psi, and inosine can be falsely called sporadically by other nearby modifications, but there is not a strong pattern. On the other hand, m^5^C is frequently miscalled immediately 3′ of m^6^A in DRACH motifs. Calling of 2′-O-Me appears to be the lowest accuracy when other modifications are overlapping or adjacent. For example, when m^6^A is present, Am on the same nucleotide or a Gm immediately 5′ of the A in the DRACH motif will frequently be miscalled. When m^5^C is present, Gm or Cm immediately 5′ can be incorrectly called. When psi is present, an Am or Cm directly 5′ can also be miscalled.
During T7 transcription, inosine will incorporate in the place of guanosine rather than adenosine, which would place it in the incorrect positions to compare against cell-based samples [50]. In order to examine the ability of the algorithm to correctly detect inosine, synthetic HIV-1 fragments were utilized. Two fragments, based on the reverse strand of HIV-1 in the asp gene with two suspected inosine sites, were chemically synthesized with or without inosines (Supplementary Fig. S7B). The sample without inosines correctly reported a lack of inosine calls. The sample with inosine at positions 6911 and 6916 in the fragment reported inosine at those sites with 95% frequency, similar to the other modification calling algorithms. There was no apparent miscalling of modifications adjacent to inosine sites.
To test the miscalling of modifications due to the presence of a 2′-O-methylated nucleotide, four ∼60 nt RNA oligonucleotides of HIV-1 transcript fragments were synthesized, each of which contained one or two 2′-O-Me nucleotides at suspected HIV-1 locations [29], and were compared against sequencing results from in vitro transcribed RNA fragments containing no modifications (Supplementary Fig. S3D and E). The result was correct calling of 2′-O-Me at expected nucleotides for synthetic 2′-O-methylated fragments as well as miscalls immediately 5′ and 3′ of those expected positions. m5C was also miscalled several bases 3′ of the expected 2′-O-Me nucleotides.
Using the current ONT algorithm, 2′-O-Me calls directly adjacent of another modification should be considered with caution. A prominent example is a Gm call at position 8570, directly 5′ of a called m^6^A. When STM2457 is used to reduce the amount of m^6^A present, calls of Gm-8570 also decreased substantially. Thus, this was considered a false call and removed from the final list of corrected modifications. Similarly, m^5^C modification calls at 8976 and 8990 are both 3′ of m^6^A modifications and have been removed. Well-defined sequences, especially viral transcriptomes and the genomes of RNA viruses, should be amenable to these methods of baseline correction.
Features of the corrected HIV-1 epitranscriptome
Several observations can be noted for the different modifications identified in NL4-3-GFP-VSV viral RNAs. For m^6^A, nearly all high-frequency modifications are found on the 3′ end of the genome, in approximately the last 1000 bases (Fig. 2A and Supplementary Table S2). These include a pair at positions 8079 (90.69% ± 0.28%) and 8110 (93.70% ± 0.73%), which occur in the region where rev and env genes overlap, and a trio at 8947 (48.58% ± 0.12%), 8975 (91.45% ± 1.34%), and 8989 (70.58% ± 1.05%) in the 3′ LTR just beyond the end of the nef gene. Evidence for m^6^A modifications in these areas is well-documented, with previous MeRip-Seq and PA-m^6^A-Seq experiments showing high probabilities of m^6^A in these regions [19, 21, 23, 24, 51]. One group used silent mutations to remove A nucleotides, either individually or altogether, with A8079G, A8975C, and A8989T mutations [18]. The triple mutation resulted in reduced HIV-1 protein expression. There is also a trio of m^6^A modifications in the middle of the nef gene [8571 (80.00% ± 1.02%), 8621 (81.09% ± 0.23%), and 8710 (47.73% ± 0.67%)], but it is possible that these methylations are artificially induced. The NL4-3-GFP-VSV genome possesses a GFP tag that replaces approximately the first 100 bases of the nef gene and the frequency of these modifications are dramatically reduced when a non-GFP genome is used, which we describe later. There is a single m^6^A modification [A125 (11.24% ± 5.72%)] that is present in the 5′ LTR. While it does not exhibit a high level of modification, it is notable for being the only reproducible and significant m^6^A not on the 3′ end of the genome. A previous study suggested an m^6^A in this region but lacked single-nucleotide resolution [19]. Finally, there is an m^6^A modification called at position 8034 (45.21% ± 0.97%). This modification is not shown in Fig. 2A because it is in a near-canonical DRACH motif, GGACG, in which the final G does not meet the (A/U/C) requirement. However, it is consistently called across all samples and shows decreasing frequency with increasing amounts of STM2457, suggesting it is an accurate call. This modification is presented separately in Supplementary Table S2.
After baseline correction, only two m^5^C modifications emerged (Fig. 2B). Position 8114 (25.11% ± 5.91%) in the rev/env region is the more consistent of the two modification calls. It is ~15%–30% frequency in all samples (Supplementary Table S3). Position 1551 (11.00% ± 9.85%) in the gag gene, by contrast, is more varied in m^5^C calling frequency from sample to sample. Variability between samples suggests the possibility that this site is more dynamically regulated or sensitive to cellular signals.
The ONT base-calling algorithm identified several potential psi modifications in uncorrected sequencing data (Supplementary Fig. S1). However, most of these were also called in the unmodified in vitro transcribed viral genome fragments (Supplementary Fig. S4), indicating they are likely to be incorrect calls. After removing background signal, there were no pseudouridine calls that met our 10% threshold cutoff (Fig. 2C). A lower threshold of 5% was used to generate a list of potential psi modification sites (Supplementary Table S6) for further investigation. Like m^5^C, there is higher variance in psi calling frequency across samples.
Even prior to background subtraction, inosine modifications were called infrequently (Fig. 2D). Most sites ranged between 10% and 20% modification. After background subtraction, only two positions reached a 10% threshold. Inosine may only be present on the HIV-1 sense transcript in low abundance and may not be a significant target of adenosine deaminases under these conditions. The highest frequency inosine modifications were 8037 (12.72% ± 2.11%) (env) and 8568 (12.67% ± 6.13%) (nef). Both of these modifications decreased in the STM2457-treated sample, despite not being immediately adjacent to an m^6^A (Supplementary Table S4). The inosine called at position 8568 is closely located between an m^6^A at position 8564 and 8571. It is possible that the k-mer frame of the sequencer produces an incorrect call despite being a few bases away from m^6^A on either side. The nearest m^6^As to position 8037 is the near-canonical DRACH motif m^6^A at position 8034 described earlier. There may also be a biological explanation for this inosine frequency change. Recent studies have suggested that m^6^A may play a role in the ADAR conversion of adenosine to inosine by methylating the ADAR mRNA, resulting in increased protein production [52, 53]. In that case, inhibition of METTL3 would reduce ADAR1 production and reduce inosine in HIV-1 RNA. More investigation is needed to fully understand the potential co-dependence of inosine on STM2457 treatment.
Calling of 2′-O-methylation is the latest algorithm to be added to Nanopore dRNA-seq. At present, it seems to be less reliable than other algorithms. Nonetheless, there are four sites where 2′-O-Me calling passes a 10% threshold: 1749 (11.62% ± 10.28%) (gag/pol), 3107 (11.36% ± 9.82%) (pol), 9162 (20.08% ± 6.39%) (3′ LTR), and 9164 (15.22% ± 11.65%) (3′ LTR) (Fig. 2E and Supplementary Table S5). At each site, the frequency of methylation appears variable. However, we observed that samples from Jurkat cells grown at the same time have similar frequencies of methylation, suggesting sensitivity to cellular conditions. A previous study of 2′-O-methylation in HIV-1 used RiboMethSeq and radiolabeling to identify seventeen potential 2′-O-Me sites [29]. None of these sites were identified by the Nanopore 2′-O-Me calling algorithm. To determine if this was a sequence-specific negative miscalling, five of these sites were synthesized on ∼60 nt RNA fragments, as previously mentioned (Supplementary Fig. S3E). Each of these sites was correctly called at a high frequency, indicating that there is no sequence-dependent issue with the modification calling algorithm at these locations. RiboMethSeq and radiolabeling, however, are not quantitative. It is possible that these modifications are present, but only at low frequencies.
Transcript splice isoforms possess distinct RNA modification frequencies
To examine potential differences in modification patterns between spliced HIV-1 transcripts, a sequencing run of HIV-1 from Jurkat cells with the highest number of full-length reads was used to sort reads into spliced transcript isoform bins using the FLAIR software package [44]. HIV-1 transcripts can be full-length, or large sections spliced, such as the gag/pol region [54]. There are two splice site exons, S1 and S2, which can optionally be included in splice variants [54]. Not including new combinations produced by adding one or both splice donors, there are approximately six primary isoforms aside from the full-length, unspliced form [54, 55] (Fig. 3).
Differential modification frequencies in HIV-1 RNA splice isoforms. Comparison of modification frequency in each splice isoform as compared to the full-length (unspliced) isoform. X-axis represents individual nucleotides in the HIV-1 genome. Y-axis represents percentage of reads that contained the mutation of interest for that nucleotide, relative to the unspliced form. Bar color indicates modification type: m6A (blue), m5C (green), pseudouridine (yellow), inosine (purple), and 2′-O-methyl (orange). Only modifications with a difference of 10% from the unspliced RNA in at least one isoform are shown. Shaded background shows portion of the full transcript retained in the spliced isoform. Splice isoforms are subsets of direct-RNA sequencing sample 7C. The A6 splice site is exclusive to the HXB2 genome and therefore not indicated here.
After sorting, reads were further split and grouped into each splicing variant, and modifications were called independently for each variant, including background subtraction as described earlier. The full-length, unspliced read, which codes for the whole genome, including gag/pol, was used as a comparative baseline. Modifications in which there was at least a 10% difference in modification frequency at that nucleotide between the baseline and one of the other splice isoforms are plotted in Fig. 3 and listed in Supplementary Table S7. Several trends in differential spliced transcript modifications are notable. First, many are near the splice sites. For example, both vif and vpr splice isoforms have an m^6^A modification at position 5160, a 2′-O-Me at 5750, and a psi at 6242 that are increased as compared to the unspliced transcript. Similarly, another m^6^A modification at position 5434 is near the beginning of tat and increased in tat splice isoforms compared to the unspliced transcript. There is an m^5^C modification near the 5′ LTR at nucleotide position 265 that is of higher frequency in the unspliced transcript than any spliced isoform. One differential modification, m^6^A at 5006, is in the S2 splice donor, which is be incorporated into multiple splice isoforms and most abundant in the tat spliced transcript isoform.
The remainder of differential modification calls are in transcript 3′ LTRs. These include m^5^C at positions 9008 and 9019, both of which are called most frequently in the vpr splice isoform. A psi at position 8851 is nearly absent in the unspliced form but most frequently called in vif and rev isoforms. A 2′-O-Me at position 9166 (adjacent to constitutively methylated bases at positions 9162 and 9164) is most frequently called in the vpr isoform and least in the vif isoform. Most of these modifications are of relatively lower frequency, such that they did not reach the 10% average baseline-corrected threshold. However, the m^6^A at position 8975 is very prominent. All splice isoforms combined yielded a calling frequency of 89%, the unspliced form had the highest at 94%, and the rev isoform had the lowest at 82%.
To explore whether any of these modifications might have an effect on splicing, the list of modifications was compared against a list of known splicing enhancers or silencers [55]. Those modifications that fall within a splicing regulatory element are listed in Supplementary Table S8. The m^5^C modification called at position 265 is within ESE U5, which is speculated to be bound by SR protein SC35 (SRSF2) to enhance the activation of splice donor D1 [56]. m^5^C is most frequently expressed at this position in the unspliced isoform, suggesting possible influence on SRSF2 binding and inhibition of splicing. The m^6^A modification at position 5006 is within HIVE3D3, a motif shown to be bound by high mobility group A protein 1a (HMGA1a) [57]. This binding leads to an increase in the amount of vpr splice isoform made, as it promotes the activity of splice acceptor A2 and silences the activity of D3. However, for our sample, there does not seem to be any impact of this m^6^A modification on this D3 splice activity since the vpr splice isoform has similar m^6^A levels at 5006 as other splice isoforms. Finally, the m^6^As called at positions 7983 and 7989 are not significantly differentially expressed between splice isoforms despite falling within the ESE-3 splice enhancer. ESE-3 enhances the splicing of splice acceptor A7, which is utilized by the tat, rev, and nef splice isoforms [58]. Thus, these modifications are unlikely to play a role in the regulation of their cognate splice site.
A preliminary HIV-1 antisense epitranscriptome
Nanopore dRNA-seq and analysis also enables differentiation of transcript orientation, which allows for modification calling on antisense transcripts like those utilized by the asp gene [59]. However, antisense transcripts are very low in abundance, especially outside of the asp gene. In the sequencing performed in this study, antisense reads make up only 0.4% of all HIV-1 reads. To interrogate the antisense genome, we bioinformatically combined antisense reads from eleven Jurkat cell samples. This produced coverage ranging from 10× to 50× across the entire HIV-1 antisense transcriptome. Because these reads have lower coverage and are combined from multiple samples, results may be more variable and should be considered more preliminary. To reduce the background noise, we performed corrections with unmodified in vitro transcribed antisense RNA fragments and only considered modification calling when at least 10× coverage was present with a calling frequency of >20% (Supplementary Table S9). However, modification calling within the asp gene (positions 6910–7479) was set to a lower cutoff threshold of 10% frequency due to higher average coverage, its gene status, and previous investigation of this asp transcript modification [60]. There were no m^6^A calls >20% frequency in the antisense epitranscriptome and only one such call at just under 20% in the asp gene (Fig. 4A). Conversely, inosine modifications, which were rare on the sense epitranscriptome, occur at multiple positions in the antisense direction with modification frequencies at 20%–30%. The modification called with highest frequency was inosine at positions 95 and 118, with 28% and 30%, respectively (Fig. 4A and Supplementary Table S9). Within the asp gene itself, the most prominent modification is inosine at 6911. While there are no significant psi modification calls on the sense transcript, there are two on the antisense transcript, at positions 1093 (53.85%) and 3104 (20.80%). All five modifications that can be currently called with ONT algorithms, m^6^A, m^5^C, psi, inosine, and 2′-O-Me, were previously identified by mass spectrometry at low or moderate abundance in asp gene antisense transcripts [60]. Modifications to the antisense transcripts have not been previously site-specifically mapped and may play a role in the processing or function of the HIV-1 antisense transcriptome.
A preliminary HIV-1 antisense epitranscriptome, the effect of cART and cell type on HIV-1 modification calling, and conservation of m6A in HIV-1 genomes from PLWH. (A) Nanopore modification-calling for HIV-1 antisense RNA from Jurkat cells. m6A (blue), m5C (green), pseudouridine (yellow), inosine (purple), and 2′-O-methylation (orange). Inset shows a close-up of the asp gene. (B) Nanopore modification-calling results for HIV-1 RNA taken from Jurkat cells treated without (darker color) or with (lighter color) cART treatment. Error bars represent the standard deviation from three separate biological replicates. (C) Nanopore modification-calling results for HIV-1 RNA taken from Jurkat cells, primary CD4+ T cells infected in vitro, supernatant of primary CD4+ T cells infected in vitro, and CD4+ T cells from PLWH samples. In each nucleotide position cluster, the first bar is Jurkat cell samples infected with HIV-1, the second bar is CD4+ T cells from healthy donors and infected with HIV-1, the third bar is the supernatant from the CD4+ T cells from healthy donors, and the fourth, fifth, and sixth bars are samples taken from CD4+ T cells from PLWH donors. (D) Top: comparison of m6A modifications called from HIV-1 RNA from Jurkat cells against preservation of the DRACH motifs for these m6A modifications sequenced from three PLWH samples. Bottom: analysis of conservation of known m6A modification sites in a larger dataset of HIV-1 mutations. Positions containing m6A between nucleotide positions 8000 and 9171 were identified and the average and median Shannon entropy for these positions are reported.
RNA preparation and cART treatments do not alter modification calling
During sample preparation, two methods of RNA enrichment were performed, a bead-based poly-A selection method and an RNase H-based rRNA degradation method. Each method yielded a different configuration of RNA fragments for sequencing, allowing a greater diversity of reads (Supplementary Fig. S8A). Poly-A selection produced relatively long reads from dRNA-seq that usually began at the 3′ end of the viral genome, resulting in high 3′-end coverage but low coverage across the middle and 5′ regions. On the other hand, rRNA depletion resulted in smaller RNA fragments that were well-distributed across the NL4-3-GFP-VSV genome. In addition, we observed that poly-A selection alone does not capture host non-coding RNAs efficiently, while rRNA depletion does not cover protein-coding mRNAs sufficiently (Supplementary Fig. S8B,D-F). Thus, by combining both methods, either experimentally before sequencing or bioinformatically after, coverage across viral and host transcriptomes was optimized.
There is the possibility that these selection methods could result in modification-calling bias due to RNA species enrichment. Therefore, we compared modified base calling between these two methods. The same sample was prepared by both methods separately and sequenced independently. The 3′ end of the HIV-1 genome was compared for modifications, as this region received reasonable coverage under both methods. Our analysis shows little difference between methods when calling modifications for HIV-1 (Supplementary Fig. S8C). Therefore, it is anticipated that using one or both RNA preparation methods should not significantly alter modification calling for HIV-1.
We also sought to examine the effect of cART. Three replicates of Jurkat cells, infected with NL4-3-GFP-VSV virions, were treated with cART prior to RNA collection. These samples were processed and sequenced as before. We observed no obvious differences in modification calling between samples with and without cART treatment (Fig. 4B). This suggests that cART does not substantially perturb the enzymatic modification pathways in these cells and is not a clear modifier of the HIV-1 epitranscriptome.
The HIV-1 epitranscriptome in CD4+ T cells from people living with HIV
To assess the ex vivo and in vivo epitranscriptome of HIV-1, including derived from PLWH, we first performed sequencing of HIV-1 viral RNA from cultured primary infected CD4+ T cells, which were isolated from whole blood of healthy donors. CD4+ T cells were activated and subsequently infected with replicative wild-type NL4-3 HIV-1. The supernatants of these cultured CD4+ T cells were also collected for HIV-1 virions. We then harvested CD4+ T cells from whole-blood samples of three PLWH individuals. CD3/28 was used to activate the CD4+ T cells for 3 days, and then RNA was extracted for dRNA-seq. The significantly lower amount of HIV-1 viral RNA in these samples required longer sequencing. For one sample, patient 8, we also employed a method used by Baek and colleagues, inclusion of HIV-1-specific RT primers [18]. However, this did not significantly boost coverage. Thus, only 3′-end transcript coverage was achievable for PLWH samples. For modification sites where we obtained sufficient coverage, we compared their levels to that of HIV-1 RNA isolated from Jurkat and CD4+ T cells grown in culture, as well as virion particles isolated from CD4+ T cell culture (Fig. 4C). In general, modifications were found at the same positions and at similar frequencies compared to Jurkat cell culture samples. However, m^6^A modifications in cell-culture-infected CD4+ T cells and virion samples were called at a lower frequency at nearly all positions. We noted that m^6^A at positions 8564, 8571, 8621, and 8710 were markedly higher for Jurkat cells as compared to other samples. The most likely explanation for this discrepancy is the presence of an altered sequence, specifically partial replacement of nef with GFP, in the NL4-3-GFP-VSV genome.
For each patient sample sequenced, a new consensus reference sequence was also generated. This allows for the identification of modification sites introduced by indels and SNPs. Due to a lack of modification-free background data for these altered sequences, only m^6^A, which has been shown to need the least baseline correction, was quantified. An example of these additional DRACH motif sites is shown in Supplementary Fig. S9, which shows a comparison of the standard NL4-3 HIV-1 genome and that of patient 8, which has two insertions at the 3′ end of the nef gene. These insertions, along with multiple SNPs, introduce four new DRACH motif sites and a new stop codon. Three of these four new DRACH motif sites show evidence of m^6^A modification. However, these modifications are not included in Supplementary Table S2, as they are only present in one sample of the study and are not representative of the standard reference. Nonetheless, they support the importance of m^6^A as a modification used to potentially modulate activity of HIV-1 variants.
The conservation of m^6^A modifications, as well as the DRACH motifs where they occur, may hold functional significance. One consideration is the rate of mutation at these sites as compared to other nucleotide positions across the HIV-1 genome. HIV-1 is known to have a high mutation rate, one of the reasons why developing treatments is complicated [61]. The frequency of m^6^A modification was therefore correlated with the sequences covered in our PLWH samples (Fig. 4D). Between HIV-1 genomic positions 8000 and 9172, there are 26 DRACH motifs. Of these, sixteen were not mutated in any of the three samples. This includes eight high-frequency m^6^A modification sites. Of the ten DRACH motif sites that were disrupted in one or more PLWH samples, only A8710 showed a significant m^6^A modification frequency (47.7%) when extracted from Jurkat cell samples, but not in cultured CD4+ T cells and virions.
We next compared these m^6^A sites against the Nextclade HIV-1 (HXB2) dataset, which comprises nucleotide diversity of ~1000 HIV-1 genomes, and the Los Alamos National Laboratories HIV-1 dataset (https://www.hiv.lanl.gov/), which comprises nucleotide diversity of ~5000 HIV-1 genomes [62]. For a given nucleotide position, the Shannon entropy was calculated based on the variability of nucleotides among the aligned HIV-1 sequences in the dataset, on a scale of 0–10, with 10 representing equal distribution of all four nucleotides. A typical nucleotide position will have an entropy value of ~0.25–0.28. The highest entropy at any nucleotide position was 1.43. In contrast, lower levels of entropy, such as 0.025, are associated with essential genetic components, such as the start codons for HIV-1 proteins. When surveying the eight high-frequency m^6^A sites, the average entropy of the NextClade and Los Alamos Labs datasets were 0.111 and 0.105, respectively, which are lower than the average rates (Fig. 4D). Together, these results suggest that these residues may be conserved for their m^6^A modification status and therefore play important roles in viral function and replication fitness.
Discussion
The role of chemical modifications to HIV-1 viral RNA in biology and disease, now referred to as epitranscriptomics, has been a growing topic of investigation for over three decades [28, 63]. Sequencing-based methodologies to both identify the types and map the locations of RNA modifications have significantly improved our understanding of their potential roles [8, 17, 64, 65]. These findings should enable sequence-specific manipulation of modifications for mechanistic studies, as well as potential therapeutic strategies [18, 66–73]. Unfortunately, the diversity of sequencing-based methods, among other variables, has made it difficult to correlate findings across multiple studies or confidently probe the role of modifications. Standard inclusion of common experimental or bioinformatic controls would allow for straightforward normalization and correlation across many studies, providing a standard reference to benchmark against. Such a reference epitranscriptome could help ensure that results were reproducible, rigorous, and well controlled, and advance the study of HIV-1 epitranscriptomics by providing an initial level of ground truth.
Here, we sought to establish an HIV-1 reference epitranscriptome to strengthen data interpretation, better integrate results with those of other research groups, and create a foundation for HIV-1 epitranscriptomics to build upon. Commercial nanopore dRNA-seq reagents, workflows, and bioinformatic tools are broadly accessible and have undergone significant improvements. Thus, nanopore can provide a common tool, and dRNA-seq results can offer a reproducible reference for detection of RNA modifications on HIV-1 viral RNAs. Existing data can also be reanalyzed when new algorithms become available, making it possible to map new modifications by simply repeating base-calling analyses or improve accuracy as algorithms evolve. In addition, current algorithms enable the equivalent of multiplexed modified base-calling, dramatically reducing resource demands. Together, these features make nanopore-based sequencing for epitranscriptomics cost-effective and data-rich [74]. When combined with commonly used reagents, such as widely accepted cell lines and genomes, a nanopore-based reference epitranscriptome becomes broadly useful.
We compared RNA preparation techniques, effects of cART, different cell lines, and virions, as well as characterized modifications in HIV-1 viral RNA samples from PLWH. We also constructed the first HIV-1 antisense epitranscriptome, which could provide unexpected insight into poorly understood HIV-1 antisense transcript biology. We found that nanopore-based dRNA-seq was generally robust and reproducible across replicates and sample and cell types. We discovered chemical modification features that have not been previously observed, such as the depletion of m^6^A but enrichment of inosine and psi on antisense HIV-1 viral RNAs, which appears to be the inverse for sense viral RNAs. We also discovered m^6^A and other modifications, albeit at a lower frequency, on HIV-1 viral RNAs from PLWH. These appear to be conserved when comparing NL4-3 to naturally circulating genomes, though their frequency at individual nucleotide positions may reflect cellular and environmental contexts. Treatment with cART had no effect on modifications called at high confidence, suggesting that these drugs do not significantly alter the recognition and modification of HIV-1 RNAs.
Our results provide implications for individual modifications in HIV-1 viral RNA. While the presence of m^5^C was previously reported by Cullen and colleagues using an antibody capture approach [24], another group was unable to detect the same m^5^C by bisulfite sequencing methods [25]. Our results suggest that m^5^C is present but at relatively low levels at only a handful of positions, possibly offering an explanation to the two conflicting reports. The m^5^C, inosine, and 2′-O-Me sites that remain at sufficiently high frequency after baseline correction may be worth further investigation. They are relatively distant from other modifications, suggesting they are not miscalled due to neighboring modifications, at least for those currently detectable by ONT algorithms. The high accuracy and precision of m^6^A calling we observed with ONT algorithms, especially at the 3′ end of the genome, combined with apparent conservation in naturally circulating HIV-1 strains, strengthens the functional significance of m^6^A in HIV-1 infection, host innate immunity, replication, and gene expression [8, 17, 64, 65].
We observed differential modification-calling frequencies on transcript splice isoforms. It is possible that splicing presents alternative accessibility to modifying enzymes due to RNA structure or protein interactions. Conversely, RNA modifications could alter splicing patterns by the same principles of altered RNA folding or protein binding. The potential order of events remains unclear, however, especially because some RNA modifying enzymes, like METTL3, are known to be nuclear localized [75]. Nonetheless, the ability to detect splice isoform-specific modifications with nanopore will be useful in the study of HIV-1 regulatory mechanisms at the RNA or translation level, which help control the stoichiometry of viral factors [54, 55].
As these epitranscriptomic modifications of HIV-1 RNA are well conserved, it will be interesting to determine their specific roles in future studies. HIV replication is an intricate process divided into the early and late steps of infection, where both the incoming and the produced unspliced/packaged or spliced and translatable RNA are mechanistically independent and with different functions. We speculate that the removal of specific modifications to the RNA will result in either the lower production of HIV from infected cells or result in lower infection rates and different activation patterns of innate immunity-related pattern recognition receptors such as MDA5 or RIG-I [76, 77]. Also, it will be interesting to determine if compensatory mutations can arise in RNA enzymatically stripped of specific modifications.
The accuracy of algorithm-based multiplexed modification calling remains a challenge for nanopore-based epitranscriptomics [74]. We have demonstrated that identification of modifications can be influenced by both sequence context and neighboring modifications, which are difficult to fully account for. To partially address this, in vitro transcribed or chemically synthesized RNA fragments, with and without modifications, can be used to perform corrections and baseline subtractions. These analyses should be useful for better algorithm training, especially for specific, defined transcriptomes like HIV-1, and represent a general tool for establishing nanopore-based reference data sets. However, the effect of neighboring modifications on modification calling is not well understood and can affect the fidelity of identification in complex samples.
Algorithms can only be expected to account for modifications and contexts they have been trained with. Similarly, new versions of algorithms, aimed at reducing noise, may lose sensitivity as a tradeoff for accuracy. During this study, ONT released the newest version of their RNA modification calling algorithm. We initially used the previous version to generate modification frequencies for all samples. The background of each modification under that algorithm (v5.1.0) (Supplementary Fig. S10) compared with the new algorithm (v5.2.0) (Supplementary Fig. S4) was significantly higher. However, the reduction in m^5^C and pseudouridine background calling with the new algorithm resulted in substantial loss of modifications at the baseline-corrected 10% threshold. Changes in significant modification calling (Supplementary Table S10) exemplify the variation that can occur across algorithms.
Several limitations apply to this study. First, modification calling is limited by the number of available reads. A low number of reads can result in distorted measurements of the fraction of reads modified at any given position. In addition, modification calling relies on the reads matching the reference sequence. If there is an indel or SNP it will be ignored by the nanopore modification calling algorithm. Sequence-dependent baseline subtraction of modification calls also depends on sequences matching. Thus, for patient samples bearing a number of mutations as compared to the reference, a new consensus sequence needs to be generated. For highest accuracy, a new baseline of the area surrounding the mutations should also be constructed. Regarding interpretation of results, cART treatment of cell culture does not perfectly mimic the long-term effect of suppressive therapy on PLWH and the possible effects of escape and compensatory mutations on the HIV genome. In addition, modification results are only candidates selected by the algorithm and require further confirmation, ideally by orthogonal techniques. While the creation of a preliminary antisense epitranscriptome is new, it will require additional investigation. The low number of copies available for sequencing create technical challenges and the contribution of antisense transcripts, beyond the asp gene, and their chemical modifications to HIV-1 biology and pathology remain unclear [59, 60, 78, 79].
Here, we demonstrate the straightforward utility of generating reference data sets that can establish a baseline for modification calling in a specific virus. Once established, this reference enabled rapid evaluation of sample types, conditions, and treatments to correlate modification patterns with potential function. Creating algorithms specific to the transcriptome of a virus, pathogen, or cell type, combined with synthetic modification contexts or mass spectrometry that defines the potential modification landscape, represents a practical next step in the evolution of high-fidelity epitranscriptomics. Furthermore, combining reference data sets in the future from multiple viruses or cell types should enable training of algorithms that can better address the shortcomings of multiplexed modification calling accuracy. Indeed, RNA from most samples will possess many modifications that are simultaneously present, requiring advanced tools to deconvolute and characterize their biological roles.
Supplementary Material
gkag220_Supplemental_Files
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Cerneckis J, Ming GL, Song H et al. The rise of epitranscriptomics: recent developments and future directions. Trends Pharmacol Sci. 2024;45:24–38. 10.1016/j.tips.2023.11.002.38103979 PMC 10843569 · doi ↗ · pubmed ↗
- 2Yan G, Xu Y, Xing X et al. NSUN 2-mediated RNA 5-methylcytosine modification of PTEN regulates cognitive impairments of mice with sleep deprivation and autophagy through PI 3K/AKT signaling. Neuromol Med. 2025;27:4. 10.1007/s 12017-024-08823-z.39753926 · doi ↗ · pubmed ↗
- 3Gao K, Yang X, Zhao W et al. NAT 10 promotes pyroptosis and pancreatic injury of severe acute pancreatitis through ac 4C modification of NLRP 3. Shock;2025;63:774–80., 10.1097/SHK.0000000000002551.39836947 · doi ↗ · pubmed ↗
- 4Zhang Q, Kang Y, Wang S et al. HIV reprograms host m(6)Am RNA methylome by viral Vpr protein-mediated degradation of PCIF 1. Nat Commun. 2021;12:5543. 10.1038/s 41467-021-25683-4.34545078 PMC 8452764 · doi ↗ · pubmed ↗
- 5Roundtree IA, Evans ME, Pan T et al. Dynamic RNA modifications in gene expression regulation. Cell. 2017;169:1187–200. 10.1016/j.cell.2017.05.045.28622506 PMC 5657247 · doi ↗ · pubmed ↗
- 6Cai Z, Song P, Yu K et al. Advanced reactivity-based sequencing methods for m RNA epitranscriptome profiling. RSC Chem Biol. 2024;6:150–69.39759443 10.1039/d 4cb 00215 f PMC 11694185 · doi ↗ · pubmed ↗
- 7Ron K, Kahn J, Malka-Tunitsky N et al. High-throughput detection of RNA modifications at single base resolution. FEBS Lett. 2025;599:19–32. 10.1002/1873-3468.15052.39543833 PMC 11726149 · doi ↗ · pubmed ↗
- 8Phillips S, Mishra T, Huang S et al. Functional impacts of epitranscriptomic m(6)A modification on HIV-1 infection. Viruses. 2024;16:127. 10.3390/v 16010127 .38257827 PMC 10820791 · doi ↗ · pubmed ↗
