Machine learning reveals sequence and methylation determinants of SaCas9–PAM interactions in bacteria
Dalton T Ham, Tyler S Browne, Claire Q Zhang, Gary W Foo, Aathavan S Uruthirapathy, Gregory B Gloor, David R Edgell

TL;DR
This study uses machine learning to identify factors affecting SaCas9 activity in bacteria, including DNA sequence and methylation patterns.
Contribution
The study introduces a machine learning model that reveals how DNA sequence and methylation influence SaCas9 activity in bacteria.
Findings
T-rich dinucleotides near the PAM site correlate with higher SaCas9 activity in bacteria.
Adenine methylation at GATC motifs inhibits SaCas9 activity, as shown by plasmid cleavage assays.
Avoiding methylated PAMs may be an evolutionary adaptation for SaCas9 to distinguish self from nonself DNA.
Abstract
Cas9 nucleases defend bacteria against invading DNA and can be used with single guide RNAs (sgRNAs) as antimicrobials and genome-editing tools. However, bacterial applications are limited by incomplete knowledge of Cas9–target interactions. Here, we generated large-scale Staphylococcus aureus Cas9 (SaCas9)/sgRNA activity datasets in bacteria and trained a machine learning model (crispr macHine trAnsfer Learning) to predict SaCas9 activity. Incorporating downstream sequences flanking the canonical NNGRRN protospacer adjacent motif (PAM) at positions [+1] and [+2] improved predictive performance, with T-rich dinucleotides at these positions correlating with higher in vivo activity. Crucially, SaCas9 showed \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs}…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7- —Canadian Institutes of Health Research10.13039/501100000024
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBacterial Genetics and Biotechnology · CRISPR and Genetic Engineering · RNA and protein synthesis mechanisms
Introduction
The Cas9 RNA-guided DNA endonucleases from the class 2 type II clustered regularly interspaced palindromic repeat (CRISPR) systems are a large family of paralogous proteins that function as the effectors of a bacterial adaptive immune system to restrict invading DNA [1–4]. Targeting of Cas9 to DNA sites requires a protospacer adjacent motif (PAM) sequence immediately adjacent to the protospacer [5], read out by the PAM interacting (PI) domain of Cas9 [6–8], and a CRISPR RNA (crRNA) that is complementary to the DNA protospacer [9]. Because the PAM site is only found adjacent to crRNA target sites of invading DNA [10], Cas9/crRNA cleavage at similar sites in the bacterial genome is inhibited. A trans-acting CRISPR RNA (tracrRNA) is required for processing of the crRNA transcript and for assembly of the Cas9 ribonucleoprotein complex [9]. The crRNA and tracrRNA are genetically fused as a single guide RNA (sgRNA) for most Cas9 applications [2].
Cas9 nucleases are multidomain proteins and differ from each other in coding size, the PAM sequence requirement, and the length of crRNA used for targeting [4]. The best-studied Cas9 nuclease—Streptococcus pyogenes Cas9 (SpCas9)—has a 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} -NGG-3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} PAM requirement, and uses 20-nt crRNAs. In contrast, the Cas9 nuclease—Staphylococcus aureus Cas9 (SaCas9)—is smaller (1053 residues), requires a canonical 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} -NNGRRN-3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} PAM (hereafter NNGRRN) with a preference for T at the sixth position, and uses 21- or 22-nt crRNAs [6, 11, 12]. Structural and biochemical studies of Cas9–target DNA interactions have identified base-discriminatory contacts by the PI domain to PAM sequences that have enabled engineering of Cas9 variants that modify binding to the PAM sequence [13–15].
Cas9 nucleases have been adapted as novel antimicrobials and for genome engineering in bacteria [16–20]. One critical aspect for Cas9 applications is the accurate computational prediction of activity at desired target sites while minimizing off-target toxicity [21, 22]. Most Cas9 activity predictors were developed for mammalian applications yet poorly predict sgRNA activity outside of the organism or cell line in which the training Cas9/sgRNA activity data were generated [23], and fail to accurately predict activity in bacteria [21]. This issue is compounded by the fact that there are few large-scale bacterial Cas9/sgRNA activity datasets for predictive modeling. To overcome this issue, we previously generated bacterial SpCas9/sgRNA activity datasets and developed an SpCas9/sgRNA predictor, crisprHAL (crispr macHine trAnsfer Learning) that generalized activity predictions to bacteria other than Escherichia coli [21].
Here, we expand the bacterial Cas9 toolbox by generating large-scale activity datasets for SaCas9 in E. coli and Citrobacter rodentium and use them to train crisprHAL for SaCas9 predictions. In stark contrast to other characterized Cas9 paralogs, we find that dinucleotides at positions +1 and +2 flanking the canonical SaCas9 NNGRRN PAM site are critical for activity and significantly contribute to accurate activity predictions. Strikingly, we also found that SaCas9 poorly cleaves target sites flanked by an NNGGAT[C] sequence, where NNGGAT is the PAM site and [C] is at the +1 position. Finally, we show that adenine methylation at position 5 of the PAM site reduces SaCas9 cleavage activity by \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \sim\end{document} 10-fold. Our results demonstrate that choice of target site is critical for SaCas9 bacterial applications and emphasize the diversity of protein–DNA interactions in the Cas9 family of proteins. Avoidance of adenine-methylated PAM sites may be an additional mechanism for self versus nonself recognition by SaCas9, or reflect evolutionary selection against protospacer acquisition from invading phage or plasmids that use adenine methylation as an antirestriction mechanism.
Materials and methods
Bacterial strains
Escherichia coli Epi300 (F \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \lambda\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^-\end{document} mcrA \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \Delta\end{document} (mrr-hsdRMS-mcrBC) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \phi\end{document} 80dlacZ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \Delta\end{document} M15 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \Delta\end{document} (lac)X74 recA1 endA1 araD139 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \Delta\end{document} (ara,leu)7697 galU galK rpsL (Str \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^R\end{document} ) nupG trfA dhfr) (Epicentre) was used for cloning the sgRNA pools. Screening sgRNA activity using a two-plasmid enrichment was done in NEB 5-alpha F \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} I \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^q\end{document} E. coli (F \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} proA \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^+\end{document} B \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^+\end{document} lacI \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^q\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \Delta\end{document} (lacZ)M15 zzf::Tn10 (Tet \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^R\end{document} ) /fhuA2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \Delta\end{document} (argF-lacZ)U169 phoA glnV44 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \phi 80\Delta\end{document} (lacZ)M15 gyrA96 recA1 relA1 endA1 thi-1 hsdR17) strain harbouring pTox. Escherichia coli NK5830 (recA arg/F \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} ) and E. coli NK5830 (recA arg /F \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} zjc::Tn5; KanR dam) were gifts of Dr David Haniford (Western University). Citrobacter rodentium DBS100 was used for screening of sgRNA activity against chromosomal targets.
Construction of sgRNA pools
Oligonucleotides used in this study are provided in Supplementary Table S1. A list of sgRNA target sites is provided in Supplementary Table S2. To create pTox+KatG, a 2-kb fragment corresponding to the Salmonella enterica Typhimurium LT2 katG gene was amplified by polymerase chain reaction (PCR) using primers DE6665 and DE6666, cloned into pTox using Gibson assembly, and verified by whole plasmid sequencing. We identified 548 sites with NNGRRN PAM sequences on pTox+KatG. The 21- or 22-nt DNA sequence upstream of these sites was computationally extracted and used for library construction. The sgRNA library also contained 20 nontargeting sgRNAs (Supplementary Table S2). Sequences that contain BsaI-HF-V2 restriction sites that generate correct overhangs for golden gate cloning were added to the ends of the sgRNA sequences for subsequent cloning. The sequence 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} -CCTGGTTCTTGGTCTCTCACG-3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} was added upstream of the sgRNA and 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} -GTTTTAGAGACCGCTGCCAGTTCATTTCTTAGGG-3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} was added downstream and ordered as single-strand fragments from Integrated DNA Technologies (IDT). Second strand synthesis was performed using 1 µg of single stranded pool DNA and equimolar amounts of primer DE5224 in NEB buffer 2 (50 nM NaCl, 10 mM Tris–HCl, 10 mM MgCl \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} {2}\end{document} , 1 mM dithiothreitol (DTT), pH 7.9) by denaturing at 94 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C for 5 min. Primers were annealed by decreasing temperature 0.1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C/s to 56 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C and holding for 5 min, and followed by decreasing temperature 0.1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C/s to 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C. To the annealed oligonucleotides, 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of Klenow polymerase (New England Biolabs) and 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of 10 mM dNTPs were added and incubated for 1 h at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C, followed by a 20 min incubation at 75 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C before being held at 4 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C. The resulting double-stranded DNA fragments were purified using a Zymogen DNA Clean & Concentrator-5 kit following manufacturer specifications. Golden gate cloning was used to clone the oPool and mPool into pEndo-SpCas9 and pEndo-TevSpCas9 by combining 6 pmol of oPool or mPool, 100 ng of backbone plasmid, 0.002 mg BSA, 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l T4 DNA ligase buffer (50 mM Tris–HCl, 10 mM MgCl \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} {2}\end{document} , 1 mM ATP, 10 mM DTT, pH 7.5), 160 units T4 DNA ligase (New England Biolabs), and 20 units of BsaI-HF-V2 (New England Biolabs) with the following thermocycler conditions: 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C for 5 min then 22 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C for 5 min for 10 cycles, 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C for 30 min, 80 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C for 20 min, and 12 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C inf. The resulting pool was then transformed by heat shock into E. coli Epi300 and plated on Luria Broth (LB) plates (10 g/l tryptone, 5 g/l yeast extract, 10 g/l sodium chloride, 1% agar) supplemented with 25 mg/ml chloramphenicol and 0.2% (w/v) D-glucose.
To construct the sgRNA pool targeting C. rodentium DBS100 (Citro sgRNA), we sequenced our in-house strain at the London Regional Genomics Centre (London, ON) and screened a 236-kb contig for NNGRRN PAM sequences (Supplementary Table S2). Sequence upstream (21 nts) was computationally extracted, with oligonucleotides ordered and cloned as described earlier.
Pooled sgRNA two-plasmid enrichment experiment
A two-plasmid enrichment experiment was used to assay sgRNA activity as previously described [24, 25]. For liquid selections, 50 ng of the sgRNA plasmid pool was transformed into 50 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l E. coli NEB 5-alpha F \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} I \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^q\end{document} competent cells harbouring pTox by heat shock. Cells were allowed to recover in 1 ml of nonselective 2× YT media (16 g/l, 10 g/l yeast extract, and 5 g/l NaCl) for 30 min at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C with shaking at 225 rpm. The recovery was then split and 500 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l was added to 500 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of inducing 2× YT [0.04% (w/v) L-arabinose and 50 mg/ml chloramphenicol] or to 500 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of repressive 2× YT [0.4% (w/v) D-glucose and 50 mg/ml chloramphenicol] and incubated for 90 min at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C with shaking at 225 rpm. The two cultures were washed with 1 ml of inducing media [1× M9, 0.8% (w/v) tryptone, 1% (v/v) glycerol, 1 mM MgSO \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} _4\end{document} , 1mM CaCl \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} _2\end{document} , 0.2% (w/v) thiamine, 10 mg/ml tetracycline, 25 mg/ml chloramphenicol, 0.4 mM IPTG] or repressed media [1× M9, 0.8% (w/v) tryptone, 1% (v/v) glycerol, 1 mM MgSO \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} _4\end{document} , 1 mM CaCl \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} _2\end{document} , 0.2% (w/v) thiamine, 10 mg/ml tetracycline, 25 mg/ml chloramphenicol, 0.2% (w/v) D-glucose], respectively, before addition to 50 ml of the same media that was used in the wash in a 250-ml baffled flask. These cells were grown overnight at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C with shaking at 225 rpm. Plasmids were then isolated using the Monarch Plasmid Miniprep Kit (New England Biolabs). The sgRNA locus was then PCR amplified using primers (Supplementary Table S1) containing Ilumina adapter sequence, four random nucleotides, 12-mer barcodes to specify the replicate, and plasmid-specific nucleotides at the 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} end. The resulting amplicons were sent for 150-bp paired-end Illumina MiSeq sequencing at the London Regional Genomics Centre.
Citrobacter rodentium depletion assay
In 10 independent reactions, 50 ng of the Citro sgRNA pool was electroporated into 100 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l C. rodentium competent cells and allowed to recover in 1 ml of nonselective 2× YT media (16 g/l, 10 g/l yeast extract, 5 g/l NaCl) for 30 min at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C with shaking at 225 rpm. The recovery was then split and 500 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l was added to 500 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of inducing 2× YT [0.4% (w/v) L-arabinose and 50 mg/ml chloramphenicol] or to 500 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of repressive 2× YT [0.4% (w/v) D-glucose and 50 mg/ml chloramphenicol] and incubated for 90 min at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C with shaking at 225 rpm. The resulting cultures were then added to 50 ml of LB (10 g/l tryptone, 5 g/l yeast extract, 10 g/l sodium chloride) supplemented with 25 mg/ml chloramphenicol and 0.2% (w/v) D-glucose and grown overnight at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C with shaking at 225 rpm. Plasmids were isolated using the Monarch Plasmid Miniprep Kit (New England Biolabs) and the sgRNA locus was then PCR amplified using primers (Supplementary Table S1) containing Ilumina adapter sequence, four random nucleotides, 12-mer barcodes, and plasmid-specific nucleotides at the 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} end. The resulting amplicons were sent for 150-bp paired-end Illumina NextSeq High Output sequencing at the London Regional Genomics Centre.
SaCas9 purification
SaCas9 was purified from E. coli T7 express harbouring a pET-11 SaCas9 construct with a C-terminal 6×-histidine tag. Cells were cultured at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C in LB media until it reached an A_600_ of 0.6–0.8, after which expression was induced with 1 mM IPTG at 16 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C for 20 h. After induction, cell culture was spun for 10 min at 4 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C at 6000 × g. Pellet was resuspended in binding buffer (0.5 M NaCl, 20 mM Tris–pH 8, 10 mM imidazole, 5% glycerol; 10 ml per gram of pellet) with the addition of protease inhibitors. Cells were lysed by sonication at an amplitude of 100% for 10 rounds with on and off times of 10 and 20 s, respectively. Lysed cells were spun for an hour at 4 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C at 4000 × g. Protein was filtered with a 0.22- \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} m filter and was run on a HisTrap HP His tag protein purification column. The column was washed with 10 column volumes of binding buffer, and then with 10 column volumes of a 50 mM elution buffer (0.5 M NaCl, 20 mM Tris–pH 8, 50 mM imidazole, 5% glycerol). Increasing levels of imidazole (100, 200, 300, and 500 mM) were used to elute protein. Fractions containing full-length SaCas9 were pooled and loaded onto a Superdex 200 column equilibrated with 50 mM of Tris–HCl (pH 8), 500 mM of NaCl, and 5% glycerol. Fractions containing full-length SaCa9 were concentrated using MilliporeSigma Amicon Ultra-0.5 Centrifugal Filter Units (Fisher Scientific). Representative gel images of SaCas9 purification are shown in Supplementary Fig. S1.
sgRNA synthesis
Synthesis of sgRNA was performed with a one-pot in vitro transcription reaction using HiScribe T7 High Yield RNA Synthesis Kit (New England Biolabs). In a 20 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l total volume, Klenow fragment (New England Biolabs), 6.67 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} M of sgRNA oligo, 6.67 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} M of universal tracr oligo, 0.125 mM of dNTPs, 10 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of 1× RiboMAX Express T7 Buffer (half of total volume), 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of T7 Express Enzyme Mix, and 2 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} l of nuclease free water were combined into a single PCR tube and incubated at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C for 4 h. RNA concentration was determined by Nanodrop.
SaCas9/sgRNA in vitro cleavage assays
SaCas9 was diluted using storage buffer [50 mM Tris–HCl (pH 8), 250 mM NaCl, 1 mM DTT, 5% glycerol] to a working concentration of 100 nM of the enzyme with 200 nM sgRNA. Proteins were prewarmed at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C for 10 min and added to a prewarmed reaction mixture consisting of 100 mM NaCl, 50 mM Tris–HCl (pH 8.0), 10 mM MgCl_2_, 1 mM DTT, and 5 nM pTox DNA. Reactions were stopped using 0.5 mM ethylenediaminetetraacetic acid, 20 mg/ml RNase A, 25 mg/ml Proteinase K, and 1× D-PBS (0.9 mM CaCl_2_, 2.7 mM KCl, 0.5 mM MgCl_2_-6H_2_0, 0.14 M NaCl, 15.2 mM Na_2_HPO_4_) and were incubated for 30 min at 37 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\circ }\end{document} C. Reactions were visualized on 1% agarose gels and the observed reaction rate (kobs) were calculated using exponential decay. Synthetic oligonucleotides with 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} 6-FAM (6-carboxyfluorescein) and 5-methyl cytosine modifications at the PAM[+1] C position, or with no methylation, were ordered from IDT. Reactions were performed as above with 13.5 nM annealed oligonucleotides and 100 nM SaCas9 and analyzed on a 15% denaturing urea polyacrylamide gel and imaged using a fluorescein filter on BioRad ChemiDoc system.
Whole-genome and plasmid methylation analysis
Genomic DNA was extracted from C. rodentium DBS100 using the NEB gDNA Extraction Kit (New England Biolabs, T3010) following manufacturer’s protocol. pTox was isolated from E. coli dam^+^ or dam^−^ strains. DNA was barcoded using the Rapid Barcoding Kit V14 (Oxford Nanopore, SQK-RBK114.96) with four barcodes (RB01–RB04). Samples were loaded onto an R10.4.1. MinION flow cell (Oxford Nanopore, FLO-MIN114) and run for 24 h. Modified bases were called and demultiplexed using Dorado (v0.9.6) with the sup,6ma,4mC_5mC model. Resulting BAM files were merged, aligned, sorted, and indexed using Samtools 1.6. Pertinent information on modified bases was extracted from the modified BAM file using Modkit (v0.4.4). Custom Python scripts indexed Citrobacter sgRNAs and determined if GATC motifs associated with PAM sequences were called as methylated using a 75% cutoff.
Dataset processing and curation
Illumina sequencing reads were merged using the program Usearch [26]. These reads were then separated by their unique 12-mer barcode sequence into control and experimental condition replicates. For the C. rodentium depletion experiment, the barcoded replicates are from the pretransformation guide validation, E. coli Epi300 control, and C. rodentium DBS100 experimental conditions. For the E. coli plasmid enrichment experiment, the barcoded replicates are from the E. coli Epi300 guide validation, E. coli NEB 5-alpha with SaCas9 repressed, and E. coli NEB 5-alpha with SaCas9 induced.
Score calculation and model input sequence encoding
RNA sequencing data are compositional and therefore must be transformed appropriately in order to obtain meaningful statistics [27, 28]. Using ALDEx2, read counts for sgRNA features across condition replicates were transformed into posterior probabilities and then converted to linear co-ordinates using the centre-log ratio [27]. We used the statistics relative abundance (rab.all), difference between condition (diff.btw), and difference within condition (diff.win) outputs. On-target activity scores used in model training and testing are the unmodified diff.btw scores from the enrichment data, which are expressed as log_2_ fold changes, and the inverse diff.btw scores from the depletion datasets. The diff.btw scores are a measure of absolute effect size (the magnitude of log_2_ fold change between groups given the input data).
For model training and testing, to ensure a suitable minimum dynamic range in scores, sgRNAs in the C. rodentium depletion dataset with control condition replicate read counts of 20 or fewer were removed. sgRNAs that were exact matches to the E. coli genome or had \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \le\end{document} 3 mismatches to the C. rodentium genome were also removed. For the two-plasmid pTox-KatG enrichment data, all sgRNAs with direct matches to an E. coli genome were removed to mitigate spurious errors related to off-target, rather than on-target, activity. Model training and testing data are provided in Supplementary Table S4.
Using one-hot encoding, model nucleotide sequences were converted from standard character representations ‘A’, ‘C’, ‘G’, and ‘T’ to binary matrices [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively. This process results in 4-by-N binary matrix inputs, with N equal to the input nucleotide sequence length passed to the model.
Model construction, training, and hyperparameter tuning
A pre-existing machine learning architecture, referred to as crisprHAL [21], is used as the template for model development. The crisprHAL architecture is composed of convolutional neural networks (CNNs), bidirectional gated recurrent units (BGRUs), and dense neural networks (DNNs) (Supplementary Fig. S2). These networks are arranged in a dual branch structure, sharing an initial CNN block before branching into (i) three subsequent CNN blocks followed by two DNN blocks converging to a DNN of size 1, and (ii) a BGRU followed by two DNN blocks converging to a DNN of size 1. The outputs of both branches of the model are then concatenated and passed to a final output DNN of size 1 with a linear activation function.
The one-dimensional CNNs use 128 filters with a window size of 3 and ‘same’ padding. CNN blocks comprise the CNN layer followed by a LeakyReLU activation function, a one-dimensional max pooling layer with a size of 2 and ‘same’ padding, and a dropout layer set at 0.3. The BGRU uses a size of 128, with a recurrent dropout rate of 0.2. DNN blocks comprise a DNN of size 128 or 64, for the first and second DNN layer, respectively, each followed by a LeakyReLU activation function and a dropout layer set at 0.3. The previously mentioned architectural hyperparameters are unchanged from the original crisprHAL model.
The Python package ‘Scikit-learn’ was used to split the data into training and testing sets, as well as for the generation of datasets used in five-fold cross validation [29]. The function ‘train_test_split’ was passed a fixed state of 1 for consistent splitting of the data. All train-test splits used in the construction and testing of the model split the data into 80% training-only and 20% testing-only sets to avoid inadvertent contamination. Only the training data are used in five-fold cross validation to ensure that the testing data remain unseen. All model training is performed using the ‘Adam’ optimizer and a mean squared error loss function. Using five-fold cross validation, we selected optimal hyperparameters for batch size, learning rate, epochs, and input sequence length.
Performance and evaluation of models
To evaluate the performance of our models we primarily report the Spearman rank correlation coefficient and Pearson correlation coefficient. We report Spearman correlations due to its prior use in the evaluation of Cas9-sgRNA prediction models, and its lack of dependence on a linear association between data points. However, it is evident that our input data and predictions are linearly correlated. We used the Scipy stats Python package functions ‘spearmanr’ and ‘pearsonr’ to calculate Spearman and Pearson correlations [30]. In-house Python scripts were used to identify mean scores for each single and dinucleotide position relative to the PAM.
Results
Prediction of SaCas9 activity with deep learning
To construct a SaCas9 activity prediction model, we used the existing crisprHAL architecture that leverages a dual-branch CNN and CNN-RNN (recurrent neural network) structure [21] (Supplementary Fig. S2). This architecture can accurately predict SpCas9 and TevSpCas9 on-target activity in different bacteria, and includes the ability to be used in an optimized transfer learning protocol. TevSpCas9 is a fusion of the I-TevI nuclease domain to SpCas9 that creates a dual cutting nuclease with an extended target site requirement as compared with SpCas9 [31]; TevSaCas9 is the analogous fusion to SaCas9 (Fig. 1A) [18]. Targeting of TevSaCas9 is dependent on the sgRNA target site and SaCas9–PAM interactions (Fig. 1B). However, not all SaCas9/sgRNA sites are substrates for TevSaCas9 because they lack an upstream CNNNG motif in the correct spacing for cleavage by the Tev nuclease domain, but are still cleaved by SaCas9 (Fig. 1A).
TevSaCas9/sgRNA–DNA target interactions and activity assays in bacteria. (A) Interactions between TevSaCas9, the sgRNA, and DNA target site (not to scale). Individual components of TevSaCas9 and the DNA target site are indicated. Zn, zinc finger in the I-TevI linker. The canonical PAM and flanking nucleotides at positions +1/+2 identified in this study as critical for SaCas9 activity are highlighted. (B) Summary of protein–DNA contacts in the PAM region by the PI domain of SaCas9, adapted from [6]. (C) Schematic of the pTox enrichment experiment in E. coli. (D) Violin plot of normalized activity for sgRNAs targeted to pTox-KatG grouped by the last nucleotide of the PAM sequence. (E) Schematic of C. rodentium sgRNA depletion assay. (F) Violin plot of normalized activity for sgRNAs targeted to the C. rodentium chromosome grouped by the last nucleotide of the PAM sequence.
We used two complementary experimental methodologies to generate large-scale (Tev)SaCas9/sgRNA activity datasets for model training. First, we used a well-established two-plasmid selection system in E. coli that includes a toxic plasmid expressing the CcdB DNA gyrase toxin (pTox) and a second plasmid (pEndo) that expresses (Tev)SaCas9 and an sgRNA targeting pTox (Fig. 1C) [21, 24, 25]. In this assay, active (Tev)SaCas9/sgRNA combinations will become enriched over time relative to cells expressing inactive or poorly active (Tev)SaCas9/sgRNA combinations that do not eliminate pTox and its encoded CcdB gyrase toxin. We targeted a pool of 1161 sgRNAs with 21- and 22-nt crRNA lengths to the same 548 sites on pTox (Supplementary Table S2). We found no difference in activity between sgRNAs with 21- or 22-nt length crRNA regions and used 21-nt crRNAs for following experiments (Supplementary Fig. S3 and Supplementary Table S3). Second, we used a pooled sgRNA depletion experiment that targeted 24 392 sgRNAs to the C. rodentium genome (Supplementary Table S2). In this experiment, active (Tev)SaCas9/sgRNA combinations become depleted relative to inactive or poorly active combinations because cleavage of the bacterial chromosome cannot be repaired thereby causing DNA replication fork collapse and cell death (Fig. 1E) [21, 32]. In both datasets, we were able to identify active from inactive or poorly active (Tev)SaCas9/sgRNA combinations. We found that sgRNAs that targeted sites with PAM sequences that had a T at the last position (NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf T}\end{document} ) showed higher activity than sites not with T PAM sequences (Fig. 1D and F and Supplementary Fig. S4), as previously described in eukaryotic systems [33].
Following data curation, the C. rodentium depletion dataset was separated into 80% training (n = 16 202) and 20% testing (n = 4051) sets. To obtain a suitable input sequence length for model training, we extended the 21-nt sgRNA target sequence upstream and downstream by 1 nt at a time, evaluating the impact of incremental nucleotide additions to model performance (Fig. 2A). Using five-fold cross validation to assess the relative contributions of additional nucleotides, we found that the extension of the input sequence upstream of the sgRNA target (PAM distal) provided no clear improvement to model predictive performance, as measured by Spearman correlation (Fig. 2A). Conversely, inclusion of the NNGRRN PAM sequence in the model provided a large performance boost, particularly for the last three nucleotides (RRN) (Fig. 2A). Intriguingly, inclusion of the [+1] nucleotide downstream of the PAM provided another notable performance increase. Models constructed on each 3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} nucleotide option of the NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf N}\end{document} PAM supported these observations, with a notable reliance of the model on the [+1] downstream nucleotide to obtain high predictive performance (Fig. 2A). Our final TevSaCas9/sgRNA crisprHAL model obtained a high correlation for predicted and observed activities for the C. rodentium depletion dataset, with 0.895 Spearman (Fig. 2B) and 0.902 Pearson correlation values (Supplementary Fig. S5).
Tev)SaCas9 target site sequence length model input optimization and model performance. (A) Spearman rank correlation between predicted and experimentally determined on-target activity using 10 repetitions of five-fold cross validation on the C. rodentium 80% training set. The primary test using all PAM sites is shown in black with the standard deviation across tests shown in dark grey. Secondary tests are separated by the final nucleotide of the NNGRR\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document} PAM: NNGRR\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document} (yellow), NNGRR\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document} (green), NNGRR\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document} (blue), and NNGRR\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document} (purple). All tested input sequence lengths include the 21-nt sgRNA target site sequence. Upstream sequence length tests extend the input sequence to include the −1 to −10 nucleotides upstream of the target site. Downstream sequence length tests extend the model input to include the NNGRRN PAM, and up to 10 nucleotides downstream of the target site. Correlations between experimentally determined and predicted on-target activity from the C. rodentium SaCas9 crisprHAL model on (B) the 20% held-out portion of the C. rodentium depletion dataset (n = 4051), (C) the NNGRR\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document} subset of the 20% held-out portion of the C. rodentium depletion dataset (n = 3187), and (D) the E. coli pTox plasmid enrichment test set (n = 363).
To determine if higher activity sgRNAs targeting sites with NNGRR PAMs were biasing model performance, we tested the model with a subset of sites that contain only NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf V}\end{document} (V = A, C, and G) PAM sequences (Fig. 2C). Model performance remained high, with Spearman and Pearson correlations of 0.860 and 0.865, respectively. We also examined performance on other NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf N}\end{document} PAM subsets (Supplementary Fig. S5). Unsurprisingly, when testing the NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf T}\end{document} subset, the model obtained a reduced Spearman correlation of 0.593, but a Pearson correlation of 0.802. This is likely due to a lack of diversity in the measured activity scores of sgRNAs targeting sites with NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf T}\end{document} PAMs relative to sites with NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf V}\end{document} PAM sequences. The model trained and tested on only NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf V}\end{document} data resulted in a test set Spearman correlation of 0.846 (Supplementary Fig. S5). We note that this NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf V}\end{document} performance is lower than the model trained on the full NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf N}\end{document} dataset, indicating that model performance is not solely dependent on the identification of higher activity sgRNAs that target sites with NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf T}\end{document} PAMs.
To test if the crisprHAL NNGRRN model trained on the C. rodentium depletion dataset was transferable to other bacteria and data generated by different methods, we used the curated pTox enrichment dataset collected in E. coli consisting of 21-nt length sgRNAs (Supplementary Table S4). We found that the NNGRRN crisprHAL C. rodentium model performed well on the pTox dataset, with Spearman and Pearson correlations of 0.864 and 0.891, respectively (Fig. 2D). The model also performed well on an NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf V}\end{document} subset of the pTox dataset (Spearman, 0.813; Pearson, 0.844) (Supplementary Fig. S5).
Collectively, these data show that the crisprHAL machine learning architecture originally developed for SpCas9 can be applied to SaCas9 to generate high confidence activity predictions in different bacteria (C. rodentium and E. coli) and can be applied to datasets generated by different activity measurements (enrichment versus depletion). Surprisingly, the model identified the SaCas9 PAM motif and downstream sequence as providing significant contributions to model performance. Previous SaCas9 prediction models have focused on the contribution of the sgRNA sequence and have not considered the PAM or downstream sequence as critical for predictive modeling.
Pyrimidine-rich dinucleotides downstream of the PAM sequence correlate with enhanced SaCas9 activity
Inclusion of sequence downstream of the PAM site had a noticeable impact on crisprHAL model performance. To understand if the model enhancement was due to specific nucleotide preferences at downstream positions, we examined the mean activity scores for all (Tev)SaCas9/sgRNAs from the C. rodentium depletion dataset across 34 nt encompassing 4-nt upstream of each sgRNA binding site, the 21-nt target site and 6-nt PAM, and 8-nt downstream (Fig. 3). Plotting these scores by individual nucleotide position confirmed the striking preference for T in last position of the PAM site (NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\bf T}\end{document} ), as well as a pyrimidine preference (Y = C or T) at the [+1] position downstream of the PAM (Fig. 3A). We next analyzed the contributions of dinucleotides across the same region (Fig. 3B), revealing a strong bias towards T-rich dinucleotides at positions N_6_[+1] and to a lesser extent at positions [+1][+2]. At these same positions, other dinucleotides were disfavored, including GC-rich pairs. Similar preferences at individual nucleotide or dinucleotide positions were observed with the smaller pTox enrichment dataset generated in E. coli (Supplementary Fig. S6).
Nucleotide preference across (Tev)SaCas9 target sites. (A) Heatmap of mean single or (B) dinucleotide activity score per position for all sgRNAs from the Citrobacter pooled sgRNA depletion dataset.
Low SaCas9 activity correlates with DNA methylation
To further explore nucleotide preferences at the PAM site and flanking sequence, we focused on the last three nucleotides of the PAM (positions 4, 5, and 6) and the flanking [+1] position (NNG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm RRN[N]}\end{document} ) (Fig. 4). The PI domain of SaCas9 makes base discriminatory contacts to positions 3, 4, 5, and 6 of the PAM site (Fig. 1B) [6], and grouping activity by these positions should reveal trends in PAM nucleotide preference. Notably, all sites with NNG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{{\rm {RR}{\bf T[N]}}}\end{document} PAM[+1] sequences had very high observed or predicted activity, except for a striking exception for sites with an NNG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{{\rm GA}{\bf T[C]}}\end{document} PAM[+1] (Fig. 4A and B). This analysis also showed that the majority of NNG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm RRV}\underline{{\bf [N]}}\end{document} PAM[+1] sequences tend to cluster at lower activity, although there are several combinations with activity equivalent to sites with an NNG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm RRT}\end{document} PAM, especially sites with NNG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{{\rm RRA}}\underline{{\bf [Y]}}\end{document} PAM[+1] sequences.
Nucleotide preference in the PAM and [+1] flanking position for sgRNA target sites in C. rodentium. Violin plots of (A) crisprHAL predicted or (B) observed activity for all tetranucleotide combinations at PAM positions 4, 5, 6, and [+1] flanking position separated by nucleotide identity at PAM position 6. See Supplementary Fig. S4 for violin plots for dinucleotide preference at PAM position 6 and [+1] position.
In considering why PAM[+1] sites with NNGGAT[C] sequences would be recalcitrant to SaCas9 cleavage, we noted that these sites included the sequence 5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} -GATC-3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ^{\prime }\end{document} (hereafter GATC). The sequence GATC is noteworthy because in E. coli and other proteobacteria it is a site for DNA adenine methyltransferase (DAM) that methylates the N6 position of A (m6A) within GATC sequences [34], with roles in DNA replication, mismatch DNA repair, gene regulation, and restriction-modification systems [35, 36]. PAM[+1] sites with an NNGG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm G}\end{document} T[C] sequence (where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm G}\end{document} replaces A at position 5) had high crisprHAL predicted and measured experimental activity (Fig. 4A and B). Similarly, PAM sites with NNGGAT[A], NNGGAT[T], and NNGGAT[G] sequences all had high activity, suggesting that PAM[+1] sequences containing the sequence GAT[C] with an A residue in the fifth position were responsible for the low measured and predicted SaCas9 activity. Other putative and partially characterized adenine methyltransferases have been identified in S. aureus, and it is possible that some may be active on sequences containing a GATC motif [37].
We used Oxford Nanopore sequencing to map methylated DNA sequences in C. rodentium and to correlate methylation status with sgRNA activity (Fig. 5A and B). Of the 20 253 sgRNAs with measured activity in C. rodentium, 438 targeted sites with NNGGAT[C] PAM sequences and all of these were m6A methylated on both strands at GATC sites (Fig. 5C and Supplementary Table S5). Strikingly, sgRNA sites with NNGGAT[C] PAM sequences had a mean observed activity of −2.70 compared to the mean activity of −0.27 for sites with NNGRRN PAM sequences. Notably, sgRNA sites with PAM sequences that differed from GATC by a single nucleotide had significantly higher activities than sites with GATC-containing PAM[+1] sequences (Fig. 5C). In particular, shifting the position of an sgRNA by a single nucleotide such that A of the GATC site was at PAM position 6 instead of position 5 increased sgRNA activity 12-fold (Fig. 5B). We also considered the impact of single adenine methylation events that occurred solely in the crRNA-binding region and not in the PAM sequence; the mean observed activity at these sites was −0.32 (Supplementary Fig. S6), very similar to the observed mean activity of −0.27 for sgRNAs targeting sites without any methylation in the crRNA or PAM regions. Thus, adenine methylation at sites with NNGGAT[C] PAM sequences impacts SaCas9 activity.
PAM adenine methylation is correlated with low sgRNA activity in C. rodentium. (A) Plot of m6A adenine methylation and sgRNA position in a representative region of the C. rodentium genome, with sgRNAs targeting both strands indicated. Red lines are sgRNAs with GATC sites in the PAM[+1] sequence. (B) Zoomed-in view showing the positions of sgRNAs that target sites with and without a GATC site in the PAM[+1] sequence and their observed activity in C. rodentium (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document}12-fold difference). Note that the two sgRNAs differ by a single nucleotide position that changes the position of the GATC site relative to the PAM sequence. (C) Boxplot of sgRNA activity in C. rodentium at target sites based on whether the sites contain an NNGGAT[C] methylated PAM (red text), sites with single nucleotide changes in the GATC sequence (blue text), or all PAM sites. The solid horizontal line is the median point and the whiskers indicate the maximum and minimum values. Outlier data points are shown as solid circles.*
We also mapped the positions of cytosine methylation relative to PAM sequences, finding that some sites with PAMs sequences that were cytosine methylated at positions 1 or 2 had low activity (Supplementary Fig. S7). The bimodal distribution of activities at these sites shows that the impact on cleavage was not as consistent as for adenine methylation of PAM[+1] sequences. Because SaCas9 does not make base-specific contacts to positions 1 or 2 of the PAM sequence [6], the reduction in cleavage is not likely to result from a disruption of specific binding due to cytosine methylation. In S. aureus, the Sau3AI restriction enzyme also recognizes GATC sites and methylation of the C (at the N5 position, m5C) within this motif protects against Sau3AI cleavage [38, 39]. We found that SaCas9 in vitro activity on a synthetic substrate that contained a target site with an NNGGAT[C] PAM sequence where the C at the +1 position was methylated (m5C) was equivalent to cleavage of a nonmethylated substrate (Supplementary Fig. S8). Based on these observations, we focused on characterizing the impact of adenine methylation in the PAM[+1] sequence on SaCas9 activity.
Adenine methylation inhibits SaCas9 cleavage
To further investigate the impact of adenine methylation of PAM sequences on SaCas9 activity, we purified pTox from E. coli dam^+^ and dam^−^ strains and confirmed the DAM methylation status by DpnI digests and Oxford Nanopore sequencing. Based on the pTox pooled enrichment experiment (Fig. 1C), we picked an sgRNA with very low observed cleavage that targeted a site with a GATC PAM[+1] sequence (sgRNA_1447_) (Fig. 6A) and a high activity sgRNA that targeted a site with a GAT[G] PAM[+1] sequence (sgRNA_295_) (Fig. 6D). We found that cleavage of methylated pTox (DAM+) by purified SaCas9 and in vitro transcribed sgRNA_1447_ was reduced \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \sim\end{document} 3-fold relative to cleavage of nonmethylated DNA (DAM−) (k_obs_DAM(+) 0.19 +/− 0.03 min^−1^ versus kobs_DAM− 0.63 +/− 0.22 min^−1^) (Fig. 6B and C). In contrast, cleavage by SaCas9/sgRNA_295 was not affected by DAM methylation status (Fig. 6E and F).
PAM adenine methylation inhibits SaCas9 activity. (A) Plot of methylated adenines (orange filled circles) in pTox by strand with black dots representing position of guides tested in vitro. sgRNA295 and 1447 that are tested below are indicated. (B) Sequence of pTox target site (position 1147) where SaCas9 activity is predicted to be impacted by DNA methylation (indicated by a dashed rectangle). The site-directed mutation to knockout the DAM site is indicated to the right. NTS, nontarget strand; TS, target strand. (C) Plot of disappearance of supercoiled pTox with the wild-type or mutated target site versus time, with pTox isolated from DAM(+) or DAM(−) E. coli strains. (D) Plot of kobs (min-1) calculated from plots in panel B for the different pTox plasmids isolated from DAM(+) or DAM(−) strains. (E) Sequence of pTox target site (position 1447) that is not methylated in the PAM+1 region, and the corresponding site-directed mutation to convert the PAM+1 to a DAM site. (F) and (G) are as for panels B and C above. For all plots, points are the mean value of three independent replicates with whiskers representing the standard deviation from the mean. Images of cleavage reactions analyzed by agarose gel electrophoresis are in Supplementary Fig. S9.
We next changed the sgRNA_1447_ PAM sequence on pTox to ACGG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm G}\end{document} T[C] from ACGG \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm A}\end{document} T[C] to knockout the GATC DAM site (Fig. 6A). This change equalized the SaCas9 in vitro cleavage rate on pTox isolated from DAM(−) and DAM(+) strains (Fig. 6B and C). In contrast, creating a DAM site in the PAM sequence of sgRNA_295_ by changing ATGGAT[G] to ATGGAT[C] reduced SaCas9/sgRNA_295_ cleavage of methylated pTox to k_obs_DAM(+) 1.1 +/− 0.1 min^−1^ from k_obs_DAM(+) 4.6 +/− 0.8 min^−1^, an \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \sim\end{document} 4.2-fold reduction in cleavage rate. To extend these observations, we assayed 10 additional sgRNAs targeting pTox that had GATC-containing PAM[+1] sequences (Fig. 7, Supplementary Fig. S10, Supplementary Fig. S11, and Supplementary Table S6); all showed methylation-sensitive cleavage phenotypes with mean k_obs_DAM(+) rates slower than k_obs_DAM(−) rates.
Impact of pTox methylation on SaCas9/sgRNAs in vitro activity. Shown is plot of kobs (min−1) rates for 10 sgRNAs targeting different sites on pTox determined from DNA isolated from DAM(+) or DAM(−) E. coli strains. Points are mean value of three replicates with error represented by standard deviation from the mean and plotted on log10 scale.
Collectively, these data show a compelling agreement between machine learning predictions of SaCas9 activity, mapped epigenetic modifications, and SaCas9 activity measured in vivo and in vitro. Our data indicate that adenine methylation of GATC sequences in the context of NNGGAT[C] PAM[+1] sites reduces SaCas9 activity \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \sim\end{document} 10-fold in vivo and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \sim\end{document} 6-fold in vitro relative to unmethylated GATC sites and to sites with NNGRRN PAM sequences. Furthermore, cleavage activity in vitro can be manipulated by single-point mutations in PAM[+1] sequences that create or ablate GATC methylation sites, providing a mechanistic link between methylation and SaCas9 cleavage activity.
Discussion
Here, we use a combination of high-throughput biological assays, machine learning predictions of activity, methylation sequencing, and in vitro assays to understand factors that contribute to SaCas9 activity in bacteria. We found that the sgRNA sequence, the PAM sequence and flanking nucleotides, and adenine methylation within the PAM sequence all influence SaCas9 activity. These factors are not mutually exclusive, and do not rule out cytosine (or other) methylation as contributing to activity.
Current guidelines for SaCas9 applications suggest using a 21- or 22-nt sgRNA complementary to a DNA target site upstream of an NNGRRN PAM sequence [11], with some evidence that NNGRRT PAM sequences support higher activity [33]. Available SaCas9 activity prediction models are biased towards mammalian applications and use activity data from in vitro methodologies or from integrated target or PAM site libraries in mammalian cells that measured activity in a fixed sequence context, rather than at sites in their native genomic context. Thus, current SaCas9 prediction tools are not trained with relevant activity data and are not suited for bacterial activity predictions. We found that using our large-scale SaCas9 activity dataset to train a general-purpose machine learning architecture [21], combined with appropriate data normalization and hyperparameter tuning, yielded very high prediction correlations in E. coli and C. rodentium. This finding suggests that accurate predictions could be developed for other Cas9 orthologs using the same machine learning architecture given sufficiently large, high-quality activity datasets generated in bacteria.
One striking finding from our activity datasets was the influence of nucleotides downstream of the PAM sequence on SaCas9/sgRNA prediction accuracy and activity. In particular, pyrimidine-rich dinucleotides at the +1 and +2 positions downstream of the PAM, in combination with a T preference at position 6 of the PAM, had a strong positive effect on activity. The preference of SaCas9 for a T in the last position of the NNGRRN PAM sequence was noted in the initial characterization of SaCas9 PAM requirements. Structural data show that Arg991 makes a direct contact to the O4 of thymine in position 6 of the PAM on the nontarget strand [6], providing a rationale for a T preference. However, our observation that downstream sequence strongly impacts activity suggests that the PAM site should be revised to include NNGRR \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm T}\end{document} H sequences, where H = T, C, and A. One possible reason for the discrepancy between our data and previous studies [11] could be due to co-operative binding of residues in the SaCas9 PI domain to the PAM sequence. Co-operative binding would reduce the information at individual positions in a sequence logos analysis and miss higher order relationships between adjacent nucleotides in the PAM flanking region. Interestingly, exonuclease footprinting and single-molecule studies revealed interactions of SaCas9 with flanking DNA sequence 6-bp past the PAM site [40]. It is possible that T-rich dinucleotides flanking the PAM site transiently stabilize the initial interaction of SaCas9 with the PAM site. We previously reported a similar observation for SpCas9 [21], where accuracy of activity predictions was increased by including downstream sequence. Interaction with DNA flanking the PAM site may be a general feature of Cas9 orthologs.
Surprisingly, we found that SaCas9 measured and predicted activity at sites with NNGGAT[C] PAM sequences was significantly lower than at target sites with other PAM sequences, including sites with NNGGGT[C] or NNGAAT[C] PAM sequences. In bacterial genomes, the sequence GATC is a methylated at the N6 position of adenine by DAM [35, 36] and other methylases. In the crystal structure of SaCas9 with a DNA substrate containing a TTGA \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \underline{\rm A}\end{document} T PAM sequence, the N6 of A5 (adenine in the 5th position) forms a water-mediated hydrogen with Asn985 [6]; methylation of N6 would disrupt this hydrogen bond. In our experiments, methylation at the GATC motif within the PAM site reduces SaCas9 activity by \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \sim\end{document} 6–10 fold. We also found that PAM sites with cytosine methylation at positions 1 or 2 impacted SaCas9 activity, although this effect was not as pronounced as for adenine methylation. While the mechanistic basis of how adenine or cytosine methylation reduces activity requires further study, it is possible that adenine methylation impacts specific recognition of the A5 position by SaCas9 in the PAM sequence to destabilize or delay formation of an active R-loop required for cleavage.
Previous to our study, it was generally thought that DNA methylation had little impact on Cas9 activity [41, 42]. From an evolutionary viewpoint, it would be advantageous for Cas9 paralogs to be insensitive to DNA methylation (or other modifications), as this would be a simple strategy for invading elements to counter Cas9 activity. Interestingly, phage T4 DNA, where cytosine is replaced by glucosyl-hydroxymethylcytosine, is resistant to SpCas9 restriction if the modifications lie within the sgRNA-binding site [43]. However, phage T4 can be restricted by SpCas9 when programmed with an sgRNA rather than with a crRNA/tracRNA combination, indicating that some high activity sgRNAs can overcome the DNA modifications. It is interesting to speculate why SaCas9 would avoid NNGGAT[C] PAM sequences—this may be an additional mechanism for self versus nonself recognition by downregulating cleavage at methylated genomic sites that match crRNA sequences. A number of Staphylococcal phages are predicted to carry DAMs [44] and avoidance of methylated NNGGAT[C] PAM sites in phage DNA may be an evolutionary adaptation of SaCas9 to counter DNA methylation by these phages as an antirestriction strategy. The native SaCas9 CRISPR system is not abundant in sequenced Staphylococcal genomes, making it difficult to correlate if naturally acquired protospacers map to phage or plasmid sequences with NNGGAT[C] PAM sites.
A recent in vitro profiling of PAM sequences of Cas9 paralogs identified many that could potentially contain motifs for DAM or other methyltransferases [45]. Understanding if DNA methylation impacts activity of Cas9 orthologs would enhance on-target activity predictions for bacterial applications, particularly given that adenine and cytosine methylation are almost universal in bacteria. In contrast, it is not clear that adenine methylation impacts SaCas9 activity in mammalian systems as evidence for adenine methylation is controversial [46] but appears widespread in nonmammalian eukaryotes [47, 48]. We suggest that avoiding target sites with NNGGAT[C] PAM sequences would be a simple way to negate this issue.
In summary, we have shown that the combination of high-quality activity datasets and a general-purpose machine learning architecture can identify previously unrecognized DNA sequence features and epigenetic modifications that influence Cas9 activity in a biological context.
Supplementary Material
gkaf1520_Supplemental_Files
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Mojica FJ, Díez-Villaseñor CS, García-Martínez J et al. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol. 2005;60:174–82. 10.1007/s 00239-004-0046-3.15791728 · doi ↗ · pubmed ↗
- 2Jinek M, Chylinski K, Fonfara I et al. A programmable dual-RNA–guided DNA endonuclease in adaptive bacterial immunity. Science. 2012;337:816–21. 10.1126/science.1225829.22745249 PMC 6286148 · doi ↗ · pubmed ↗
- 3Koonin EV, Makarova KS. Origins and evolution of CRISPR-Cas systems. Philos Trans R Soc B. 2019;374:20180087. 10.1098/rstb.2018.0087.PMC 645227030905284 · doi ↗ · pubmed ↗
- 4Koonin EV, Makarova KS, Zhang F. Diversity, classification and evolution of CRISPR-Cas systems. Curr Opin Microbiol. 2017;37:67–78. 10.1016/j.mib.2017.05.008.28605718 PMC 5776717 · doi ↗ · pubmed ↗
- 5Bolotin A, Quinquis B, Sorokin A et al. Clustered regularly interspaced short palindrome repeats (CRISP Rs) have spacers of extrachromosomal origin. Microbiology. 2005;151:2551–61. 10.1099/mic.0.28048-0.16079334 · doi ↗ · pubmed ↗
- 6Nishimasu H, Cong L, Yan WX et al. Crystal structure of Staphylococcus aureus Cas 9. Cell. 2015;162:1113–26. 10.1016/j.cell.2015.08.007.26317473 PMC 4670267 · doi ↗ · pubmed ↗
- 7Nishimasu H, Ran FA, Hsu PD et al. Crystal structure of Cas 9 in complex with guide RNA and target DNA. Cell. 2014;156:935–49. 10.1016/j.cell.2014.02.001.24529477 PMC 4139937 · doi ↗ · pubmed ↗
- 8Jinek M, Jiang F, Taylor DW et al. Structures of Cas 9 endonucleases reveal RNA-mediated conformational activation. Science. 2014;343:1247997. 10.1126/science.1247997.24505130 PMC 4184034 · doi ↗ · pubmed ↗
