Comprehensive analysis of repetitive extragenic palindrome sequences identified in bacteria and archaea using a new web-based tool, RepRanger
Oleg N. Murashko, Connor Morgan-Lang, Chen-Hsin Albert Yu, Hsin-Nan Lin, Anna Chao Kaberdina, Shin-Yu Kung, Vladimir R. Kaberdin, Sue Lin-Chao

TL;DR
A new web tool called RepRanger helps identify REP sequences in bacteria and archaea, revealing their roles in gene regulation and adaptation.
Contribution
RepRanger is a novel web-based tool for identifying REP sequences, uncovering their functional roles and sequence diversity across bacterial and archaeal genomes.
Findings
Over 4,000 REPs were identified in the E. coli MG1655 genome using RepRanger.
More than 50% of small noncoding RNAs in E. coli contain REP sequences.
REP sequences are conserved across pathogenic and environmental E. coli strains but not in commensal and laboratory strains.
Abstract
Repetitive extragenic palindromic (REP) sequences were first discovered in the intergenic regions of Escherichia coli and later found in phylogenetically distinct bacterial species. However, their biological roles and sequence conservation remain unclear. Considering the growing number of microbial genome sequencing and annotation projects, along with the lack of online tools for REP identification, we developed a new web-based platform, RepRanger, to rapidly identify and annotate putative palindromic elements, including REPs. Using RepRanger, we identified >4,000 REPs in the E. coli MG1655 genome. Moreover, 81 (~52%) of its 157 small noncoding RNAs (sRNAs; known to regulate gene expression) contain REPs. Further analysis using RepRanger, alongside other bioinformatics tools, yielded 10 REP consensus motifs. Notably, REPs potentially involved in translational control (i.e., located…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig 1
Fig 2
Fig 3
Fig 4
Fig 5
Fig 6| Number | sRNA | REP/RIT |
|---|---|---|
| 1 | sokC | – |
| 2 | sroA | – |
| 3 | sgrS | REP |
| 4 | ftsO | – |
| 5 | tff | – |
| 6 | ispZ | – |
| 7 | eyeA | – |
| 8 | nc1 | REP |
| 9 | sraA | REP |
| 10 | chiX | REP |
| 11 | ipeX | REP |
| 12 | sokE | – |
| 13 | sroC | REP |
| 14 | chiZ | – |
| 15 | sdhx | – |
| 16 | zbiJ | REP |
| 17 | rybA | RIT |
| 18 | rybB | REP |
| 19 | sraB | REP |
| 20 | C0293 | REP |
| 21 | narS | – |
| 22 | rttR | REP |
| 23 | mcaS | REP |
| 24 | fnrS | REP |
| 25 | ralA | – |
| 26 | micC | REP |
| 27 | rydC | REP |
| 28 | SokB | REP |
| 29 | mgrR | – |
| 30 | dicF | – |
| 31 | rydB | RIT |
| 32 | rprA | REP |
| 33 | ryeA | REP |
| 34 | sdsR | REP |
| 35 | micL | REP |
| 36 | 3'ETS leuZ | REP |
| 37 | sdsN | REP |
| 38 | dsrA | REP |
| 39 | rseX | – |
| 40 | isrC | REP |
| 41 | sibA | REP |
| 42 | sibB | REP |
| 43 | cyaR | REP |
| 44 | orzP | REP |
| 45 | micF | – |
| 46 | ryeG | REP |
| 47 | sroE | REP |
| 48 | timR | RIT |
| 49 | ryfA | REP |
| 50 | glmY | REP |
| 51 | ohsC | REP |
| 52 | ryfD | – |
| 53 | raiZ | REP |
| 54 | micA | REP |
| 55 | sokX | – |
| 56 | csrB | RIT |
| 57 | gcvB | REP |
| 58 | omrA | – |
| 59 | omrB | – |
| 60 | sibC | REP |
| 61 | och5 | – |
| 62 | sibD | – |
| 63 | sibE | REP |
| 64 | sraG | REP |
| 65 | arcZ | REP |
| 66 | ryhB | – |
| 67 | agrA | REP |
| 68 | agrB | – |
| 69 | arrS | REP |
| 70 | gadF | REP |
| 71 | gadY | – |
| 72 | rirA | – |
| 73 | istR | RIT |
| 74 | rbsZ | RIT |
| 75 | gImZ | REP |
| 76 | esrE | – |
| 77 | spf | RIT |
| 78 | csrC | – |
| 79 | cpxQ | REP |
| 80 | oxyS | REP |
| 81 | sroH | REP |
| 82 | malH | REP |
| 83 | pspH | REP |
| 84 | ryjA | – |
| 85 | aspX | REP |
| 86 | GO-10706 | – |
| 87 | ryjB | REP |
| Category | Type | Organism | Genome size, Mb | Gene number | GC, % | REP number | RIT number | Genome assembly |
|---|---|---|---|---|---|---|---|---|
| Pathogenic | Pathotype | |||||||
| Intestinal | Enteropathogenic (EPEC) | 5.5 | 5,706 | 50.5 | 4,224 | 1,766 |
| |
| 5.2 | 5,399 | 50.5 | 4,094 | 1,710 |
| |||
| 5.3 | 5,325 | 50.5 | 3,926 | 1,736 |
| |||
| Shiga toxin producing (STEC) | 5.3 | 5,346 | 50.5 | 4,133 | 1,818 |
| ||
| 5.7 | 5,993 | 50.5 | 4,333 | 1,946 |
| |||
| 5.7 | 5,950 | 50.5 | 4,374 | 1,957 |
| |||
| Verocytotoxin producing | 5.0 | 5,126 | 50.5 | 3,928 | 1,685 |
| ||
|
| 5.7 | 5,736 | 50.5 | 4,621 | 2,034 |
| ||
| 5.2 | 5,425 | 50.5 | 3,872 | 1,753 |
| |||
| Enterohemorrhagic |
| 5.9 | 5,986 | 50.5 | 4,734 | 2,113 |
| |
| 5.7 | 5,750 | 50.5 | 4,811 | 2,048 |
| |||
| 5.6 | 5,667 | 50.5 | 4,762 | 2,109 |
| |||
| Enterotoxigenic (ETEC) | 5.2 | 5,197 | 50.5 | 4,620 | 1,910 |
| ||
| 5.2 | 5,078 | 50.0 | 4,439 | 1,877 |
| |||
| 5.3 | 5,321 | 50.5 | 4,535 | 1,878 |
| |||
| Enteroinvasive | 5.0 | 4,876 | 50.5 | 4,082 | 1,759 |
| ||
| 5.0 | 4,976 | 50.5 | 3,676 | 1,616 |
| |||
| 5.2 | 5,053 | 50.5 | 4,105 | 1,786 |
| |||
| Enteroaggregative | 5.1 | 4,961 | 50.5 | 3,776 | 1,637 |
| ||
| 5.3 | 5,293 | 50.5 | 3,877 | 1,717 |
| |||
| 5.1 | 5,106 | 50.5 | 3,869 | 2,123 |
| |||
| Diffuse adhering (DAEC) | 5.3 | 5,226 | 50.5 | 4,110 | 1,763 |
| ||
| 5.1 | 5,461 | 51.0 | 4,173 | 1,764 |
| |||
| 5.0 | 5,213 | 51.0 | 4,113 | 1,749 |
| |||
| Adherent invasive (AIEC) | 5.3 | 5,238 | 50.5 | 4,281 | 1,882 |
| ||
| 5.0 | 5,148 | 50.5 | 3,861 | 1,709 |
| |||
| 4.8 | 4,627 | 50.5 | 3,964 | 1,691 |
| |||
| Extraintestinal | Uropathogenic (UPEC) | 5.2 | 5,097 | 50.5 | 4,348 | 1,879 |
| |
| 5.2 | 5,056 | 50.5 | 4,206 | 1,822 |
| |||
| 5.3 | 5,247 | 50.5 | 4,450 | 1,938 |
| |||
| Neonatal meningitis | 4.7 | 4,564 | 51.0 | 3,521 | 1,480 |
| ||
| 5.8 | 5,142 | 50.5 | 4,078 | 1,790 |
| |||
| 5.1 | 5,024 | 50.5 | 4,090 | 1,806 |
| |||
| Avian pathogenic (APEC) | 4.9 | 4,875 | 50.5 | 3,495 | 1,457 |
| ||
| 5.3 | 5,250 | 50.5 | 4,245 | 1,881 |
| |||
| 5.0 | 4,887 | 50.5 | 4,027 | 1,766 |
| |||
| Sepsis associated | 5.6 | 5,558 | 50.5 | 4,662 | 2,054 |
| ||
| 5.4 | 5,310 | 50.5 | 4,410 | 1,906 |
| |||
| 5.1 | 4,980 | 50.5 | 3,984 | 1,718 |
| |||
| Nonpathogenic | Serotype | |||||||
| Commensal | 5.1 | 4,878 | 50.5 | 4,218 | 1,798 |
| ||
|
| 4.6 | 4,422 | 51.0 | 3,935 | 1,709 |
| ||
| 4.8 | 4,672 | 50.5 | 4,021 | 1,775 |
| |||
| Environmental | 5.2 | 5,037 | 50.5 | 4,431 | 1,856 |
| ||
| 5.2 | 5,023 | 50.5 | 4,033 | 1,774 |
| |||
| 5.4 | 5,400 | 50.5 | 4,536 | 2,005 |
| |||
| Laboratory | 4.6 | 4,639 | 51.0 | 4,072 | 1,683 |
| ||
| 4.6 | 4,453 | 51.0 | 3,998 | 1,661 |
| |||
| 4.5 | 4,437 | 50.5 | 3,691 | 1,582 |
|
| Kingdom and phylum | Class | Organism | Genome size, Mb | Gene number | GC, % | REP number | RIT number | Genome assembly |
|---|---|---|---|---|---|---|---|---|
| Bacteria | ||||||||
| Pseudomonadota (formerly Proteobacteria) | Alpha | 4.8 | 5,724 | 65.5 | 6,569 | 782 |
| |
| 3.6 | 3,554 | 61.5 | 3,707 | 737 |
| |||
| 3.0 | 2,899 | 68.0 | 3,055 | 348 |
| |||
| Beta |
| 3.9 | 3,549 | 48.0 | 2,879 | 1,201 |
| |
|
| 4.2 | 3,871 | 57.0 | 3,552 | 1,170 |
| ||
|
| 2.7 | 2,644 | 69.0 | 2,466 | 301 |
| ||
| Gamma |
| 2.7 | 2,265 | 43.0 | 2,260 | 1,017 |
| |
| 4.6 | 4,639 | 51.0 | 4,072 | 1,683 |
| |||
|
| 2.6 | 2,546 | 74.5 | 2,506 | 125 |
| ||
| Delta |
| 3.7 | 3,259 | 46.5 | 2,529 | 1,040 |
| |
|
| 13.5 | 10,871 | 70.5 | 13,027 | 1,199 |
| ||
| 1.7 | 1,824 | 37.0 | 612 | 404 |
| |||
| Epsilon | 3.0 | 2,921 | 33.5 | 1,557 | 931 |
| ||
|
| 1.6 | 1,557 | 39.5 | 975 | 411 |
| ||
|
| 1.9 | 2,041 | 34.0 | 987 | 546 |
| ||
| Thermodesulfobacteriota |
| 2.3 | 2,279 | 45.0 | 1,297 | 577 |
| |
|
| 2.1 | 2,171 | 50.0 | 1,211 | 395 |
| ||
|
| 1.8 | 1,832 | 38.5 | 978 | 468 |
| ||
| Nitrospirota |
| 4.6 | 4,576 | 62.0 | 3,788 | 635 |
| |
|
| 2.6 | 2,603 | 54.0 | 1,940 | 459 |
| ||
| 2.0 | 2,060 | 34.0 | 873 | 527 |
| |||
| Aquificota |
| 1.6 | 1,728 | 27.5 | 710 | 647 |
| |
| 1.6 | 1,820 | 43.5 | 1,022 | 468 |
| |||
|
| 1.8 | 1,947 | 46.0 | 1,175 | 494 |
| ||
| Bacteroidota |
| 3.4 | 3,100 | 33.5 | 2,411 | 1,745 | I | |
|
| 5.5 | 4,745 | 33.5 | 4,118 | 2,884 |
| ||
|
| 2.8 | 2,410 | 46.0 | 1,903 | 728 |
| ||
| Acidobacteriota |
| 5.7 | 4,666 | 60.0 | 5,307 | 1,009 |
| |
|
| 10.8 | 6,706 | 57.5 | 7,776 | 2,149 |
| ||
|
| 3.7 | 3,103 | 61.5 | 3,109 | 576 |
| ||
| Spirochaetota | 4.7 | 4,342 | 42.5 | 3,926 | 1,775 |
| ||
|
| 1.0 | 970 | 27.5 | 432 | 356 |
| ||
| 2.6 | 2,371 | 61.0 | 2,012 | 412 |
| |||
| Chlamydiota |
| 1.1 | 1,017 | 40.0 | 789 | 418 |
| |
|
| 1.2 | 1,093 | 40.5 | 842 | 430 |
| ||
|
| 1.2 | 1,036 | 39.0 | 782 | 431 |
| ||
| Bacillota |
| 4.2 | 4,536 | 43.5 | 2,955 | 1,948 |
| |
|
| 4.9 | 4,228 | 35.5 | 3,197 | 2,220 |
| ||
|
| 2.4 | 2,593 | 46.0 | 1,007 | 610 |
| ||
| Actinomycetota |
| 5.0 | 4,633 | 65.0 | 5,512 | 705 |
| |
|
| 2.9 | 2,699 | 68.0 | 3,293 | 246 |
| ||
|
| 2.2 | 2,107 | 68.5 | 2,283 | 152 |
| ||
| Deinococcota |
| 4.7 | 4,854 | 64.0 | 4,744 | 810 |
| |
| 4.5 | 4,446 | 70.0 | 5,433 | 704 |
| |||
|
| 3.7 | 3,754 | 66.0 | 3,149 | 482 |
| ||
| Chloroflexota |
| 1.4 | 1,464 | 48.5 | 912 | 405 |
| |
|
| 4.2 | 3,751 | 54.5 | 2,724 | 904 |
| ||
|
| 3.6 | 3,083 | 59.5 | 2,214 | 789 |
| ||
| Cyanobacteria | 7.1 | 5,974 | 41.5 | 6,156 | 2,805 |
| ||
|
| 6.2 | 5,373 | 51.5 | 6,543 | 1,936 |
| ||
| 4.2 | 3,524 | 56.0 | 4,240 | 857 |
| |||
| Thermotogota |
| 3.0 | 2,700 | 45.5 | 1,814 | 824 |
| |
|
| 2.1 | 1,905 | 31.5 | 1,030 | 793 |
| ||
|
| 2.3 | 2,116 | 40.5 | 914 | 648 |
| ||
| Archaea | ||||||||
| Euryarchaeota |
| 1.6 | 1,714 | 31.0 | 1,190 | 876 |
| |
| 1.9 | 2,123 | 41.0 | 1,307 | 554 |
| |||
| 1.8 | 1,785 | 31.5 | 1,349 | 991 |
| |||
| Crenarchaeota |
| 1.8 | 2,012 | 55.0 | 2,002 | 395 |
| |
|
| 2.1 | 2,529 | 39.0 | 1,020 | 526 |
| ||
|
| 1.6 | 1,879 | 33.0 | 1,117 | 788 |
| ||
- —Academia Sinicahttp://dx.doi.org/10.13039/501100001869
- —Academia Sinicahttp://dx.doi.org/10.13039/501100001869
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Bacteriophages and microbial interactions
INTRODUCTION
Palindromic elements (PEs) contain inverted sequences that can fold into stable stem-loop secondary structures that are present in the genomes of all domains of life as well as in viruses. In bacteria, PEs include Rho-independent terminators (RITs) (1), repetitive extragenic palindromic elements (REPs) (for a review, see reference 2), and clustered regularly interspaced short palindromic repeat (CRISPR) sequences (for a review, see reference 3), all of which are mainly located in intergenic regions and are transcribed together with flanking genes. REPs were first found ~40 years ago in the enterobacteria Escherichia coli and Salmonella typhimurium (4), and then were subsequently identified and characterized in many other bacterial species (for a review, see reference 5). RNA sequencing revealed that nearly 80% of the 355 E. coli REPs annotated in the EcoCyc database (https://ecocyc.org) by 2015 are transcribed under aerobic conditions (6). REPs can be defined as a class of repetitive sequences that are (i) usually transcribed from extragenic regions (i.e., they seldom overlap with protein-coding regions) (4), (ii) palindromic (i.e., they form stem-loop structures) (4), (iii) relatively short (e.g., 14–40 nucleotides [nt] in E. coli; according to the EcoCyc database), (iv) highly similar (i.e., possess a common consensus motif) (7, 8), and (v) present in multiple copies per genome, depending on the species (for a review, see reference 5). E. coli REPs annotated in the EcoCyc database can be classified into two groups based on their location in an operon, i.e., 3*'*-terminal (in mono- and polycistronic operons) or intergenic (flanked by two coding regions of a polycistronic operon).
The preferential location of REP elements in close proximity to the gene ends has given rise to some hypotheses regarding their physiological roles, origin, and evolution (2). REPs in DNA are involved in chromosome organization, DNA topology, and nucleoid structures (9–11), with nucleoid DNA condensation potentially being mediated by REP-containing RNA. In particular, it has been reported that components of the REP325 element and at least one of its RNA products (nucleoid-associated noncoding RNA 4) play a role in bacterial nucleoid DNA condensation (6). Furthermore, several studies have indicated that the hairpin structure of a REP element in RNA can serve as a transcriptional attenuator, acting as a barrier to ribonucleolytic degradation and, hence, stabilizing REP-associated mRNAs and their intermediates (12–15). For instance, both in vitro and in vivo analyses have revealed that deleting REPs from the 3*'*-terminal untranslated regions of E. coli mRNAs reduces their half-life (15). Recently, ATP depletion was shown to stabilize REP-containing transcripts, based on whole-genome transcriptomics of the effect of fluoride treatment on E. coli gene expression under anaerobic conditions (16).
It has also been reported previously that nearly half of all E. coli REP sequences have the potential to stall ribosomes immediately upstream of the termination codon, resulting in endonucleolytic cleavage of the mRNA (17). In addition to the association of REPs with mono- and polycistronic mRNAs, they have also been identified within bacterial regulatory RNAs, such as small noncoding RNAs (sRNAs) (18). These sRNAs typically repress translation of target mRNAs by pairing with the ribosome-binding site, competing with initiating ribosomes, and often triggering mRNA decay (19). sRNAs are highly structured RNA molecules of 50–500 nucleotides, including trans-encoded sRNAs typically encoded within intergenic regions (20). Currently, 88 sRNAs are annotated in the EcoCyc database. Three E. coli sRNAs (SroC, C0362, and C0664) containing REP elements have already been discovered (21, 22). SroC destabilizes the sRNA GcvB and may act as a “sponge” by interacting with it (23). The functions of the other two REP-containing sRNAs, C0362 and C0664, remain unknown.
Given the important function of REPs in regulating genes, along with the ever-increasing number of bacterial genome sequencing and annotation projects and the availability of public bioinformatics tools, we felt that a web-based REP search platform would significantly facilitate our understanding of REP-related functions. The previously published ExtraTrain tool (24) is no longer accessible online. Consequently, we developed a new web-based tool, RepRanger (https://bc.imb.sinica.edu.tw/RepRanger/index.html), which can rapidly identify and annotate putative PEs, including REPs, in all publicly available annotated bacterial and archaeal genomes. Through comprehensive RepRanger-assisted analysis, we have discovered REPs in different phyla of bacteria and archaea, defined their consensus motifs, and described their occurrence and similarity across species. Overall, RepRanger can be utilized to enhance our understanding of REP sequence diversity, conservation, and potential functions across various genomes.
RESULTS
Development and use of the web-based tool RepRanger to predict and annotate candidate REPs
Although REP elements were discovered a long time ago, their biological roles, occurrence across many bacterial taxa, and sequence conservation all remained poorly defined. To fill these knowledge gaps, we have developed a novel web-based tool named RepRanger (Repetitive Extragenic Palindrome Ranger), which is aimed at facilitating prediction of REPs in the intergenic regions of annotated bacterial and archaeal genome sequence data. The RepRanger webserver (https://bc.imb.sinica.edu.tw/RepRanger/index.html) was developed in PHP and Python scripts. The workflow processes bacterial or archaeal genome sequence (FASTA) and annotation (GFF) data files from the NCBI Genome Database (https://www.ncbi.nlm.nih.gov/datasets/genome/) to generate output files containing candidate REPs according to the workflow depicted in Fig. 1A. A number of parameters (e.g., stem and/or loop length, stem-to-loop ratio, energy cutoff of the stem-loop or tail, and distance from the 5*'* end of the gene) can be adjusted via the RepRanger web interface (Fig. 1. Ba), where users can also input an email address to which a link to the results will be sent. RepRanger employs a multistep computational approach to identify palindromic sequences within the intergenic regions of a genome (see Materials and Methods for details), including decomposing genomes into k-mers, matching k-mer pairs to form palindromes, inferring hairpin structures, etc. These palindromic sequences are filtered based on the thermodynamics of RNA folding heuristics, the presence of uridine-rich tails characteristic of RITs, and several structural criteria pertaining to REPs (see Materials and Methods for details), ultimately generating a list of candidate REPs. This list can be visualized via the embedded Genome Browser (IGV.JS [25]) or downloaded in TSV, BED, and FASTA formats (Fig. 1. Bb). The TSV file (Fig. 1C) contains detailed annotation information of candidate REPs, including position, length, distance from the upstream and downstream open reading frames (ORFs), free energy, and sequence, among other data.
Workflow and use of the web-based tool RepRanger: (A) workflow for identifying candidate REPs in a genome, (B) screenshots showcasing the RepRanger user interface, including the prediction (Ba) and browse (Bb) pages, and (C) example of an output table (TSV file) generated by RepRanger.
Genome-wide prediction and analysis of REP candidates in Escherichia coli MG1655 using RepRanger
To test our newly developed tool, RepRanger, we used it to predict PE elements in the best-annotated bacterial genome, E. coli K-12 substrain MG1655 (26). According to the latest E. coli genome annotation (NCBI Reference Sequence: NC_000913.3), the maximum length of spacer/intergenic regions (i.e., regions potentially containing REPs) separating collinear genes is 6,176 base pairs (bp). Consequently, in RepRanger parameters, we used a maximum distance of 6,200 nucleotides between a REP and the upstream gene to ensure that our analysis encompassed all intergenic regions. We set 10 nt as the minimum cutoff length for PEs and other parameters (see Materials and Methods for details). RepRanger-assisted prediction identified 5,755 REP candidates, including 4,072 REPs (~71%) and 1,683 RITs (~29%) (Fig. 2A).
Putative REPs and RITs in E. coli MG1655, as identified by RepRanger, as well as their length distributions and chromosomal locations. Numbers (A) of REPs and RITs in the E. coli MG1655 genome, as well as their length distributions (Ba, b and Ca, b). (Da and b) Diagrams of the localizations of REPs and RITs in the E. coli genome, and (Dc and d) their number distribution per 100 kb of DNA on the (+) and (−) strands. The number of REPs or RITs is indicated for each section of the genome (47 in total; each represents a 100 kb region, with the 47th section representing 41,652 bp).
Further analysis of the predicted elements revealed that the lengths of REPs and RITs ranged from 10 to 71 nt (Fig. 2. Ba) and 10 to 60 nt (Fig. 2. Ca), respectively. Despite some differences in length distribution patterns, the vast majority of the predicted REPs and RITs (~88% and ~86%, respectively) ranged from 10 to 30 nt in length (Fig. 2. Bb and Cb, respectively), which is consistent with previously reported ranges (7, 8). Next, we explored if REPs and/or RITs are distributed uniformly across the genome or show signs of clustering or periodicity. We found that REPs and RITs were distributed throughout the entire genome on both the (+) and (−) strands, with no clear evidence of clustering or periodicity (Fig. 2. Da and b). Specifically, the number of REPs (per 100 kb of DNA) varied from 26 to 57 (95% confidence interval [CI]: ~42 to ~53) on the (+) strand and from 24 to 62 (95% CI: ~39 to ~49) on the (−) strand (Fig. 2. Dc). In comparison, the number of RITs varied from 7 to 28 (95% CI: ~16 to ~22) on the (+) strand and from 11 to 28 (95% CI: ~15 to ~21) on the (−) strand (Fig. 2. Dd). However, there are regions on the chromosome that contain a number of REPs or RITs falling well outside the 95% CI (Fig. 2. Dc and d).
Identity and occurrence of REP consensus motifs in Escherichia coli REPs
The first consensus motif for E. coli REPs has been proposed as GC(g/t)GATGGCG(g/a)GC(g/t) … (g/a)CG(c/t)CTTATC(c/a)GGCCTAC, based on approximately 30 annotated REPs (8). Subsequently, analysis was extended to all 356 REP elements annotated in EcoCyc, which revealed the following consensus sequence GCCGGATGCGGCGTGAACGCCTTATCCGGCCTACGA (7) that overlaps with the first reported motif (8). Advances in annotating the E. coli genome have led to a significant increase in the number of new REPs. Here, RepRanger identified a total of 4,072 REPs, which is approximately sevenfold more than the 697 REPs annotated in EcoCyc, indicating that all of these REPs do not necessarily have the same consensus motif. To test this idea, we employed the MEME web tool (27) and uncovered 10 consensus motifs (Fig. 3A) among the 4,072 REPs identified by RepRanger, with the top four hits having an E-value of less than 0.05. Our motif 1 is similar to the motif reported previously (7, 8), but we describe nine novel REP motifs for the first time. Next, we analyzed the occurrence of all 10 of these motifs among the 4,072 REPs using the FIMO web tool (28), which revealed that motif 1 displayed the highest occurrence (~50.5%) among the analyzed motifs (Fig. 3B). The variety of REP consensus motifs implies that they may be involved in different biological functions beyond currently known translational control (17) (see results below).
REP consensus motifs identified in E. coli MG1655 generated by the MEME tool, the genomic location of REPs potentially involved in translational control, and gene ontology (GO) analysis of REP-containing genes. (A) REP consensus motifs discovered in the E. coli MG1655 genome based on MEME analysis. Motifs with an E-value >0.05 are shown as partially transparent. Motif 1, which is similar to previously proposed motifs (7, 8), is highlighted by a black rectangle. (B) Occurrences of the REP consensus motifs. We used the FIMO web tool to calculate occurrences of the motifs (P-value <0.05). (C) Numbers of REPs located ≤15 or ≥16 nt from an upstream stop codon and (Da and b) the occurrences of the REP consensus motifs. (E) Locations of REPs along the E. coli chromosome. The red histograms in the plot represent the numbers of pathways that each gene is involved in, with green lines showing the links between genes that are in the same pathway. (F) Functional enrichment analysis of genes with REPs located ≤15 and ≥16 nt from an upstream stop codon. (G) Pathways of genes with REPs located ≤15 nt from the upstream stop codon (motif 4). Functional enrichment analysis was performed using the ShinyGO tool (http://bioinformatics.sdstate.edu/go/) with a false discovery rate (FDR) cutoff of 0.05. Fold Enrichment: measures the magnitude of enrichment (higher values indicate stronger enrichment and are an important metric of effect size). Pathway Genes: the total number of genes in a pathway or GO term. Pathways specifically associated with genes containing REPs located ≤15 or ≥16 nt from an upstream stop codon are indicated by black stars.
Identification of REPs potentially involved in translational control in E. coli
A previous study revealed that a significant proportion of REPs is located 1–20 nt away from the 3′ end of the nearest ORF and demonstrated that REPs can downregulate translation but only if they are positioned fewer than 16 nt from the stop codon (17). To search for REPs predicted by RepRanger that may function in translational control, we first categorized the REPs into two groups: (i) those with a distance from stop codon not exceeding 15 nt (i.e., putative translational regulators) and (ii) those located ≥16 nt away (unknown functions). We found that only 446 out of 4,072 putative REPs (~11% of the E. coli REPs identified) belonged to the first group (Fig. 3C) and, therefore, may regulate the translation of upstream genes, while the remaining 3,626 (~89%), i.e., the majority of REPs, belonged to the second group and may have other functions.
Since we identified a much higher number of REPs with a distance from the stop codon not exceeding 15 nt (n = 446) than previously found (17), first we examined if REPs that potentially downregulate translation possess a unique consensus motif. As shown in Fig. 3. Da and b, REPs hosting motifs 1 (~35%) and 4 (~26%) proved the most common among the putative translation-linked REPs (n = 446; Fig. 3. Da), with motif 1 also displaying the highest occurrence among the remaining 3,626 REPs (~54%; Fig. 3. Db). Subsequently, we analyzed the intergenic locations of REPs on the E. coli chromosome and determined the biological pathways in which each adjacent gene is involved (by EcoCyc) as well as the specific connections between genes in the same pathway (Fig. 3E).
In order to understand the function of the genes associated with the REPs in these two groups, we employed the ShinyGO web tool (http://bioinformatics.sdstate.edu/go/) (Fig. 3F). The REPs located within 15 nt of the nearest ORFs (group 1) were specifically associated with purine/pyruvate metabolism and arginine biosynthesis (Fig. 3F). The REPs located ≥16 nt away from their nearest ORF (group 2) were specifically associated with the metabolism of cysteine, methionine, alanine, aspartate, glutamate, glycine, serine, and threonine (Fig. 3F).
In addition, our data show that motif 4 (Fig. 3. Da and b) is the most common within putative translation-linked REPs, and it occurred in ~3.3-fold more frequently in group 1 than in group 2. The ShinyGO web tool revealed that the vast majority of genes with motif 4 in group 1 control carbohydrate metabolism (Fig. 3G).
Over half of sRNAs contain REPs
REP sequences can also be found within some regulatory RNAs, such as sRNAs that play an important role in bacterial stress adaptation (18). To date, 88 sRNAs have been annotated in the EcoCyc database, and 3 of them (SroC, C0362, and C0664) have been identified as containing REP elements. Given the regulatory roles of sRNAs, we postulate that REPs may have broader biological significance. Since the sRNAs subjected to our analysis do not contain an ORF and therefore were not annotated in the RepRanger output files, they were mapped to the E. coli genome using the Integrative Genomics Viewer (IGV) program (https://igv.org/doc/desktop/). This analysis uncovered that 51 (~58%) of the sRNAs host REPs (Table 1). In fact, the number of sRNAs that contain REPs might be considerably higher. Our recent analysis (29) of RNA sequencing profiles obtained for total RNA extracted from wild-type E. coli MG1655 made it possible to identify new putative sRNAs expressed in this strain (see Materials and Methods, and Table S1). Among the 69 newly identified sRNAs in the current study, we found that 30 (~44%) contain REPs. Collectively (Fig. 4A), we have found REPs in almost 52% (n = 81) of E. coli sRNAs (i.e., from among the 88 EcoCyc-annotated sRNAs and 69 newly identified ones).
REPs associated with E. coli MG1655 sRNAs, their consensus motifs, and putative biological pathways regulated by these sRNAs. (A) Number of REP-hosting and non-REP-harboring sRNAs and (Ba) occurrences of 10 REP consensus motifs (see Fig. 3A) in the REP-hosting sRNAs. (Bb) Putative pathways regulated by REP-hosting sRNAs sharing the most abundant motif 8. Location (Ca) of REP-hosting sRNAs and non-REP-harboring sRNAs along the E. coli chromosome. The sRNAs for which expression has been validated are shown (black arrow). (Cb) Validation of predicted REP-hosting sRNA expression via Northern blot analysis. A comparison of reads corresponding to mapped small RNAs in the E. coli genome is shown at the top of each Northern blot panel. The y-axis represents the number of RNA-seq reads. REPs are shown in blue, and sRNAs are represented by an empty rectangle, respectively. Hybridizations were performed with probes specific for the selected sRNAs under aerobic growth conditions. The 5S rRNA served as an internal loading control. The expected sizes (in nucleotides) of the full-length sRNAs are indicated. The molecular ladder was obtained by hybridizing total RNA with radiolabeled probes specific for rnpB (M1) RNA (377 nt), 6S RNA (183 nt), 5S rRNA (120 nt), and tRNA-Asn (75 nt). Three biological replicates were performed, and representative images are shown. (Da) Number of REP-hosting sRNA and non-REP-harboring sRNA putative target genes, and (Db) putative pathways regulated by REP-hosting and non-REP-harboring sRNAs. Pathways specifically regulated by REP-hosting or non-REP-harboring sRNAs are indicated by black stars.
We anticipate that some sRNAs are descended from the same original REP element, and therefore, certain REP motifs could be relevant to sRNA regulation. We analyzed the occurrences of all 10 motifs (see Fig. 3A) in the 81 REP-hosting sRNAs (Fig. 4. Ba) and found that motifs 5, 8, and 9 were the most common in the sRNA-associated REPs, with motif 8 displaying the highest occurrence (~21%). In addition, we assessed the sRNAs whose REP contains motif 8 (i.e., that with the highest occurrence) and observed that their targets control many genes involved in central carbon metabolism, biosynthesis of secondary metabolites, microbial metabolism in diverse environments, among others (Fig. 4. Bb).
We selected four of the newly discovered REP-hosting sRNAs to validate their genomic location (Fig. 4. Ca), and their expression was confirmed by Northern blotting (Fig. 4. Cb).
An increased number of REP-containing sRNAs may also enhance the number of putative targets controlled by these sRNAs. Indeed, a comparison of the sRNA targets identified by TargetRNA software (30) revealed 664 target genes for REP-hosting sRNAs and 649 target genes for non-REP-harboring sRNAs (Fig. 4. Da). Further analysis using the ShinyGO web tool revealed that REP-hosting sRNA target genes are specifically associated with controlling two-component systems, RNA degradation, and base excision repair, whereas the non-REP-harboring sRNA target genes apparently regulate starch and sucrose metabolism (Fig. 4. Db).
REP consensus motifs are similar across pathogenic and environmental E. coli strains but not in commensal or laboratory strains
With the aim of gaining insight into REP occurrence in the genus Escherichia of Family Enterobacteriaceae, we analyzed their presence and sequence similarity across the genomes of all major groups of E. coli pathogens (Table 2) (31) as well as each group of nonpathogenic E. coli strains (i.e., commensal, environmental, and laboratory). More specifically, we used RepRanger to analyze three representative annotated genomes from each of those groups (Table 2).
Since E. coli strains can be found in diverse environments, it is possible that REPs present in bacteria originating from the same environment share common motifs. The REP consensus motifs we identified in enterohemorrhagic E. coli (EHEC; intestinal pathogenic strains), avian pathogenic E. coli (APEC; extraintestinal pathogenic strains), and environmental nonpathogenic strains are highly conserved in other groups of E. coli (Fig. 5. Aa and b). Enteroaggregative (EAEC) strains presented the highest number of REPs with significant similarity to those identified in the other E. coli strains we analyzed. Together, these data indicate that REP consensus motifs are similar across pathogenic and environmental E. coli strains but differ from those found in commensal and laboratory strains.
Motif similarity in pathogenic and nonpathogenic E. coli strains. (Aa) Motif similarity among pathogenic and nonpathogenic E. coli strains and (Ab) total number of significant matches of the motifs. We used the TomTom web tool to calculate the similarity of the motifs (P-value <0.001) identified using MEME. Pathogenic E. coli: EPEC, enteropathogenic; STEC, Shiga toxin producing; VTEC, verocytotoxin producing; ETEC, enterotoxigenic; EIEC, enteroinvasive; DAEC, diffuse adhering; AIEC, adherent invasive; UPEC, uropathogenic; NMEC, neonatal meningitis associated; SEPEC, sepsis associated.
REPs and RITs are widely distributed across bacteria and archaea
In order to compare the number of REPs between members of bacteria and archaea, we used RepRanger to identify putative REPs in representative genomes from the two domains. Some phyla have a small number of species and available annotated NCBI sequencing data; therefore, only major phyla with sufficient annotated NCBI sequencing data were used. Three species per phylum were selected for analysis, as defined in a previous study (32), as well as from each class within the Pseudomonadota (formerly Proteobacteria). We found that REPs and RITs are widely distributed across bacteria and archaea, and their numbers of REPs and RITs were largely proportional to the size of the corresponding genome (Fig. 6A).
Numbers of putative REPs and RITs discovered by RepRanger in different bacteria and archaea and REP motif similarity. (A) REPs and RITs normalized to the size or GC content (B) of the cognate genomes. Each black dot or triangle represents one species of bacteria or archaea, respectively. The confidence bands above and below the fitted line represent 95% confidence intervals (in blue and green for REPs and RITs, respectively). (Ca) Motif similarity among the bacterial and archaeal strains assessed in this study and (Cb) total number of significant matches of the motifs. We used the TomTom web tool to calculate the similarity of the motifs (P-value <0.001) identified using MEME.
Bacteria are known for the variability of GC content in their genomes. Since GC pairs are generally more stable than AT pairs, it has been proposed that GC-rich genomes are better adapted to high temperatures than AT-rich ones (33). Consequently, the increased stability and functionality of PEs with higher GC content may also promote their retention in organisms having GC-rich genomes. We compared the numbers of REPs and RITs in bacteria and archaea having high or low GC content. As illustrated in Fig. 6B, the number of REPs is proportionally higher in GC-rich genomes, whereas the number of RITs appears to decrease with increasing genomic GC content. Examples including Baaleninema simplex (Cyanobacteria), Pyrolobus fumarii (Crenarchaeota), and Lichenicola cladoniae (Alphaproteobacteria) had 1,055 (51.5% GC), 1,112 (55.0% GC), and 1,369 REPs (65.5% GC), respectively, while Borrelia bissettiae (Spirochetota) had 89 RITs (27.5% GC).
Comparison of REP consensus motifs reveals relatively low sequence similarity across bacterial and archaeal genomes
Our analysis of REPs indicates that certain motifs could potentially be different in phylogenetically distant organisms, i.e., bacteria and archaea. To evaluate sequence conservation of the REPs identified in bacteria and archaea using RepRanger, we again employed MEME to identify the consensus motifs in the REP sequences, which were then compared using TomTom (34) (Fig. 6. Ca and b). We found that REPs identified in Spirochaetota display the greatest motif similarity to those found in other bacterial and archaeal species (Fig. 6. Ca and b). In contrast, motifs shared among REP sequences in the Thermotogota exhibit no similarity to those of other bacteria and archaea. The REP motifs found in delta-Proteobacteria presented the highest total number of significant matches to REP motifs of other Proteobacteria, whereas those of alpha-Proteobacteria exhibited the lowest similarity. In terms of archaea, Euryarchaeota exhibited higher motif similarity to other taxa than Crenarchaeota (Fig. 6. Ca and b).
DISCUSSION
PEs are encoded in the genomes of organisms belonging to all domains of life. Although bacterial genomes are smaller and more compact than those of eukaryotes, they still contain a substantial number of PEs, which perform various biological functions (35, 36). Among them, REP elements are small palindromic sequences of mostly 20–50 nt in length. Numerous REPs have already been found in bacterial genomes, ranging from ~700 in E. coli MG1655 (697 REPs, https://ecocyc.org) to thousands in some Pseudomonas strains (37). REPs have been implicated in a variety of cellular functions, including genome structuring and plasticity, modulation of stress responses as well as the regulation of gene expression at both transcriptional and post-transcriptional levels (2, 8, 10, 12, 17, 38–40). However, previous experimental data and bioinformatics analyses have not yet provided a complete understanding of the biological roles of REPs. Exploring the biological functions of REPs is complicated by the fact that they have only been annotated in a few model organisms, and their presence in different classes of bacteria and archaea has not been examined systematically.
Given the growing number of whole-genome sequences available for a diverse range of distantly related microorganisms, it is now possible to identify and compare various regulatory elements across different bacterial and archaeal species. Previously, REPs had been identified in >80 species of 8 different prokaryotic phyla (41). Here, we used our novel RepRanger tool to search for REP-like sequences in the genomes of 16 different bacterial and archaeal taxa, corresponding to the main branches of the latest phylogenetic trees (32). Our analysis has revealed that REPs and RITs are widespread in both those domains of life (Table 3).
We used RepRanger to search for palindromic elements in the genome of the model organism E. coli MG1655. We found approximately sixfold and approximately fivefold more REPs and RITs (Fig. 2A), respectively, in the E. coli MG1655 genome compared to those annotated in the EcoCyc database. The close proximity of some REPs to a stop codon may indicate that they are involved in regulating translation, especially given the fact that REPs occurring within 15 nt of an upstream stop codon stall ribosomes (17). This latter study proposed that 260 of the 496 REP sequences identified by the authors act in translational regulation. In the current study, RepRanger identified 446 of 4,072 REPs as meeting the criterion of being within 15 nt of a stop codon and thus likely to contribute to regulating translation (Fig. 3C), including the previously reported nrdAB REP-containing transcript (17). Nevertheless, although we identified a larger number of putative translation-regulating REPs, the proportion of all E. coli REPs we identified is ~1.7 times smaller than that reported previously (17), likely due to the larger REP data set we generated herein. Further experimental work will be necessary to test if all of these REPs are indeed involved in translational control.
Some REPs might play essential roles in regulating gene expression at the post-transcriptional level. Previous studies have shown that stable structures formed by REPs can serve as a barrier to RNA degradation by exoribonucleases, thereby stabilizing mRNA transcripts and their decay intermediates (13, 15). Our RepRanger tool could prove useful in detecting upstream and downstream fragments stabilized by REPs and/or RITs and in identifying the distribution and frequency of potential RNase cleavage sites in their vicinity.
One criterion for identifying REPs is their strong sequence similarity, which is characterized by a shared consensus motif. However, the REP sequence consensus motif(s) is not yet well defined. Although an initial consensus sequence (based on ~30 E. coli REPs) of GC(g/t)GATGGCG(g/a)GC(g/t) … (g/a)CG(c/t)CTTATC(c/a)GGCCTAC was proposed almost 40 years ago (8), a previous positional analysis using a considerably larger number (356) of annotated REP elements revealed that 224 of them share a slightly different GCCGGATGCGGCGTGAACGCCTTATCCGGCCTACGA motif (7). There are currently 697 REPs annotated in the EcoCyc database. It is likely that as the number of identified REPs increases, the number of conserved motifs and their consensus sequences will change. In the present study, we identified even more REPs and determined consensus motifs across the entire set of E. coli REPs. We used the MEME web tool (27) to assess REP sequence similarity and found that we could cluster the 4,072 putative E. coli MG1655 REPs into 10 groups based on distinct motifs, with the top 4 hits having an E-value of ≤0.05 (see Fig. 3A). Among these 10 motifs, motif 1, which is shared by ~50.5% of the 4,072 REPs (Fig. 3B), is very similar to previously proposed motifs (7, 8), whereas the other motifs appear to be novel.
In addition to being present in coding transcripts, we also identified REPs within sRNAs, both those annotated in the EcoCyc database and newly identified sRNAs in this study (see Table S1), all of which may exert essential roles in bacterial adaptation to stress. Using RepRanger, we found that in addition to the three known sRNAs hosting REP sequences already annotated in the EcoCyc database (i.e., SroC, C0362, and C0664), several additional sRNAs host REPs. In fact, our analysis uncovered that 81 sRNAs (~52%) in E. coli MG1655 are associated with REPs belonging to 10 different groups and that the REP sequences in each of those groups possess similarity to a specific motif. Moreover, REP sequences showing similarity to motifs 5, 8, and 9 proved the most common in our data set, indicating that these consensus sequences might help sRNAs to exert their functions. Nevertheless, the functions of REPs in sRNAs require further study.
We also assessed the conservation of REP sequences across several pathogenic and nonpathogenic E. coli strains. Although REPs are present in all of the pathogenic strains we analyzed (see Table 2), the REPs in a few strains (i.e., EHEC, APEC, and environmental strains) appear to share sequence similarity with many REP sequences of other pathogenic strains (Fig. 5. Aa and b). We observed that the REP consensus motifs we have identified are similar among pathogenic and environmental E. coli strains but differ from those found in commensal and laboratory strains.
Furthermore, using RepRanger combined with other bioinformatics tools enabled us to predict and classify bacterial and archaeal REP motifs based on their similarity (Fig. 6. Ca and b). Interestingly, we uncovered high sequence similarity of Spirochaetota REPs to those identified in other bacterial taxa. In contrast, Thermatogota REPs showed no sequence similarity to those of other bacteria or archaea. Prior to the current study, published information on the REPs in those strains had been lacking.
In summary, our findings demonstrate the high versatility of RepRanger in facilitating the discovery and analysis of PEs in the genomes of two domains of life (bacteria and archaea). We found that ~52% of the stress-related sRNAs in E. coli MG1655 possess REPs. Moreover, our extensive analysis of the PEs present in bacteria and archaea has revealed that REPs are widespread in both kingdoms of life and allowed us to uncover consensus motifs shared by different groups of these genetic elements.
MATERIALS AND METHODS
Prediction of putative REP candidates by RepRanger
To find candidate REPs in a genome’s intergenic regions, the following steps are taken. First, the intergenic regions are broken into k-mer sequences (default k = 6). Next, palindromic k-mer pairs are identified and filtered based on a distance threshold (50 bp). These sequences are then extended in both directions until a mismatch is encountered, and then adjacent k-mer pairs are assembled if the mismatch gaps are below a certain length (default = 2). Finally, REPs are distinguished from Rho-independent transcription terminators by the presence of thymine-rich tails. Specifically, RITs are characterized by distinct structural features—short stem-loop hairpin formations immediately followed by a thymine-rich sequence at their 3′ end. In this study, RITs were identified using the methodology described in equation 1 of the TransTermHP method, as detailed by Kingsford et al. (42). This equation calculates a “tail score,” which assigns greater weight to thymine residues located within the 15 nucleotides at the 3′ tail of the hairpin structure (weighting parameters: T = 0.9; others = 0.6). The threshold for the tail energy score is set at −2.5. In addition to this tail score threshold, all other parameters used to evaluate the energy of potential hairpin structures were directly adopted from the TransTermHP method. These parameters, detailed in Table 1 of Kingsford et al. (42), include specific energy values for nucleotide pairings (such as G-C, A-T, and G-T), as well as penalties for mismatches, gaps, and loop formations [Loop_pen(n) = 1 × (n − 2)]. The specific energy values used were G-C pairing at −2.3, A-T pairing at −0.9, G-T pairing at 1.3, a mismatch penalty of 3.5, and a gap penalty of 6.0. All thresholds utilized in our methodology for predicting RITs were adopted directly from the Kingsford et al. study. Their comprehensive approach was originally applied to predict terminators across 343 bacterial and archaeal genomes, representing the full collection of complete prokaryotic genomes available in GenBank at the time. Our study adheres to these established and validated parameters to enable robust terminator identification. Therefore, we adopted the above well-established criteria (including threshold values) to filter out RITs and classified all remaining PE elements as putative REPs.
The default parameters used in RepRanger—such as palindrome length (minimum: 5 nt, maximum: 32 nt), loop length (maximum: 11 nt), stem length (minimum: 1 nt), distance to the gene 5′ end (maximum: 200 nt), stem-to-loop length ratio (minimum: 0.2), mismatch gap between adjacent k-mer pairs (maximum: 2 bp), energy cutoff for hairpin structures (–0.2 kcal/mol), and tail energy cutoff (–2.5 kcal/mol)—are derived from the 697 annotated REPs in the EcoCyc database and ensure that the identified palindromes are both relevant and structurally suitable to function as REPs.
For our analysis, we selected a minimum cutoff length of 10 nt for PEs, based on the length required to form a minimal secondary structure (Fig. S1), as predicted using the RNAfold web server (http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi).
Prediction and comparison of REP consensus motifs
We classified REP sequences into two categories: (i) pathogenic and nonpathogenic E. coli, and (ii) other bacteria and archaea. We then used the MEME Suite (27) to generate motifs for the identified REP sequences. MEME employs the expectation-maximization algorithm to discover novel ungapped motifs in unaligned sequences. In the expectation step, it calculates the probability of each possible motif occurrence in the sequences, given the current motif model, and it updates the motif model to maximize the likelihood of the observed sequences in the maximization step. The expectation and maximization steps are repeated iteratively until convergence.
Next, we used TomTom (34) to compare these motifs between two groups. TomTom ranks the motifs based on their similarity, providing P-values (default: 0.001) to indicate the statistical significance of the matches. This comparison helps to identify similar motifs among REP groups, thereby providing insights into their potential biological functions.
We estimated similarity scores between any two REP groups according to three variables:
Na = number of unique motifs in group A.Nb = number of unique motifs in group B.Nab = number of similar motifs identified between groups A and B.
The similarity score was then defined as Nab/(Na + Nb).
Novel sRNA prediction
We established the following criteria for identifying new putative sRNAs using an RNA-seq data set published previously (NCBI GEO accession no. GSE189154) (29):
sRNA length is ~50–400 nt;located in intergenic regions (20);possible overlap with the coding region of annotated genes;expression profile of each putative sRNA is different from the flanking genes/regions;sRNA expression levels are different under differential growth conditions (i.e., aerobic vs microaerobic) or different genetic background (i.e., MG1655 wild type vs Rned823-850 mutant);binds to the RNA chaperone Hfq (43).
Bacterial strain, growth condition, RNA isolation, and Northern blot analysis
To prepare subcultures from fresh overnight cultures, the E. coli MG1655 strain was grown overnight at 37°C for 16 h in M9 medium supplemented with 0.4% glucose and trace elements [462.56 µM H_3_BO_3_, 34.71 µM MnCl_2_, 10.80 µM FeCl_3_, 7.72 µM ZnSO_4_, 3.16 µM CuSO_4_, 2.30 µM (NH_4_)6_Mo_7_O_24, 1.68 µM Co(NO_3_)2]. A 16-h fresh overnight culture was diluted into 750 mL fresh M9 medium (to OD_460_ = 0.04–0.05) in a 1-L fermentation vessel chamber (Winpact Parallel Fermentation System FS-05-220, Saratoga, CA, USA). To create cultivation conditions, air was continuously pumped into the chamber at a rate of 0.4 L/min (liters per minute). The culture was grown at 200 rpm, 37°C, and maintained at pH 7.0 by automatic titration with sterile 1 M KOH. The culture was harvested at OD_460_ = 0.5–0.6 for further analyses. In brief, 42 mL of culture from multiple biological replicates was collected into 50 mL tubes with 7 mL (1/6 vol) of ice-cold stop solution (5% phenol and 95% ethanol [vol/vol]) for RNA isolation (see details below). Bacterial pellets were harvested following centrifugation at 4,000 × g, 4°C for 15 min, and stored at −80°C before use.
Total RNA was extracted as described previously (44). In brief, bacterial pellets were resuspended in 4 mL KJ medium (50 mM glucose, 25 mM Tris-HCl pH 8.0, 10 mM EDTA pH 8.0, 100 mM NaCl), lysed by placing into boiling 4 mL buffer (0.2 M NaCl, 20 mM Tris-HCl pH 7.5, 40 mM EDTA, 0.5% SDS), and boiled in a boiling water bath for 45 s before adding 4 mL of acidic phenol (pH = 4.5). The solution was mixed gently by slowly inverting the tube ~20 times. Total RNA was extracted in aqueous phase by centrifugation at 4,000 × g, 4°C for 1 h. The RNA was precipitated in 1 vol of isopropanol and 1/10 vol of 3 M sodium acetate (pH 7.8) at −20°C. All RNA samples were maintained in isopropanol at −20°C before use. When RNA isolation was performed on aliquots, the same volume of culture from the same batch of biological replicates was used for Northern blot analysis.
For Northern blot analysis, RNA was separated on a 7 M urea gel with 6% or 8% polyacrylamide (acrylamide/bis-acrylamide 19:1) in 0.5× Tris-borate-EDTA (TBE) buffer and electrophoresed at 120 V until the xylene cyanol dye had reached 3/4 of the length of the gel. The RNA was transferred onto Zeta-Probe Blotting membranes (Bio-Rad, Hercules, CA, USA) at 400 mA (100 min at 4°C) in 0.5× TBE buffer and cross-linked to the membrane using a Stratalinker 2400 UV Crosslinker (Stratagene). The membrane was pre-blotted with ULTRAhyb Ultrasensitive Hybridization Buffer or ULTRAhyb-Oligo hybridization buffer (Invitrogen) for 2 h at 65°C or 42°C, respectively. An antisense 5′-end [γ−32P]-labeled DNA oligo probe was used to detect target sRNAs. Probe sequences are listed in Table S2. T4 polynucleotide kinase (NEB) was used to generate isotope-labeled probes with [γ-^32^P] ATP. Radioactive probes were purified using a MicroSpin G-25 column (GE Healthcare) before being added into the hybridization buffer for RNA detection at 42°C, respectively, for at least 6 h. Wash solutions (2× or 0.5× saline sodium citrate [SSC], 0.1% sodium dodecyl sulfate [SDS]) were applied to remove nonspecific signals. Northern blot signals were captured using super-resolution BAS Storage Phosphor Screening (GE Healthcare) and visualized with a GE Amersham Typhoon system.
Bioinformatics
The BioCyc Database Collection (https://biocyc.org) was used to identify REPs in bacteria and archaea. Motif occurrences were scanned using FIMO (28). Circos plots were generated using the circlize (v0.4.15) (45) R packages, respectively. The freely available tool ShinyGO v0.66 (http://bioinformatics.sdstate.edu/go/) (46) was used for gene ontology enrichment analysis.
Quantification and statistical analysis
All statistical tests were performed using GraphPad Prism version 9.0.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Wilson KS, von Hippel PH. 1995. Transcription termination at intrinsic terminators: the role of the RNA hairpin. Proc Natl Acad Sci USA 92:8793–8797. doi:10.1073/pnas.92.19.87937568019 PMC 41053 · doi ↗ · pubmed ↗
- 2Higgins CF, Mc Laren RS, Newbury SF. 1988. Repetitive extragenic palindromic sequences, m RNA stability and gene expression: evolution by gene conversion? A review. Gene 72:3–14. doi:10.1016/0378-1119(88)90122-93072249 · doi ↗ · pubmed ↗
- 3Sorek R, Kunin V, Hugenholtz P. 2008. CRISPR — a widespread system that provides acquired resistance against phages in bacteria and archaea. Nat Rev Microbiol 6:181–186. doi:10.1038/nrmicro 179318157154 · doi ↗ · pubmed ↗
- 4Higgins CF, Ames GF, Barnes WM, Clement JM, Hofnung M. 1982. A novel intercistronic regulatory element of prokaryotic operons. Nature 298:760–762. doi:10.1038/298760 a 07110312 · doi ↗ · pubmed ↗
- 5Lupski JR, Weinstock GM. 1992. Short, interspersed repetitive DNA sequences in prokaryotic genomes. J Bacteriol 174:4525–4529. doi:10.1128/jb.174.14.4525-4529.19921624445 PMC 206247 · doi ↗ · pubmed ↗
- 6Qian Z, Macvanin M, Dimitriadis EK, He X, Zhurkin V, Adhya S. 2015. A new noncoding RNA arranges bacterial chromosome organization. M Bio 6:e 00998-15. doi:10.1128/m Bio.00998-1526307168 PMC 4550694 · doi ↗ · pubmed ↗
- 7Markelova Ny, Masulis IS, Ozoline ON. 2016. REP-elements of the Escherichia coli genome and transcription signals: positional and functional analysis. Math Biol Bioinf 11:t 1–t 14. doi:10.17537/2016.11.t 1 · doi ↗
- 8Stern MJ, Ames G-L, Smith NH, Robinson EC, Higgins CF. 1984. Repetitive extragenic palindromic sequences: a major component of the bacterial genome. Cell 37:1015–1026. doi:10.1016/0092-8674(84)90436-76378385 · doi ↗ · pubmed ↗
