Family C of Short Interspersed Elements in the Genomes of Lagomorphs: Structure, Evolution, Transcription and Transcript Polyadenylation
Ilia G. Ustyantsev, Sergei A. Kosushkin, Dmitri A. Kramerov, Danil V. Stasenko, Olga R. Borodulina

TL;DR
This study explores the C SINE family in lagomorphs, revealing their structure, evolutionary history, and role in transcription and polyadenylation.
Contribution
The study identifies and characterizes the C SINE family in lagomorphs, including their polyadenylation mechanism and evolutionary activity.
Findings
C SINEs are present in over a million copies in lagomorph genomes and have been active for at least 60 million years.
C1 subfamily retains functional AATAAA motifs and is involved in polyadenylation, while C2 is active in hares and rabbits but absent in pikas.
Transcription of C SINEs is activated at the 16-cell stage of rabbit embryos.
Abstract
Short Interspersed Elements (SINEs) are small pieces of DNA that can move around in the genetic material of living things. They can modulate genome function, e.g., induce hereditary diseases in humans and animals. In mammals, some SINEs have specific signals that help them make more copies of themselves. Researchers found a type of these genetic mobile elements called C SINEs in rabbits, hares, and pikas. More than one million copies of C SINE have been detected in the genomes of each of the five studied species of the order Lagomorpha. This SINE first appeared at least 60 million years ago. We discovered that C SINEs have different versions with unique features. By studying these features, we could see how these SINEs have been active in the evolution of these mammals. In particular, we found certain sequences in C SINEs that are important for making more copies of themselves.…
Funding: Russian Science Foundation
Taxonomy
Topics: Chromosomal and Genetic Variations · RNA Research and Splicing · Nuclear Structure and Function
1. Introduction
Short Interspersed Nuclear Elements (SINEs) are non-autonomous short (100–600 bp) retrotransposons that are transcribed by RNA polymerase III (pol III) [1,2,3]. SINEs are characteristic of the vast majority of multicellular organisms; a single family may reach 10^6^ copies per haploid genome. There may be one or more SINE families in a genome. Copies of a family share a common ancestor and exhibit clear nucleotide sequence identity, whereas members of a subfamily display even higher sequence similarity. Subfamilies arise through various mutations, including those that appear to enhance retrotransposition efficiency. New genomic copies are generated by reverse transcription, a process mediated by the reverse transcriptase encoded by partner Long Interspersed Nuclear Elements (LINEs) resident in the same genome. Although some SINEs originated from 7SL RNA or 5S rRNA, the majority are derived from tRNA genes; consequently, the 5′-terminal “head” of tRNA-derived SINEs retains homology to the corresponding tRNA. Both tRNA genes and their SINE derivatives contain an internal pol III promoter composed of 11-nt box A and box B motifs separated by 30–40 bp. The 3′-terminal “tail” of SINEs provides cis-acting signals recognized by the LINE-encoded retrotransposition machinery; in placental mammals, this machinery is predominantly the LINE-1 (L1) system, which requires a 3′-poly(A) tract on the SINE RNA.
SINEs may play an important role in the evolution of genes and genomes. Their insertions are scattered throughout genomes, including protein-coding loci. Although integration into exons or regulatory regions can disrupt gene function and cause disease [1,4,5], most intronic or intergenic insertions are selectively neutral or even beneficial. Fixed SINE copies frequently acquire functions in transcription [6,7,8], pre-mRNA splicing [1,9,10], or polyadenylation [11,12,13]. In contrast to RNA polymerase II (pol II) transcripts (mRNAs) containing SINEs, full-length pol III transcripts are short, equal to or slightly longer than the template sequence [14,15,16]. Such SINE transcripts have been implicated in stress responses [17,18,19] and in transcriptional regulation via interaction with nuclear hormone receptors [20].
We previously classified tRNA-derived SINEs according to the presence (T^+^ class) or absence (T^−^ class) of an AATAAA polyadenylation signal and an adjacent pol III terminator (TCT_≥3_ or T_≥4_) located upstream of the 3′-poly(A) tail [21]. All 12 identified T^+^ families are restricted to placental mammals [3,21]. Pol III transcripts of T^+^ families were proven to be polyadenylated in an AAUAAA-dependent manner [22,23]. Previously, only transcripts synthesized by RNA polymerase II, namely mRNAs and many non-coding RNAs, were thought to undergo such polyadenylation. (The mechanisms of mRNA 3′-end processing and subsequent polyadenylation have been thoroughly studied to date [24,25,26,27].) Experimental analyses of mouse B2, jerboa Dip, and bat Ves elements demonstrated that, in addition to AATAAA, two auxiliary motifs, β (immediately downstream of box B) and τ (upstream of AATAAA), enhance pol III transcript polyadenylation. The τ motif of B2 is recognized by the CFIm complex [28], whereas Dip and Ves utilize polypyrimidine-rich τ signals [23]; similar polypyrimidine motifs are characteristic of four other T^+^ SINE families. Both β and τ signals contribute similarly to the polyadenylation efficiency and function independently of the remaining SINE sequence except AATAAA [23,29]. The resulting long poly(A) tail (>20 nt) is essential for efficient L1-driven retrotransposition [30,31,32,33]. In T^−^ SINE families (e.g., primate Alu), reverse transcription of SINE RNA is primed at a poly(A) tail that is encoded in the genomic DNA rather than added by polyadenylation [1,34,35]; this mechanism can be referred to as T^−^ retrotransposition. In T^+^ SINEs, reverse transcription is instead primed at the poly(A) tail synthesized by polyadenylation; we refer to this mechanism as T^+^ retrotransposition [5,22,36]. Polyadenylation also markedly increases the half-life of T^+^ transcripts [22,29], explaining their elevated abundance in virus-infected cells [37].
A dispersed genomic repeat in rabbits, named SINE C, was discovered and initially studied by the Hardison group in the 1980s [11,38,39,40]. It is a highly repetitive and long (more than 300 bp) SINE containing, in particular, polypyrimidine motifs. It was classified as a T^+^ SINE on the basis of limited sequence data [21]. Transfection assays confirmed AATAAA-dependent polyadenylation of its pol III transcripts [23]. Later, analyzing the mobile elements in the complete rabbit genome, Yang et al. (2021) divided SINE C copies into two families: OcuSINEA and OcuSINEB [41]. The latter is more ancient and much more abundant; unlike OcuSINEA, it contains T^+^ SINE markers: AATAAA hexamers and transcription terminators (TCTTT) at the 3′-terminus.
Here we exploit high-quality genome assemblies of five lagomorph species, European (domestic) rabbit (Oryctolagus cuniculus), eastern cottontail (Sylvilagus floridanus), woolly hare (Lepus oiostolus), plateau pika (Ochotona curzoniae), and American pika (Ochotona princeps), to perform a comprehensive phylogenetic and functional dissection of the C SINE lineage. We identify evolutionarily distinct subfamilies, characterize their structural features, and exploit inter-specific presence/absence polymorphisms to infer recent retrotranspositional activity. We further assess C-element transcription during early rabbit development and use targeted mutagenesis to delineate cis-acting determinants of pol III transcript polyadenylation. This study extends our systematic investigation of T^+^ SINE retrotransposition mechanisms [5,23,36,42].
2. Materials and Methods
2.1. Genome Assemblies
Genomic data were downloaded from NCBI Genomes (https://www.ncbi.nlm.nih.gov/genome) (accessed on 30 June 2025). The following assemblies were used: European rabbit (Oryctolagus cuniculus) mOryCun1.1; Eastern cottontail rabbit (Sylvilagus floridanus) mSylFlo1.10; woolly hare (Lepus oiostolus) CXZ; plateau pika (Ochotona curzoniae) NIBS_Ocur_1.0; American pika (Ochotona princeps) mOchPri1.hap1.
2.2. C SINE Identification in the Genome Assemblies
To detect C SINE copies in the rabbit genome (mOryCun1.1), we used SSEARCH36 in an iterative pipeline (https://github.com/Toki-bio/sear2k/, accessed on 16 July 2025). The pipeline fragments the genome into overlapping segments, identifies hits meeting minimum length (≥90% of query) and nucleotide similarity (≥65%) thresholds, and then searches the remaining non-hit regions to recover additional divergent copies until no new hits are identified. The final hits are then merged using bedtools merge [43], extended by 50 bp of flanking sequences, and extracted as FASTA sequences [44]. The query sequences included published consensus sequences of OcuSINEA subfamilies [41] and a combined OcuSINEB consensus generated in this study.
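The merge-and-extend step at the end of this pipeline can be illustrated with a minimal Python sketch. It is not the sear2k code: the function name `merge_and_extend`, the 0-based half-open coordinates, the padding on both sides, and the optional `chrom_sizes` clipping are assumptions made for illustration, and the merging rule simply imitates the default behaviour of bedtools merge (overlapping and directly adjacent intervals are joined).

```python
from collections import defaultdict


def merge_and_extend(hits, flank=50, chrom_sizes=None):
    """Merge overlapping/adjacent hits per sequence, then pad each by `flank` bp.

    hits: iterable of (chrom, start, end), 0-based half-open (assumed convention).
    chrom_sizes: optional dict used to clip the padded end (hypothetical helper input).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in hits:
        by_chrom[chrom].append((start, end))

    result = []
    for chrom in sorted(by_chrom):
        merged = []
        for start, end in sorted(by_chrom[chrom]):
            if merged and start <= merged[-1][1]:
                # Overlaps or touches the previous interval: grow it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        limit = chrom_sizes.get(chrom) if chrom_sizes else None
        for start, end in merged:
            padded_start = max(0, start - flank)
            padded_end = end + flank if limit is None else min(limit, end + flank)
            result.append((chrom, padded_start, padded_end))
    return result


hits = [("chr1", 100, 400), ("chr1", 350, 600), ("chr1", 1000, 1300), ("chr2", 10, 200)]
loci = merge_and_extend(hits, flank=50, chrom_sizes={"chr2": 220})
```

Padded intervals from neighbouring loci may overlap after extension; the sketch leaves them as separate records.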
For non-rabbit lagomorphs, the initial dataset of extracted C SINE loci was separated into preliminary subfamilies. From these, 20,000 loci were randomly selected for detailed analysis using our SubFam script (https://github.com/Toki-bio/SubFam/, accessed on 25 July 2025). This script employs MAFFT [45] to arrange sequences by similarity and generates consensus sequences in batches of 50. These consensus sequences were then aligned against each other and known subfamilies to identify novel variants. The process was repeated iteratively, with newly identified variants used to re-extract and re-classify loci until no additional subfamilies were detected.
The subfamily assignment was performed using FaSort (https://github.com/Toki-bio/FaSort10, accessed on 25 July 2025), a software that employs a comparison approach to align all subfamily consensus sequences via ssearch36, selecting the best match based on bitscore. Due to the inherent variability in alignment scoring, each locus was evaluated ten times. Only loci with consistent assignments across all replicates were retained as high-confidence subfamily members; the remainder were classified as uncertain.
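The replicate-consistency rule described above can be sketched as follows. This is an illustrative reimplementation, not FaSort itself: the function name `classify_loci` and the input layout (a dictionary mapping each locus to the list of best-match subfamilies from repeated runs) are assumptions.

```python
def classify_loci(replicate_calls, n_required=10):
    """Split loci into high-confidence and uncertain sets.

    A locus is high-confidence only if it has `n_required` calls
    and every call names the same subfamily.
    """
    high_confidence = {}
    uncertain = []
    for locus, calls in replicate_calls.items():
        if len(calls) == n_required and len(set(calls)) == 1:
            high_confidence[locus] = calls[0]
        else:
            uncertain.append(locus)
    return high_confidence, uncertain


calls = {
    "locus_1": ["C1_a"] * 10,
    "locus_2": ["C2_b"] * 9 + ["C2_c"],
}
confident, unsure = classify_loci(calls)
```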
The presence or absence of SINE loci in genomes was determined by mapping ~200 bp flanking regions of each SINE-containing locus using BWA-MEM 0.7.19-r1273 [46]. The generated matches were then analyzed using various tools, including SeqKit v2.6.1 [47] and BEDtools v2.29.2 [43]. The presence or absence of SINEs was determined by a custom script (https://github.com/Toki-bio/SINE_orth_loc, accessed on 1 August 2025, [48]).
2.3. Mapping Expressed C SINE Copies
Raw RNA-seq reads from rabbit early developmental stages [49] were mapped to a rabbit reference genome using BWA MEM [46] with default parameters. Reads that overlap C SINE loci were extracted using samtools v1.22.1 [50]. For downstream analysis, we retained only reads that met the following criteria: the presence of a 5′-adapter sequence AAGCAGTGGTATCAACGCAGAGTACATGGG (allowing 1–2 repeats with up to 7 mismatches, identified via fuzzy search using agrep v.3.41 [51]) and a 30 bp match to the corresponding genomic C SINE locus (allowing up to 3 mismatches). Loci exhibiting > 10 reads in any given biological replicate were retained for analysis. The visualization of expression patterns was facilitated by R (https://www.R-project.org/, accessed on 2 September 2025).
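The read-filtering criteria can be approximated with a short Python sketch. Several simplifications are assumed here and are not taken from the original pipeline: mismatches are counted as Hamming distance (agrep also allows insertions and deletions), the mismatch limit is applied to each adapter copy separately, and the 30 bp genomic match is tested against the first 30 bases after the adapter. The function names are hypothetical.

```python
ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACATGGG"  # 5'-adapter sequence given in the text


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def strip_adapter(read, adapter=ADAPTER, max_copies=2, max_mismatches=7):
    """Return the read without 1-2 leading adapter copies, or None if no adapter is found."""
    n = len(adapter)
    pos = 0
    copies = 0
    while (copies < max_copies and len(read) - pos >= n
           and hamming(read[pos:pos + n], adapter) <= max_mismatches):
        pos += n
        copies += 1
    return read[pos:] if copies else None


def matches_locus(insert, locus_seq, length=30, max_mismatches=3):
    """True if the first `length` bases of the insert match the locus start (assumed layout)."""
    if insert is None or len(insert) < length or len(locus_seq) < length:
        return False
    return hamming(insert[:length], locus_seq[:length]) <= max_mismatches


locus = "CCTT" * 7 + "CC"                  # toy 30-nt locus start, not a real C SINE
read = ADAPTER + ADAPTER + locus + "ACGT"  # toy read with two adapter copies
keep = matches_locus(strip_adapter(read), locus)
```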
2.4. Characterization of Individually Expressed C SINE Loci
C SINE copies with ≥98% sequence similarity were clustered using MeShClust v1.2.0 (parameters: --id 0.98 --delta 100 --align). Read counts were normalized by library size and expressed as counts per million (CPM; scale factor 1,000,000). For each cluster, the following steps were taken. First, the summed normalized reads across replicates were calculated. Second, the top-expressed clusters across stages were identified. Third, stage-specific expression boxplots were generated in R. Clusters with more than three loci were used to build consensus sequences, while clusters with fewer than three loci were represented by their constituent sequences. The top 100 clusters by expression level were subjected to consensus sequence alignment to C1/C2 consensus sequences for subfamily classification and divergence analysis.
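The normalization and per-cluster summation can be written out as a small Python sketch. The function names and the dictionary-based input are illustrative assumptions; the authors performed this analysis with their own scripts and R.

```python
def cpm(counts, library_size, scale=1_000_000):
    """Convert raw read counts to counts per million for one library."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return {name: value * scale / library_size for name, value in counts.items()}


def summed_cluster_expression(per_replicate_counts, library_sizes):
    """Sum CPM values of each cluster across replicates."""
    totals = {}
    for counts, size in zip(per_replicate_counts, library_sizes):
        for cluster, value in cpm(counts, size).items():
            totals[cluster] = totals.get(cluster, 0.0) + value
    return totals


replicates = [{"cluster_1": 20, "cluster_2": 5}, {"cluster_1": 10}]
sizes = [2_000_000, 1_000_000]
totals = summed_cluster_expression(replicates, sizes)
```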
2.5. Plasmid Constructs
Construct C1-T (Figure S1) was created by cloning a PCR-amplified copy of C1 SINE (rabbit genome oryCun2 chr7:27788599-27788937) into the pGem-T plasmid [23]. For construct C1-C, both PASs were inactivated via T-to-C substitutions [23]. Deletions and nucleotide substitutions were introduced into the C1-T plasmid using the Phusion Site-Directed Mutagenesis Kit (Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s protocol. The plasmids designed for transfection were isolated using the Plasmid Midi Kit (Qiagen, Hilden, Germany) according to the manufacturer’s protocol.
2.6. Cell Transfection and Northern Blot Analysis
HeLa cells (ATCC, CCL-2) were grown to an 80%-confluent monolayer in 60 mm Petri dishes using DMEM with 10% fetal bovine serum. Cells on one plate were transfected with 5 μg of plasmid DNA mixed with 10 μL of TurboFect reagent (Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer’s protocol. The cellular RNA was isolated 20 h after transfection using the guanidine-thiocyanate method [52] and further purified by RNase-free DNase I treatment. RNA samples (10 μg) obtained after each transfection were separated by electrophoresis in 6% polyacrylamide gel with 7 M urea. RNA was then transferred from the gel onto a nylon membrane (GVS, Bologna, Italy) by semidry electroblotting at 3 V for 2.5 h. Four 28-nt deoxyoligonucleotides complementary to the 5′-part of the studied C1 copy were utilized as probes for Northern blot hybridization (Figure S1). The combined use of several 5′-end-labeled oligonucleotides resulted in increased hybridization signal intensity. The oligonucleotide mixture (4 pmol each) was labeled using γ[^32^P]ATP (25 μCi) and T4 polynucleotide kinase (10 U) by incubating the reaction mixture (50 μL) at 37 °C for 40 min. The reaction was terminated by the addition of EDTA to 25 mM and subsequent incubation at 70 °C for 10 min. The mixture was then diluted with 2 M ammonium acetate (350 μL), yeast tRNA was added as a carrier, and the oligonucleotides were precipitated with ethanol. The blots were hybridized with the labeled probe in 5 × SSC, 0.1% polyvinylpyrrolidone, 0.1% Ficoll, 0.5% SDS, and 0.1 mg/mL denatured salmon sperm DNA at 50 °C for 16–18 h. Washes were performed in 0.1 × SSC and 0.1% SDS at 37 °C for 1 h. Hybridization signals were quantified by scanning the membranes in a phosphorimager (Image Analyzer Typhoon FLA 9000; GE Healthcare Bio-sciences, Uppsala, Sweden).
3. Results
3.1. Analysis of SINEs C in Lagomorph Genomes
The order Lagomorpha consists of two families: Leporidae (hares and rabbits) and Ochotonidae (pikas). We conducted a computer search for copies of C SINEs in the complete genomes of three leporids: European rabbit (Oryctolagus cuniculus), Eastern cottontail (Sylvilagus floridanus), and woolly hare (Lepus oiostolus), as well as of plateau pika (Ochotona curzoniae) and American pika (Ochotona princeps). In the genomes of O. cuniculus, S. floridanus, L. oiostolus, Oc. curzoniae, and Oc. princeps, 1,652,561, 1,544,495, 1,592,249, 1,124,544, and 1,269,333 copies of SINEs C were found, respectively. Thus, C is probably the most abundant SINE, outnumbering Alu, which has 1.1 million copies in the human genome [1] (Table 1).
Analysis of nucleotide sequences from the rabbit (O. cuniculus) genome showed that they belong to one of two distinct SINE subfamilies, which we designated as C1 and C2. These subfamilies correspond to OcuSINEB and OcuSINEA, identified by Yang et al. [41]. (We did not follow the nomenclature of these authors, as we consider it necessary to use the original name, C SINE, as a basis. Moreover, we will use the names of these two subfamilies for SINEs from genomes of species other than O. cuniculus, so the prefix ‘Ocu’ becomes inappropriate.) The consensus sequences of C1 and C2 can be aligned with each other (Figure 1), demonstrating extra regions in C2 (two 13 bp long and one 7 bp long). The first of these 13 bp regions is part of a repeat that likely emerged by the duplication of a 19 bp sequence containing the B box of the pol III promoter. This resulted in the formation of an additional box B in C2; the left box is more likely functional, as it better matches the B box consensus sequence (Figure 1). In its 3′-terminal part, C1 contains three tandem AAATs forming potential AATAAA polyadenylation signals (PASs), as well as the pol III transcription terminator TCTTT (Figure 1). C2 does not have such sequences; thus, C1 but not C2 can be classified as a T^+^ SINE.
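How three tandem AAAT units give rise to AATAAA hexamers can be checked with a few lines of Python. The helper name `find_overlapping` is introduced here only for illustration; it reports every start position of a motif, including overlapping occurrences.

```python
def find_overlapping(seq, motif):
    """Return 0-based start positions of all (possibly overlapping) motif occurrences."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


tandem = "AAAT" * 3                            # AAATAAATAAAT
pas_positions = find_overlapping(tandem, "AATAAA")
print(pas_positions)                           # [1, 5]
```

The three tandem units therefore contain two AATAAA hexamers, each spanning the junction between adjacent units.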
The low level of sequence similarity between C1 copies suggests that this SINE subfamily is very ancient; moreover, it exceeds the C2 subfamily by more than five times in terms of the number of copies in the rabbit genome (Table 1). Apparently, the C2 subfamily arose during evolution from certain C1 copies. We divided C1 copies into two variants, C1_a and C1_b (Figure 2), the former being significantly more ancient and abundant (Table 1). A very similar pattern was observed with C1 copies in the genomes of two other leporids, L. oiostolus and S. floridanus (Table 1). The genome of the plateau pika (Oc. curzoniae) also contains a large number of C1 copies; we divided them into C1_Pa, C1_Pb, and C1_Pc variants, where the P prefix indicates their origin from the pika genome. Judging by the divergence of the copies, the C1_Pa variant is the most ancient, and C1_Pc is the youngest of the three. The alignment in Figure 2 shows that the consensus sequences of each of the three C1 variants of the pika are very similar to C1_a and C1_b of the rabbit, although with certain specific features. These results indicate that C1 SINE originated a long time ago, before the division of lagomorphs into the Leporidae and Ochotonidae families, which occurred about 58 million years ago (Mya) [53]. In the genome of the American pika (Oc. princeps), whose lineage diverged from that of Oc. curzoniae 13 Mya, numerous C1_Pa and C1_Pb copies were found (Table 1), but C1_Pc copies were absent. Apparently, the C1_Pc variant arose after the divergence of the two pikas. It should also be noted that copies of C1_Pb showed greater average sequence similarity in Oc. princeps than in Oc. curzoniae (Table 1). This indicates retrotranspositional activity of C1_Pb in Oc. princeps in a later evolutionary period.
We divided the C2 copies of the rabbit O. cuniculus into four variants (C2_a, C2_b, C2_c, and C2_d), listed in order of increasing average sequence similarity of their copies (Table 1). These variants correspond to the previously described [41] subfamilies OcuSINEA4, OcuSINEA3, OcuSINEA2, and OcuSINEA1, respectively. The division of C2 into variants is based mainly on the presence or absence of two characteristic 7- and 14-bp regions; however, particular single-nucleotide substitutions and indels also contribute to their identification (Figure S2). Analysis of C2 copies in the genomes of L. oiostolus and S. floridanus demonstrated the listed C2 variants in these leporids with a minor exception. The C2_a and C2_b variants can be clearly identified in these genomes (Figure S2), and the number of their copies is similar in the three leporids studied (Table 1). On the other hand, the C2_d variant is present in the S. floridanus genome but absent in L. oiostolus; apparently, C2_d was not amplified in the L. oiostolus lineage. Finally, C2_c in the genomes of L. oiostolus and S. floridanus is represented by the C2_c′ variant, which has noticeable differences from C2_c in the rabbit O. cuniculus; in particular, C2_c′ lacks the 7 bp sequence characteristic of C2_c (Figure S2). Thus, several slightly different variants of C2 SINE could have been amplified in the genomes of different leporid lineages.
3.2. Phylogenetic Affinity of C SINE Sequences to tRNAs
In 1985, Sakamoto and Okada [54], relying on the consensus sequence of only three rabbit C SINE copies [38], postulated a glycine tRNA origin for this element. We have now re-evaluated the evolutionary relationship between C SINE and specific tRNA species by using updated consensus sequences derived from multiple subfamilies of rabbit C, as well as from the plateau pika Oc. curzoniae, constructed from a substantially larger copy number. The goal was to determine which tRNA species could have given rise to the SINE C family. Consensus sequences of the rabbit C1 and C2 subfamilies and of the pika C1_Pa, C1_Pb and C1_Pc variants were aligned against the complete set of human tRNA sequences. (Using human rather than rabbit tRNAs for this purpose is perfectly acceptable, since tRNA sequences are highly conserved across placental mammals). The highest similarity scores between the head regions of C SINEs and individual tRNAs ranged from 62 to 67% (Figure S3). Although these values fall slightly below the 70% threshold previously employed in SINEBase curation [3], they nevertheless permit tentative conclusions. The consensus sequences of rabbit C1 and pika C1_P(a–c) exhibited the greatest resemblance to leucine tRNA (Figure S3), suggesting that these elements most probably originated from this tRNA species. Conversely, the rabbit C2 consensus displayed the highest similarity to glycine tRNA (Figure S3).
As noted earlier, the emergence of the C2 subfamily from C1 was accompanied by a duplication of the B box-containing segment, reducing the A–B distance from 42 nt in C1 to 33 nt in C2 (Figure 1). In leucine tRNAs this distance is 42–44 nt, whereas in glycine tRNAs it is only 31 nt; the difference is attributable to an extra loop present in leucine tRNA. Consequently, the leucine tRNA sequence aligns gaplessly with C1, whereas the B box shift caused by its duplication in C2 makes the 5′ region of this subfamily more compatible with glycine tRNA. Fixed random nucleotide substitutions may have further contributed to this shift, ultimately facilitating the rise of the successful C2 subfamily in leporids.
3.3. Analysis of Transcription Terminators
Transcription by RNA polymerase III terminates at T residues within T_≥4_ or TCT_≥3_ sequences, with longer terminators exhibiting greater strength and efficiency [16,36,55]. Consequently, such transcription usually results in truncation of the terminator sequence; thus, daughter copies of T^+^ SINEs are expected to possess less efficient or even non-functional, rudimentary terminators (T_≤3_ and TCT_≤2_). Our previous studies of the terminators in various SINE families (B2, Dip, Ves, and Can) suggested that the moderately efficient terminator TCTTT is relatively resistant to further truncation [5,36]. It was also found that old, long-established SINE copies in the genome can undergo elongation, and thus enhancement of their terminators, although the underlying mechanism remains unknown.
We analyzed C1 copies of the European rabbit genome to determine whether a similar relationship exists between the age of copies and the incidence of functional transcription terminators. It is known that the young copies of L1-mobilized SINEs have long poly(A) tails, whereas older copies exhibit tail shortening due to extensive deletions and nucleotide substitutions [1,30,35,36]. Poly(A) tail length was therefore used as a proxy for the relative age of C1 copies in the rabbit genome; this approach was previously applied to other T^+^ SINEs [5,36,42]. Samples of C1 copies with tails A_>20_, A_11–20_, and A_5–10_ corresponding to relatively young, intermediate, and old copies, respectively, were analyzed. The proportion of C1 copies carrying the moderately efficient terminator TCTTT increased from 12% in young copies to 34% in old ones (Figure 3A). Highly efficient terminators (TCT_>3_) were absent in relatively young C1 copies but were detected in 4% and 25% of intermediate and old copies, respectively (Figure 3A). Thus, the majority (59%) of old copies possessed functional rather than rudimentary transcription terminators.
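The two classifications used in this analysis, relative age by poly(A) tail length and terminator strength by the length of the T run after TC, can be sketched in Python. The function names and the input (the sequence immediately upstream of the poly(A) tail) are assumptions for illustration, and the sketch considers only TCT_n-type terminators, ignoring stand-alone T_≥4 runs.

```python
import re


def tail_age_class(tail_length):
    """Map poly(A) tail length to the relative age classes used in the text."""
    if tail_length > 20:
        return "young"
    if 11 <= tail_length <= 20:
        return "intermediate"
    if 5 <= tail_length <= 10:
        return "old"
    return None  # tails shorter than 5 nt were not part of the analyzed samples


def terminator_class(region):
    """Classify the strongest TCT_n terminator found in `region`."""
    # The lookahead lets overlapping candidates such as TCTCTTT be examined.
    runs = [len(m.group(1)) for m in re.finditer(r"(?=TC(T+))", region)]
    longest = max(runs, default=0)
    if longest > 3:
        return "highly efficient"       # TCT_>3
    if longest == 3:
        return "moderately efficient"   # TCTTT
    return "rudimentary"                # TCT_<=2 or no TCT run at all


example = ("AATAAATCTTT", 8)  # toy 3'-region and tail length, not a real locus
label = (tail_age_class(example[1]), terminator_class(example[0]))
```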
A similar analysis was performed for C1_Pc variant copies from the pika Oc. curzoniae genome. Based on the average pairwise identity of C1_Pc copies, this is the youngest C1 variant in this genome (Table 1). As observed for rabbit C1, the frequency of highly efficient terminators (TCT_>3_) in C1_Pc copies (Figure 3B) increased dramatically among old copies with A_5–10_ tails (17%) compared to those with longer (A_>20_) tails (0.6%).
This result is consistent with our previous observations of T^+^ SINEs in mammalian genomes of other orders [5,36]. On the other hand, in all three C1_Pc samples, copies with the moderately efficient terminator TCTTT predominated (64–78%), whereas in rabbit C1 samples with A_11–20_ and A_>20_ tails, rudimentary terminators (TCTT) prevailed (58–64%) (Figure 3). This difference is most likely due to the C1_Pc variant being significantly younger than the rabbit C1 subfamily analyzed. The latter has undergone many more retrotransposition cycles, leading to the truncation of many terminators. The high frequency of the TCTTT terminator among C1_Pc copies is consistent with our earlier observation of the relative resistance of the TCTTT sequence to truncation after pol III transcription [16,36].
3.4. Retrotranspositional Activity of C SINE
To estimate the historical retrotranspositional activity of C SINE in leporid evolution, we searched for C2 copies specific to the genomes of O. cuniculus, S. floridanus, and L. oiostolus. By pairwise comparison of these genomes, we identified insertions present in one species but absent from the orthologous loci in another species. The number of such lineage-specific copies ranged from 45 to 67 thousand per genome (Table 2), indicating substantial ongoing retrotranspositional activity of C2 SINE. The evolutionary lineages of these three species diverged 13–16 Mya [53,56]. Thus, the average integration rate of new copies subsequently fixed in leporid genomes can be estimated at 2.8–4.1 × 10^3^ copies/million years (My) (Table 2).
An analogous analysis was performed for the youngest variants of C1 SINE in two pika species. In Oc. curzoniae this variant is C1_Pc, whereas in Oc. princeps, which lacks C1_Pc, the youngest variant is C1_Pb (Table 1). We detected 29,740 and 34,714 lineage-specific C1 copies in the genomes of Oc. curzoniae and Oc. princeps, respectively (Table 2). Given that these species diverged 13.2 Mya, the emergence rate of C1 copies can be estimated as 2.2–2.6 × 10^3^ copies/My (Table 2).
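The rate estimate for the pikas is a simple division of the lineage-specific copy number by the divergence time, which can be reproduced as follows (the function name is illustrative):

```python
def fixation_rate(lineage_specific_copies, divergence_my):
    """Average number of fixed new copies per million years."""
    if divergence_my <= 0:
        raise ValueError("divergence time must be positive")
    return lineage_specific_copies / divergence_my


rate_curzoniae = fixation_rate(29_740, 13.2)  # about 2.25e3 copies/My
rate_princeps = fixation_rate(34_714, 13.2)   # about 2.63e3 copies/My
```

Both values are in line with the range of 2.2–2.6 × 10^3^ copies/My given above.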
3.5. Expression Analysis of C SINE in Early Rabbit Embryogenesis
A comprehensive study by Oomen et al. [49] recently examined the transcriptional activity of various transposable elements at the earliest stages of embryogenesis in mouse, pig, cow, rabbit, and rhesus macaque. The authors aimed to determine whether transcription of different TEs occurs during, before, or after embryonic genome activation (EGA)—the major developmental milestone that renders the embryo independent of maternal control. Regrettably, the study overlooked the rabbit C SINE transcription. Consequently, the SINE family represented by the largest number of copies in the rabbit genome was not analyzed, thus hampering cross-species comparison of SINE transcription dynamics.
To quantify RNA transcribed from C SINE (C1 + C2) by pol III, we re-analyzed the primary transcriptome-sequencing data for early rabbit embryos deposited by Oomen et al. [49] in GEO. The library preparation and sequencing protocols used distinguish pol III-initiated transcripts (starting within a SINE) from long pol II transcripts in which the SINE resides internally. Figure 4 shows the number of reads per million (RPM) mapping to C pol III transcripts in oocytes, zygotes, and five early embryonic stages. Low-level transcription is detectable from the 16-cell stage onwards, reaching substantial levels at the morula stage. Thus, C SINE transcription by pol III begins after embryonic genome activation (EGA), which in the rabbit occurs at the 4- and 8-cell stages. The small number of C SINE transcripts observed at earlier stages most likely reflects maternal deposition during oocyte maturation rather than de novo transcription.
We next examined the relative transcription of C1 and C2 subfamilies. At all stages, C2 transcripts predominated: 59% in zygotes, increasing to 76% at the 16-cell stage, and reaching 94% at the morula (Figure 4B). This shift probably reflects the gradual decay of C1 transcripts and preferential activation of C2 transcription after EGA.
Then, individual C SINE copies were mapped back to the rabbit genome using the same read sets. Because C2 copies are highly similar, it was impossible to unambiguously assign C2 reads to specific loci. The mapped loci clustered into two groups by their nucleotide sequences (827 and 535 copies) with very similar expression patterns (Figure S4A), matching the overall profile shown in Figure 4.
In contrast, older C1 copies carry more mutations, allowing unambiguous mapping. Figure S4 illustrates expression patterns of representative individual C1 loci:
- some are activated at the 16-cell and morula stages (Figure S4B);
- others are active only at the 16-cell stage (Figure S4C);
- surprisingly, several loci show high transcript levels in oocytes and zygotes that drop to near zero by the 16-cell/morula stages (Figure S4D) or remain constant (Figure S4E).
High oocyte levels most likely represent maternal storage, with subsequent degradation. Thus, individual C1 copies can exhibit expression patterns that deviate markedly from the dominant post-EGA activation scenario.
3.6. Mapping of Cis-Elements Required for C1 SINE Transcript Polyadenylation
The nucleotide sequences within C SINE that are required for the efficient polyadenylation of transcripts were identified experimentally for one copy of C1 in the rabbit genome. The plasmid carrying this copy (Figure S1) [23] was used to generate a series of deletion and substitution constructs (Figure 5A).
HeLa cells were transfected with the resulting constructs, and total cellular RNA was analyzed by Northern blotting with a radioactively labeled probe complementary to the first 113 nucleotides of the C1 transcript (Figure S1). Polyadenylated transcripts were detected as heterogeneous bands migrating above the primary transcripts of the C1 constructs (Figure S5). The effects of deletions and substitutions on polyadenylation efficiency are summarized in Figure 5B relative to the parental C1-T construct; the C1-C construct, in which both canonical PASs were inactivated by a T→C transition, served as a negative control.
A 9-nt deletion (Δ9) located 9 nt upstream of the first PAS did not affect transcript polyadenylation, whereas a longer deletion starting at the same position (Δ42) reduced the relative polyadenylation level to 45% (Figure 5). Further extensions of this deletion (Δ83, Δ117, Δ155, Δ184 and Δ216) did not cause an additional decrease, indicating that the polypyrimidine-rich motif within the Δ42 deletion is critical for transcript polyadenylation. We propose that this motif functions as the τ signal. Notably, the Δ42 deletion also removes the TGTA tetramer (highlighted in green in Figure 5A), which constitutes the core of the τ signal in B2 [28]. We therefore hypothesized that TGTA might perform an analogous role in C1. However, mutational analysis yielded contradictory results: substitution of the tetramer together with its two flanking nucleotides (the subCTGTAA construct) decreased relative polyadenylation to 88%, whereas complete removal of the same segment (Δ9) had no effect (Figure 5). We therefore conclude that TGTA is not an essential component of the τ signal in C1.
The C1 sequence contains the CACCCATGT element (highlighted in blue in Figure 5A), which closely resembles the β signal of B2 [28]. Nevertheless, targeted deletions of this region (Δ11 and Δ22) had no adverse effect on polyadenylation (Figure 5B), indicating that it does not function as a β signal in C1. This is probably due to its distance (79 nt) from the B box; β signals are much closer to the B box in B2 and Ves SINEs (5 and 4 nt, respectively). We therefore examined the region adjacent to the B box in C1. The Δ32 deletion slightly reduced the polyadenylation efficiency (Figure 5), but four additional 8-nt deletions within the same region showed no significant effect (Figure S6). Thus, we were unable to identify a β signal in C1. It is likely that this SINE lacks a β signal, and the residual polyadenylation observed for the maximal-deletion construct (Δ216) may be mediated by an alternative mechanism, such as the secondary structure of the remaining RNA fragment.
4. Discussion
C SINE of the European (domestic) rabbit genome was discovered in the last century and was studied with the methods of that era on the basis of a very small number of copies [11,38,39,40]. Much later, Yang and co-authors analyzed the complete genome and concluded that the rabbit possesses two SINE families, which they designated OcuSINEA and OcuSINEB [41]. In the present work, we also carried out a genome-wide analysis of C SINE copies in the rabbit and conclude that this SINE forms a single family that is subdivided into two subfamilies, C1 and C2, corresponding to OcuSINEB and OcuSINEA, respectively. Our inference of a single family C was based on the fact that the consensus sequences of C1 and C2 can be readily aligned with each other (Figure 1), clearly indicating the relatedness of C1 and C2. (Yang and co-authors [41] did not report the alignment of OcuSINEA and OcuSINEB consensus sequences.) Subfamily C1 is significantly older and more abundant than C2. C1 sequences possess structural features characteristic of T^+^ SINEs (presence of PASs and transcription terminators), whereas C2 copies lack these features, indicating that they belong to T^−^ SINEs.
We compared various tRNAs with the 5′-terminal regions of subfamilies C1 and C2. The greatest similarity to C1 was shown by leucine tRNA, while in the case of C2 it was glycine tRNA. It appears that the C family, including its large and ancient subfamily C1, originated from leucine tRNA. The considerably younger subfamily C2, which clearly arose from C1, also most likely descends from this tRNA, despite the greater similarity of its 5′ region to glycine tRNA. The similarity to glycine tRNA most probably arose as a result of a 19-nucleotide duplication that included the pol III promoter B box. The emergence of a second B-box-like sequence, and crucially, the closer proximity of the B-box to the A-box (Figure 1), may have contributed to the transcriptional and retrotranscriptional activity of the C2 subfamily.
For the first time, C SINEs were analyzed in lagomorph genomes other than that of the European rabbit (Oryctolagus cuniculus). Two of these species belong to the Leporidae family (Eastern cottontail Sylvilagus floridanus and woolly hare Lepus oiostolus) and two to the Ochotonidae family (plateau pika Ochotona curzoniae and American pika Ochotona princeps). The C family in the genomes of S. floridanus and L. oiostolus was subdivided into subfamilies C1 and C2, similar in copy number and average sequence identity to those observed in O. cuniculus (Table 1). The C2 subfamily in O. cuniculus can be divided into four variants (a, b, c, and d); the same is true for C2 from the genome of S. floridanus, whereas C2_d is absent from the genome of L. oiostolus. Moreover, the “c” variant in S. floridanus and L. oiostolus differs noticeably from C2_c of O. cuniculus, so it was designated C2_c′ in these two leporids (Table 1, Figure S2). The data obtained allow us to conclude that the C2 subfamily arose before the divergence of the lineages of O. cuniculus, S. floridanus, and L. oiostolus (more than 16 Mya). The same applies to the C2_a and C2_b variants, whereas C2_d arose later and therefore has proliferated only in the genomes of O. cuniculus and S. floridanus.
The genomes of pikas contain the C1 subfamily but lack C2. This clearly indicates that C1 arose before the split of Lagomorpha into Ochotonidae and Leporidae, which occurred about 58 Mya [53]. In contrast, C2 arose in the genome of the leporid ancestor no later than 16 Mya. The C1 subfamily in pikas differs slightly from that of the rabbit, so it was designated C1_P (Figure 2). In Oc. curzoniae it could be divided into three variants, C1_Pa, C1_Pb and C1_Pc, whereas only two of them were found in Oc. princeps: C1_Pa and C1_Pb (Table 1). C1_Pa is the oldest, whereas C1_Pc is the youngest variant of the C1 subfamily in Oc. curzoniae. The C1_Pc variant appears to have originated following the divergence of the two pikas. Notably, C1_Pb copies exhibit higher average sequence similarity in Oc. princeps than in Oc. curzoniae, pointing to more recent retrotranspositional activity of this element in the Oc. princeps lineage.
Similar to C1 of rabbits and hares, C1_P of pikas belongs to T^+^ SINEs. We analyzed the structure of transcription terminators in C1 copies of the rabbit and C1_Pc of Oc. curzoniae with long, intermediate, and short poly(A) tails. According to a number of studies, the length of poly(A) tails of L1-mobilized SINEs in mammals is inversely related to the age of their copies, since poly(A) tails shorten over time [1,30,36]. An increase in the proportion of copies with highly efficient transcription terminators (TCT_>3_) among copies with shorter poly(A) tails was observed for both C1 of the rabbit and C1_Pc of Oc. curzoniae (Figure 3). This is consistent with our previous observations on B2, Dip, Ves, and Can SINEs [5,36] and suggests that terminators or their rudiments in old SINE copies often elongate, although the mechanism of this process remains unclear. As a result of such elongation, full-fledged terminators are restored, which is necessary for the termination of transcription followed by polyadenylation of the resulting RNA.
The C2 subfamily is clearly much younger than C1 and could be expected to be retrotranspositionally active in late leporid evolution. In the present work, pairwise genome comparisons of O. cuniculus, S. floridanus and L. oiostolus were conducted to search for C2 copies that are present in one species but absent from the orthologous loci of another. The identification of such copies indicates that the integration event occurred after the evolutionary divergence of the two species. Tens of thousands of such C2 copies were found (Table 2); taking into account the divergence time of these leporid species (13.4–16.4 Mya), this allowed us to estimate the average emergence rate of new copies in their genomes at 2.8 × 10^3^–4.1 × 10^3^ copies/My. Previously, Yang and co-authors [41] demonstrated insertion variation for certain OcuSINEA (C2) loci between rabbit breeds; these data indicated the ongoing retrotranspositional activity of these SINEs.
We carried out a similar analysis for relatively young C1 variants in pikas: C1_Pc in Oc. curzoniae and C1_Pb in Oc. princeps. It turned out that 30–34 thousand new C1 copies had arisen in the genomes of each of these pika species after their evolutionary divergence, which occurred 13.3 Mya (Table 2). Thus, the average emergence rate of new C1 copies in pikas can be estimated at 2.2–2.6 × 10^3^ copies/My. The emergence rates of new C2 copies in leporids and C1 copies in pikas were close to those of Can SINE (2.6–9.2 × 10^3^ copies/My) according to our previous estimate for Caniformia genera [5]. Similar values were also obtained for Ere SINE (1.9–2.8 × 10^3^ copies/My) in Equus genomes (horses, donkeys, and zebras) [42]. Apparently, the rate of emergence of new copies, which depends on the rate of SINE integration into the genome and the efficiency of their fixation in populations, is similar for different SINEs in different mammalian orders.
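The rate estimate above is a simple division of the number of lineage-specific copies by the divergence time. The minimal sketch below reproduces that arithmetic for the pika figures quoted in the text; the function name `emergence_rate` is our own illustrative choice, and because the inputs are the rounded values from the paragraph rather than the exact counts of Table 2, the result only approximates the reported range.

```python
def emergence_rate(new_copies: float, divergence_my: float) -> float:
    """Average number of new SINE copies per million years (copies/My)."""
    if divergence_my <= 0:
        raise ValueError("divergence time must be positive")
    return new_copies / divergence_my


# Rounded values quoted in the text for pikas:
# 30-34 thousand lineage-specific C1 copies, divergence 13.3 Mya.
DIVERGENCE_MY = 13.3
low = emergence_rate(30_000, DIVERGENCE_MY)
high = emergence_rate(34_000, DIVERGENCE_MY)

# Express the result in units of 10^3 copies/My, as in the text.
print(f"{low / 1e3:.1f}-{high / 1e3:.1f} x 10^3 copies/My")
```

The same function applies to the leporid comparison if the per-species copy counts from Table 2 and the 13.4–16.4 Mya divergence interval are supplied.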
The emergence of C2, a subfamily significantly younger than C1, in leporid genomes represents a vivid example of an evolutionary trend from T^+^ SINEs to T^−^ SINEs. Less pronounced examples of a similar evolutionary trend are Dip SINE in jerboa [36] and Can SINE in Caniformia [5]. In contrast, T^+^ SINE subfamilies follow T^−^ ones in the genomes of armadillos [57], Muridae and Cricetidae rodents [36], as well as horses and rhinoceroses [42]. Such competitive success of SINE subfamilies in evolution can stem from an “arms race” between mobile elements and the host. The host may develop new mechanisms for suppressing retrotransposition, while SINEs may change their structure to escape this control. Transitions from T^+^ to T^−^ retrotransposition mechanisms, or vice versa, may be one way for SINE subfamilies to continue their amplification in the genome.
Using the data of Oomen et al. [49] on transcriptome sequencing of rabbit oocytes, zygotes and very early embryos, we assessed the level of C SINE transcripts synthesized by pol III. It turned out that the level of such transcripts becomes significant only in 16-cell embryos and, especially, in morulae, i.e., transcription of C SINEs is switched on shortly after EGA, the moment of the primary transcriptional activation of the genome in embryos (in rabbits, EGA occurs at the 4- and 8-cell stages). The dynamics of activation of C SINE transcription in embryogenesis turned out to be most similar to that of tRNA-related SINEs of the pig [49].
The number of C2 transcripts was only 1.4 times that of C1 transcripts in zygotes, while C2 transcripts predominated by 3.2 and 15.6 times in 16-cell embryos and morulae, respectively. This can be attributed to a significantly stronger activation of C2 transcription compared with C1 and/or to a degradation of C1 transcripts during the zygote to morula stages. This was confirmed by the analysis of transcripts of individual C1 copies: the transcription of certain C1 copies was activated at the 16-cell and morula stages (Figure S4B), and, conversely, the transcription level of many other previously active C1 copies dropped to zero at the same two stages (Figure S4D), which can be attributed to the degradation of these RNAs. Although analysis of individual C2 copies is complicated by their high similarity, the data on two large groups of C2 copies indicate their transcription activation at the 16-cell and morula stages (Figure S4A). Possibly, the much more pronounced activation of C2 transcription is related to the youth of this subfamily and, therefore, better preservation of A and B boxes of the pol III promoter in its copies. However, the close levels of C1 and C2 transcripts in zygotes indicate similar transcription levels of copies of both subfamilies during gametogenesis. B2 transcripts synthesized by pol III have been detected in testes and brain, as well as in tumor-derived cells [16]. It would be interesting to compare the levels of C1 and C2 transcripts generated by pol III in the same organs and in similar tumor cells of the rabbit; however, transcriptome sequencing data that would allow differentiation of pol II- and pol III-generated transcripts are currently unavailable.
The regions of this SINE that are critical for polyadenylation of its transcripts were identified by introducing deletions or substitutions into a C1 copy cloned from the rabbit genome and transfecting the resulting constructs into HeLa cells. In addition to PAS sequences, which are crucial for polyadenylation, a 35-nucleotide polypyrimidine motif located 16 nt upstream of the first PAS proved important in this regard. We previously demonstrated a significant contribution of similar polypyrimidine motifs to the polyadenylation of pol III transcripts for Ves, Dip, and Can SINEs [5,23]. The nucleotide sequence of polypyrimidine motifs that function as the τ signal can vary greatly. It is known that several different proteins can bind to polypyrimidine RNA sequences; however, our knockdown experiments have so far failed to demonstrate the involvement of these proteins in the polyadenylation of Ves transcripts [28].
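As an illustration of the two sequence features discussed above, the following sketch scans a DNA string for canonical PAS hexamers (AATAAA) and for uninterrupted pyrimidine runs. It is not the procedure used in this study: the helper names, the 10-nt length threshold, and the toy sequence are all hypothetical, and real τ signals need not be uninterrupted C/T runs.

```python
import re

PAS = "AATAAA"  # canonical polyadenylation signal named in the text


def find_pas(seq: str) -> list[int]:
    """0-based start positions of every AATAAA hexamer (overlaps allowed)."""
    return [m.start() for m in re.finditer(f"(?={PAS})", seq.upper())]


def pyrimidine_tracts(seq: str, min_len: int = 10) -> list[tuple[int, int]]:
    """(start, end) spans of uninterrupted C/T runs of at least min_len nt.

    min_len is an arbitrary illustrative threshold, not a value from the paper.
    """
    return [m.span() for m in re.finditer(rf"[CT]{{{min_len},}}", seq.upper())]


# Hypothetical toy sequence, NOT a real C1 copy.
toy = "GATTACA" + "CTTCCTCTTTCCTCT" + "GAGGAGCAGGAGCAGG" + "AATAAA" + "GCAATAAAGTTTT"

pas_sites = find_pas(toy)
tracts = pyrimidine_tracts(toy)
for start, end in tracts:
    # Distance from the end of the tract to the first downstream PAS, if any.
    downstream = [p for p in pas_sites if p >= end]
    gap = downstream[0] - end if downstream else None
    print(f"tract {start}-{end} (len {end - start}), gap to first PAS: {gap}")
```

A scan of this kind can only nominate candidate motifs; whether a given tract actually contributes to polyadenylation still has to be tested with deletion or substitution constructs, as was done for C1.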
The C1 sequence contains the CACCCATGT element, which resembles the β signal (ACCACATGG) of B2 [28], so we assumed it to be a β signal. However, deletion of this sequence did not affect the polyadenylation of C1 transcripts. This is probably due to its considerable distance (79 nt) from the B box of the pol III promoter (β signals in B2 and Ves are 5 and 4 nt from the B box, respectively). Deletions in C1 affecting sequences located immediately downstream of the B box also did not substantially reduce the polyadenylation level. Thus, we were unable to find a β signal in C1 or to completely deprive C1 transcripts of the ability to be polyadenylated by deletions. Similarly to C1, Can SINE from the dog genome uses a polypyrimidine motif as a τ signal, and its deletion reduced polyadenylation to 40%, whereas deletion of other regions did not affect polyadenylation and failed to reveal a β signal [5]. Interestingly, Can and C1 proved to be the shortest (178 bp) and the longest (333 bp) studied T^+^ SINEs, respectively. It remains unclear how the transcripts of these two SINEs, unlike others, retain certain polyadenylation competence even after being deprived of all sequences downstream of the B box (except PAS and terminator). The 5′ (tRNA-related) parts of these SINEs may possess as yet unknown structural features that promote polyadenylation of their transcripts.
We previously showed that AAUAAA-dependent polyadenylation of transcripts of various T^+^ SINEs significantly extends their lifetime in cells [58]. Most likely, this also applies to C1 transcripts. We believe that the polyadenylation of T^+^ transcripts represents an important strategy of L1-dependent SINE retrotransposition [5,36]. Apparently, the evolutionary success of C1 SINE, manifested in its huge abundance in the genomes of rabbits, hares, and pikas, is at least partly due to this ability of C1 RNA to acquire poly(A) tails upon completion of transcription.
5. Conclusions
More than one million copies of C SINE have been detected in the genomes of each of the five studied species of the order Lagomorpha. The genomes of rabbits and hares (Leporidae) contain two subfamilies of this SINE, namely C1 and C2. Copies belonging to the C1 subfamily are five times more common than C2 copies. Only the C1 subfamily is present in the genomes of pikas (Ochotonidae). The obtained results indicate that SINE C1 arose before the divergence of Lagomorpha into Leporidae and Ochotonidae, i.e., no less than 60 million years ago. The C2 subfamily emerged in the Leporidae lineage significantly later and still retains retrotranspositional activity. In contrast to C2, C1 copies belong to T^+^ SINEs, meaning that pol III-generated transcripts of its copies are potentially capable of polyadenylation. Experiments have shown that effective polyadenylation of C1 transcripts requires, in addition to PAS (AATAAA), an extended polypyrimidine motif, which is characteristic of C SINE and several families of T^+^ SINEs in the genomes of other mammals. Analysis of early rabbit embryo transcriptomes demonstrated that pol III-mediated transcription of C SINE begins at the 16-cell stage and becomes active at the morula stage.
References
- 1. Deininger P. Alu elements: Know the SINEs. Genome Biol. 2011, 12, 236. doi:10.1186/gb-2011-12-12-236. PMID: 22204421. PMCID: PMC3334610.
- 2. Kramerov D.A., Vassetzky N.S. SINEs. Wiley Interdiscip. Rev. RNA 2011, 2, 772–786. doi:10.1002/wrna.91. PMID: 21976282.
- 3. Vassetzky N.S., Kramerov D.A. SINEBase: A database and tool for SINE analysis. Nucleic Acids Res. 2013, 41, D83–D89. doi:10.1093/nar/gks1263. PMID: 23203982. PMCID: PMC3531059.
- 4. Chen J.-M., Ferec C., Cooper D.N. LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease: Mutation detection bias and multiple mechanisms of target gene disruption. BioMed Res. Int. 2006, 2006, 56182. doi:10.1155/JBB/2006/56182. PMID: 16877817. PMCID: PMC1510945.
- 5. Kosushkin S.A., Ustyantsev I.G., Borodulina O.R., Vassetzky N.S., Kramerov D.A. Tail Wags Dog’s SINE: Retropositional Mechanisms of Can SINE Depend on Its A-Tail Structure. Biology 2022, 11, 1403. doi:10.3390/biology11101403. PMID: 36290307. PMCID: PMC9599045.
- 6. Ferrigno O., Virolle T., Djabari Z., Ortonne J.P., White R.J., Aberdam D. Transposable B2 SINE elements can provide mobile RNA polymerase II promoters. Nat. Genet. 2001, 28, 77–81. doi:10.1038/ng0501-77. PMID: 11326281.
- 7. Su M., Han D., Boyd-Kirkup J., Yu X., Han J.J. Evolution of Alu elements toward enhancers. Cell Rep. 2014, 7, 376–385. doi:10.1016/j.celrep.2014.03.011. PMID: 24703844.
- 8. Policarpi C., Crepaldi L., Brookes E., Nitarska J., French S.M., Coatti A., Riccio A. Enhancer SINEs Link Pol III to Pol II Transcription in Neurons. Cell Rep. 2017, 21, 2879–2894. doi:10.1016/j.celrep.2017.11.019. PMID: 29212033. PMCID: PMC5732322.
