MER57E3 transposable elements regulate gene expression in a human cell model of neural development
Michelle Almeida da Paz, Umut Yildiz, Minyoung Kim, Víctor Campos-Fornés, Marina Pinkasz, Thomas Dahlet, Kyung-Min Noh, Leila Taher

TL;DR
This study shows that MER57E3 transposable elements help regulate gene expression during human neural development, potentially impacting neurodevelopmental disorders.
Contribution
The study identifies MER57E3 as a regulatory element in neural development through epigenetic profiling and CRISPRi experiments.
Findings
MER57E3 elements show active histone modifications in human neural cell types.
MER57E3 is enriched near zinc finger genes and homeodomain motifs recognized by brain-specific transcription factors.
CRISPRi targeting MER57E3 leads to downregulation of neurogenesis-related genes PAX6 and NEUROG2.
Abstract
Long dismissed as mere genomic parasites, transposable elements (TEs) are now recognized as major drivers of genome evolution. TEs serve as a source of cell-type specific cis-regulatory elements, influencing gene expression and observable phenotypes. However, the precise TE regulatory roles in different contexts remain largely unexplored and the impact of TEs on transcriptional regulatory networks and contribution to disease risk is likely deeply underestimated. Using a multimapper-aware strategy, we systematically characterize the epigenetic profile of TEs in human cell systems modeling neural development. This analysis reveals that MER57E3, a primate-specific TE subfamily, exhibits strong enrichment for active, and absence of repressive, histone modifications across six cultured human neural cell types. MER57E3 copies are predominantly located near zinc finger genes and enriched for…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4- —https://doi.org/10.13039/501100002428Austrian Science Fund
- —Graz University of Technology
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsChromosomal and Genetic Variations · Genome Rearrangement Algorithms · Microtubule and mitosis dynamics
Background
Transposable elements (TEs) are mobile genomic elements that move –transpose– either via a 'copy-and-paste' mechanism, involving reverse transcription of RNA into DNA for insertion at new genomic loci, or a 'cut-and-paste' mechanism, involving direct excision and insertion. Constituting over half of the human genome [1], TEs are hierarchically classified into families and subfamilies, each representing copies derived from a single ancestral insertion. Although most TE copies have lost transpositional activity, leading to a vast accumulation of fossil sequences, a subset, including Alu, LINE-1 (L1), and SINE-VNTR-Alu (SVA) elements, retain their transpositional capacity [2].
Long overlooked as genomic dark matter, TEs are now understood to play a crucial role in eukaryotic genome evolution and architecture [3–5]. TEs can contribute to genomic diversity through various mutational events, including insertions, deletions, and structural rearrangements. In addition, TEs have been shown to be co-opted as cis-regulatory elements, thereby rewiring pre-existing regulatory networks [6–8]. Thus, the contribution of TEs to transcription factor binding site (TFBS) expansion is well documented [9–11], and this expansion appears to stem, at least in part, from the intrinsic capacity of TEs to acquire TFBSs, potentially to facilitate their own replication within host cells [12]. Remarkably, retrotransposition is hypothesized to have facilitated the convergent amplification of TFBSs for CTCF, a key regulator of chromatin insulation, in multiple mammalian lineages [13]. Moreover, a growing body of evidence supports the regulatory roles of TEs in physiological contexts like immune responses [14], early embryonic development [15, 16], and pregnancy [17, 18]. In addition, TE epigenetic dysregulation has been linked to genetic diseases [19], autoimmune diseases [20], neurodegenerative diseases [21], and cancer [22–24].
The need of studying TE epigenetic profiles is evident. However, precise characterization of TE epigenetic profiles is hindered by the limitations inherent in short-read next-generation sequencing (NGS) technologies, such as chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq). The repetitive nature of TEs poses significant challenges for standard bioinformatics [25, 26]. Specifically, the high sequence similarity of the TEs results in the inability to precisely trace the original location from which TE-derived reads originate, which poses a real bioinformatics challenge. As a consequence, TE-derived reads often map with equally high scores to multiple genomic locations, increasing the fraction of ambiguously mapping reads (“multimappers”). Despite the progress in the development of strategies to cope with sequences derived from TEs, there is no consensus on how bioinformatics analysis pipelines should tackle multimappers, which are often disregarded [25], resulting in the underrepresentation of TEs [27, 28]. In addition, this problem is magnified by the use of short-read and single-end NGS sequencing [29, 30]. Although novel long-read sequencing technologies are emerging [31], they still present some limitations in terms of lower throughput and higher error rates.
To characterize the epigenetic landscape of TEs, we leveraged T3E [32], our recently developed tool. T3E addresses the inherent uncertainty of multimapper read origin by distributing reads across multiple genomic loci, enabling robust quantification of epigenetic modifications at TE families and subfamilies. Applying T3E to ChIP-seq datasets for H3K4me3, H3K27ac, H3K9me3, and H3K27me3 from the ENCODE consortium, encompassing diverse neural cell types, provided a granular dissection of TE epigenetic modifications. Analysis revealed a subset of TEs, notably the MER57E3 subfamily, exhibiting striking enrichment for the histone modifications H3K4me3 and H3K27ac, associated with active transcription and accessible chromatin. This observation suggested a potential role for MER57E3 as a neurogenic regulator, a hypothesis that we experimentally validated through CRISPR interference (CRISPRi) experiments. CRISPRi-mediated silencing of MER75E3 elements resulted in downregulation of neural progenitor cell (NPC) state-maintaining transcription factors, including PAX6 and NEUROG2, and upregulation of neuronal differentiation-promoting genes.
Results
TE subfamilies present distinct cell-type specific epigenetic profiles
To explore the epigenetic profiles of TE subfamilies we used the T3E tool and publicly available human neural cell ChIP-seq datasets (Methods). Specifically, we assessed enrichment relative to input controls for H3K4me3, H3K27ac, H3K9me3, and H3K27me3 across 1,159 TE subfamilies in six cell types: astrocytes, neural progenitor cells (NPCs), neural crest cells (NCs), neurons, dorsal progenitor cells (DPCs), and neural stem cells (NSCs) (Additional file 2: Fig. S1). While H3K4me3, typically found at active promoters, and H3K27ac, enriched at active enhancers and promoters, are associated with transcriptional activation, H3K9me3, characteristic of constitutive heterochromatin, and H3K27me3, indicative of facultative heterochromatin and Polycomb-mediated gene repression, are linked to transcriptional silencing.
Enrichment levels were generally modest. Of the total TE subfamilies analyzed, 688 (59%) exhibited no enrichment \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC < 0.5\right)$$\end{document} , while an additional 171 (15%) showed weak enrichment ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.5\le {\mathrm{log}}_{2}\;FC<1$$\end{document} ) for a single histone modification in one cell type (Fig. 1A; Additional file 1: Table S1). Only 300 TE subfamilies featured more extensive enrichment, with 169 displaying weak enrichment for at least two histone marks or in at least two cell types, 114 exhibiting robust enrichment \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(1\leq\log_2\;FC\;<2\right)$$\end{document} and 17 showing extreme enrichment \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC\;\geq\;2\right)$$\end{document} . Among these TE subfamilies, enrichment was most frequently observed for H3K27ac (171, 57%) and H3K9me3 (155, 52%), which were particularly prevalent in NCs (106 and 75 TE subfamilies, respectively). Furthermore, enrichment for H3K9me3 in NCs was detected for 106 TE subfamilies (35%), making it the most frequent finding overall and indicating that a large fraction of TEs are subject to transcriptional repression. Importantly, different TE subfamilies exhibited distinct epigenetic profiles across the cell types of interest. For example, while the UCON88 subfamily displayed robust or extreme enrichment for H3K27me3 in astrocytes, neurons and NPCs, for H3K4me3 in NCs and neurons, and for H3K27ac in NCs, the LTR26E subfamily showed enrichment for H3K4me3 in astrocytes, neurons, NPCs and NCs, and for H3K27ac in NCs. These findings were reproducible across biological replicates (Additional file 2: Figs. S2 and S3) and suggest that TEs are functionally diverse in a cell-type-specific manner.
Aiming to systematically characterize their epigenetic profiles, we next trained a self-organizing map (SOM) on the enrichment profiles of the 300 TE subfamilies with more extensive enrichment (Methods; Additional file 1: Table S2). Notably, the distribution of TE subfamilies across the SOM units was not uniform (Fig. 1B), with a median of 2, highlighting that specific epigenetic profiles are associated with varying numbers and classes of TE subfamilies (Fig. 1B and C). While 25 units were associated with a single TE subfamily, five units contained ten or more. Further analysis of the SOM revealed eight clusters (Fig. 1B), each reflecting varying degrees of TE activity and/or repression within specific neural cell types. The distinct composition of TE families within each cluster (Fig. 1D; Additional file 2: Fig. S4) highlights a strong link between specific TE sequences and unique epigenetic profiles.Fig. 1. Self-Organizing Map (SOM) analysis uncovers distinct epigenetic profiles across TE subfamilies. A Waffle chart depicting the number of TE subfamilies with varying enrichment levels. This chart visualizes the distribution of TE subfamilies across the dataset, highlighting those with the most pronounced enrichment. B A two-dimensional (2D) representation of the SOM, where each unit (node) represents a distinct epigenetic profile based on the enrichment of four histone modifications (H3K4me3, H3K27ac, H3K9me3, and H3K27me3) across six human neural cell types. Each unit is colored according to its assigned cluster, reflecting groups of TE subfamilies with similar epigenetic profiles. TE subfamilies are mapped onto their corresponding SOM units. C Codebook SOM planes. A codebook represents the epigenetic profile of a SOM unit. Each plane represents one of the four histone modifications (H3K4me3, H3K9me3, H3K27ac, H3K27me3) in each of the six human neural cell types (astrocytes, NPCs, NCs, neurons, DPCs, and NSCs) used as input for the SOM. The color intensity of each unit indicates the level of enrichment for that histone modification. D Heatmaps showing the enrichment profile of TE subfamilies in each cluster. Each row represents a TE subfamily, and each column represents one of the four histone modifications in each of the six human neural cell types. The color intensity reflects the enrichment level for each histone modification. Enrichment levels were categorized based on their \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log_2$$\end{document} fold change (FC) values as None \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC\;<0.5\right)$$\end{document} , Weak \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(0.5\;\leq\log_2\;FC\;<1\right)$$\end{document} , Robust \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(1\leq\log_2\;FC\;<2\right)$$\end{document} , or Extreme \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC\;\ge 2\right)$$\end{document}
Two clusters of TE subfamilies, Cluster 5 (29%) and Cluster 2 (9%), were defined by distinct repressive epigenetic signatures. Cluster 5, consisting of ERV1 (43%), ERVK (14%), L1 (13%) and ERVL-mammalian-apparent LTR retrotransposons (MaLRs) (8%) retrotransposons, displayed robust H3K9me3 enrichment in NCs and a weaker signal in neurons. This cluster also contained 50% of annotated SVAs. In contrast, Cluster 2, which contained similar proportions of ERV1 and ERVK but was also rich in unknown TE families (19%), was characterized by robust H3K27me3 enrichment in neurons and a weaker signal in DPCs. The observation that the same silencing mechanism—either H3K27me3 or H3K9me3 mediated—is applied to a broad and evolutionarily diverse range of TE subfamilies highlights a general strategy for maintaining genomic stability in specific neural lineages.
In contrast to the repressive profiles of Clusters 5 and 2, Cluster 7 (16%), composed mainly of unknown (35%) and DNA (33%) elements, showed robust H3K27ac enrichment in NCs and a weaker signal in astrocytes, a hallmark of active enhancers. A predominantly active profile was also observed for the elements of Cluster 3 (7%), which comprised unknown (38%) and DNA (29%) elements displaying robust H3K4me3 and weak H3K27ac enrichment in astrocytes, alongside weak H3K27ac in DPCs and NCs and weak repressive H3K27me3 in neurons. This profile suggests a strong promoter-like role in astrocytes that contrasts with weaker regulatory roles in other neural lineages.
Other clusters presented more complex, context-dependent epigenetic states. For example, Cluster 1 (13%), mainly consisting of ERV1 (16%) and L1 (12%) elements, exhibited robust repressive H3K9me3 enrichment in astrocytes coupled with a weak active H3K4me3 mark in neurons. Cluster 4 (18%), primarily composed of Alu (43%) and ERV1 (28%) elements, displayed weak enrichment for H3K27ac in astrocytes and a bivalent epigenetic signature in NSCs, with weak co-occurring enrichment for the active H3K27ac mark and the repressive H3K27me3 mark, suggesting an active regulatory function in astrocytes and a poised regulatory function in NSCs. Finally, Cluster 8 (3%), containing mostly of L1 (30%) and SVA (30%) elements, was characterized by a mosaic of active and repressive marks, co-occurring within some cell types, indicating highly context-dependent roles as regulatory elements.
Cluster 6, a small (14 TE subfamilies, 5%) but distinct group of TE subfamilies including the MER57E3 subfamily, which exhibited the maximum overall enrichment \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log_2\;FC=5.05$$\end{document} for H3K4me3 in neurons) and belongs to the LTR retrotransposon endogenous retrovirus (ERVs) ERV1 family, featured extreme enrichment for H3K27ac and H3K4me3 in NCs, and weaker but consistent enrichment for one or both of these active marks in DPCs, NPCs, neurons, and astrocytes, suggesting that these retrotransposons function as active cis-regulatory elements, likely enhancers or promoters, within these intermediate cell populations. Furthermore, the epigenetic profile of this cluster featured a sharp absence of active histone modifications in NSCs, representing an earlier progenitor state, while some members of the cluster demonstrated co-enrichment for H3K4me3 and H3K9me3 in differentiated astrocytes and neurons. This bivalent signature points to a transient activation within specific lineages, with subsequent silencing occurring as these cells mature, and would be consistent with these TE families contributing to the establishment of neural cell identity.
In summary, the distribution of TE subfamilies across the SOM revealed that the epigenetic landscape of TEs is highly variable depending on the specific cell type or state, providing a foundation for investigating the functional roles of TEs in neural development and function.
Endogenous retroviral TEs as potential active regulators in neural development
Cluster 6, comprising the subfamilies CRP1, HERV1_LTRb, LTR10B1, LTR10B2, LTR12E, LTR26E, LTR76, MER51C, MER51D, MER57E3, UCON25, UCON68, UCON88, and UCON89, displayed a striking epigenetic profile. Specifically, these TEs demonstrated strong enrichment for active histone modifications (H3K4me3 and H3K27ac) alongside depletion of repressive marks (H3K9me3 and H3K27me3) in DPCs, NCs, and NPCs. These findings indicate that these TEs, most of which are LTRs derived from ancient retroviral or retrotransposon activity [1] act as cis-regulatory elements contributing to specific stages of neural development or particular nervous system lineages.
Within Cluster 6, the MER57E3 subfamily displayed the most pronounced enrichment for active histone modifications, with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} values ranging from 1.16 to 5.05 for H3K4me3, and from 0.51 to 2.56 for H3K27ac. In contrast, this subfamily showed minimal enrichment for repressive histone modifications, with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} values varying from –0.30 to 0.07 for H3K9me3, and from –0.89 to 0.24 for H3K27me3. These findings suggest that MER57E3 may serve as an active regulator of gene expression in neural cells, consistent with reports for other medium reiteration frequency sequences (MERs) in immune system regulation [33], placental development [34], and in brain context [35].
The primate-specific MER57E3 subfamily comprises 252 copies within the human genome, predominantly located on chromosomes 19 (154 copies, 61%), 7 (42, 17%), and 21 (18, 7%). Most MER57E3 copies are located near gene promoters (Fig. 2A), often within the first intron, with 179 of them (71%) residing within 2 kbp of a transcription start site (TSS). Notably, many of these genes encode zinc finger proteins (ZNFs) (135, 75%), prominent regulators of transcription in mammalian genomes, an observation that has prompted the proposal of a co-evolutionary mechanism [36–39]. Consistent with the proximity of MER57E3 copies to ZNF genes, functional enrichment analysis revealed that genes near MER57E3 copies are associated with DNA-binding transcription repressor and activator activities (Additional file 2: Fig. S5). Furthermore, analysis of the MER57E3 sequences uncovered significant enrichment for binding sites of transcription factors known to play critical roles in neurogenesis and neuronal maturation. Specifically, [C/T]AATTA homeodomain motifs, the targets of gastrulation brain homeobox 1 (GBX1) and brain specific homeobox (BSX), presented the highest enrichments ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FCs$$\end{document} 2.29 and 2.18, respectively) (Fig. 2B, Additional file 2: Fig. S6). These findings further support the hypothesis that MER57E3 elements contribute to neural-specific gene expression regulation.Fig. 2. The MER57E3 subfamily regulates gene expression in cultured human neural cells.** A** Fraction of all 252 MER57E3 copies annotated according to genomic regions. Regions within 2 kbp of a TSS are annotated as promoters or proximal introns. MER57E3 copies preferentially integrate near the beginning of genes, either in the promoter region or within the first intron. B Top 10 transcription factor regulatory motifs found within MER57E3 copies ranked by the enrichment \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC\right)$$\end{document} , compared to their expected frequency in the entire human genome as a background reference. Many of the enriched motifs embedded in MER57E3 copy sequences are recognized by homeobox transcription factors, including those that are brain-specific.** C** The heatmap shows the enrichment \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC\right)$$\end{document} of the MER57E3 TE copies in comparison to the respective ChIP-seq input control for each condition, comprehending a specific brain cell type and histone modification. Columns and rows are clustered using Euclidean distance and complete linkage. MER57E3 copy 1 is highlighted by an asterisk. Epigenetic profiles discovered across copies are suggested to be context-specific. D Inset view of genomic region containing MER57E3 copy 1 and the ZNF678 gene using web-app Integrative Genomics Viewer (IGV) shows peaks for active histone modifications (H3K4me3 and H3K27ac) and absence of repressive histone modifications (H3K9me3 and H3K27me3) in NPCs. MER57E3 copy 1 is a potential active regulator in cultured human neural cells
To investigate epigenetic heterogeneity among the 252 MER57E3 copies, we implemented a locus-level, multimapper-aware coverage strategy conceptually based on T3E (Methods). Specifically, sequencing reads were weighted proportionally to the number of potential loci of origins. This analysis corroborated the previously observed high H3K4me3 and H3K27ac enrichment in 36–71% of MER57E3 copies across cell types, but also uncovered a substantial proportion of copies with more complex epigenetic profiles (Fig. 2C, Additional file 1: Table S3). Notably, 6–14% of copies exhibited co-enrichment for H3K9me3 and H3K27me3, suggesting potential for both activating and repressive regulatory functions. Furthermore, 20–55% of MER57E3 copies displayed basal enrichment levels, indicative of likely regulatory inactivity.
NPCs presented the highest number of active MER57E3 copies, with 179 (71%) presenting an active epigenetic profile. Of these copies, 124 (69%) were enriched for active histone modifications but not repressive ones, with 21 specifically exhibiting co-enrichment for both active marks. These MER57E3 copies varied in length from 16 to 493 bp, and exhibited robust to extreme enrichment for H3K4me3 (log2FCs from 2.6 to 5.1) and H3K27ac (log2FCs from 1.0 to 3.3). Among the aforementioned 21 MER57E3 copies, MER57E3 copy 1 (chr1:227,564,046–227,564,512) displayed a notably pronounced co-enrichment for H3K4me3 and H3K27ac ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} s 4.09 and 2.03, respectively; FDR adjusted P-value \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$< 0.02$$\end{document} ) and no enrichment for H3K9me3 and H3K27me3 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC s$$\end{document} –1.77 and –1.26, respectively) (Fig. 2D). Additionally, MER57E3 copy 1 overlaps or is located in the immediate proximity of ENCODE candidate cis-regulatory elements, including proximal enhancer-like regions (EH38E1428496 and EH38E1428495), and shows a H3K4me3 peak in an independent dataset from an NPC model (Additional file 2: Fig. S7). Moreover, this particular copy contains two instances of the TAATTA sequence, a motif presumably recognized by the homeodomain transcription factors GBX1 and BSX (chr1:227,564,288–227,564,293 and chr1:227,564,440–227,564,445). Remarkably, MER57E3 copy 1 demonstrated a consistently active epigenetic profile, marked by a strong enrichment for H3K4me3 and H3K27ac and a striking absence of repressive histone modifications H3K9me3 and H3K27me3, not only for NPCs, but also across all other cell types in the dataset, suggesting a potential for widespread activating regulatory activity and warranting experimental characterization.
Multimappers constituted 23–40% of the reads in our NPC ChIP-seq libraries, and their exclusion led to an overall under-quantification of TE subfamilies, including MER57E3 (Methods; Additional file 2: Fig. S8). Nonetheless, the enrichment profiles of the MER57E3 subfamily and MER57E3 copy 1 remained largely consistent with those obtained with our multimapper-aware approach (Additional file 2: Fig. S9). This likely reflects the relatively high divergence of the MER57E3 subfamily, and thereby rules out potential methodological artifacts associated with multimappers.
Targeted CRISPRi assay confirms MER57E3 as a neurogenic regulator
With the goal of validating the regulatory role of the TE MER57E3 subfamily in early human neural development, we conducted CRISPR interference (CRISPRi) experiments in NPCs with single guide RNAs (sgRNAs) targeting single and multiple MER57E3 copies (Additional file 1: Table S4, see Methods for details). A lacZ-targeting sgRNA was used as a non-targeting (NT) control. Briefly, CRISPRi machinery-expressing human induced pluripotent stem cells (hiPSCs, n = 2 biological replicates) were transduced with the designed sgRNAs, selected with puromycin, and differentiated into NPCs. Subsequently, we extracted total RNA for mRNA-sequencing (Fig. 3A, Methods). RNA sequencing of wild-type NPCs was performed to establish a baseline for the identification of CRISPRi-induced gene expression changes (Additional file 2: Fig. S10A).Fig. 3MER57E3 multi-copy and copy 1 targeted CRISPRi perturbations validates its role as neurogenic regulator.** A** Strategy used for targeted CRISPRi experiments for single and multiple MER57E3 copies. Lentiviruses encoding designed sgRNA sequences were used to infect CRISPRi machinery-expressing hiPSCs. Following sgRNA transduction, cells were selected using puromycin and differentiated to NPCs. On day 7, differentiated cells were harvested and total RNA was selected for mRNA library preparation and sequencing. Figure was created using BioRender.** B** Volcano plot shows differentially expressed genes for the CRISPRi-mediated silencing of MER57E3 copy 1 compared to the NT control (“NT-ctrl”). MER57E3 copy 1 silencing disrupts the regulation of genes related to early neural development. C The largest subnetwork of protein–protein interaction comprehends down-regulated (blue) and up-regulated (red) genes upon MER57E3 copy 1 silencing. D The members of the largest subnetwork are associated with nervous system development and neuronal differentiation, with axon guidance and axonogenesis presenting the highest confidence level of the predicted interactions (“strength”). Perturbation of MER57E3 copy 1 promotes early commitment to neural differentiation, preceding full maturation. Strength represents the ratio between the proteins annotated with a term to the number expected in a random network of the same size. E The heatmap shows the expression \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC\right)$$\end{document} of the DEGs upon MER57E3 copy 1 silencing in all conditions (multi-copy and ZNF678 CRISPRi perturbations) compared to NT-ctrl. Columns and rows are clustered using Euclidean distance and complete linkage. Results reveal that the same trend was observed across all CRISPRi perturbations
CRISPRi-mediated silencing of MER57E3 copy 1 resulted in a distinct phenotypic alteration characterized by reduced proliferation (Additional file 2: Fig. S11), and the differential expression of 245 genes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\vert\log_2\;FC\vert\;\geq1\right)$$\end{document} and FDR adjusted P-value \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$< 0.05$$\end{document} ; Methods) compared to the NT control, with 62 (25%) down-regulated and 183 (75%) up-regulated genes (Fig. 3B, Additional file 1: Table S5). Four of those genes were also differentially expressed when comparing the wild-type NPCs to the NT control, and excluded from further analyses. Among the down-regulated genes, ZNF678 presented the strongest expression change ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = -5.27$$\end{document} ), confirming effective on-target silencing by the CRISPRi system at the MER57E3 copy 1 (Fig. 4A and B) and suggesting that MER57E3 copy 1 acts as a regulatory element of ZNF678. Interestingly, CRISPRi-mediated silencing of MER57E3 copy 1 also resulted in the down-regulation of key early neural development transcription factors. Specifically, we observed decreased expression of PAX6 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = -1.13$$\end{document} ), a master regulator of NPC identity [40, 41], and NEUROG2 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = -1.42$$\end{document} ), a crucial transcription factor during neurogenesis that interacts directly with PAX6 [42]. Moreover, ID1, a transcriptional regulator that inhibits basic helix-loop-helix (bHLH) transcription factors and has been implicated in a variety of cellular processes, including self-renewal and stemness [43], was also downregulated ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = -1.05$$\end{document} ). Conversely, many up-regulated genes were involved in specific neural circuitry formation, particularly axon guidance and synapse formation. These are exemplified by OLFM3 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = 2.34$$\end{document} ), a secreted glycoprotein mediating cell adhesion and synaptic plasticity via extracellular interactions [44], and FOXD3 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = 2.6$$\end{document} ), a transcription factor regulating neuronal differentiation and subtype specification [45]. To identify key regulators and hubs potentially modulated by MER57E3 copy 1, we constructed a protein–protein interaction (PPI) network (Methods). Analysis of this network revealed 13 distinct connected subnetworks, ranging in size from two to 36 genes. Surprisingly, ZNF678 did not have any direct interaction partners in the network. The largest subnetwork (Fig. 3C) comprised PAX6, NEUROG2, and ID1, and its members were associated with nervous system development and neuronal differentiation, among other biological processes related to anatomical structure and cell differentiation (Fig. 3D). PAX6 downregulation following CRISPRi-mediated silencing of MER57E3 copy 1 was confirmed via qPCR (Additional file 2: Fig. S12). These findings suggest that perturbation of this particular MER57E3 copy induces a transition from a plastic neural progenitor state to a more specialized, functionally mature state.Fig. 4. On- and off-target analysis confirms the effectiveness of our CRISPRi system. A A snapshot of the genomic region containing MER57E3 copy 1 and ZNF678 gene, created with the web-app Integrative Genomics Viewer (IGV). The data show a reduction in ZNF678 expression upon CRISPRi-mediated perturbations targeting MER57E3 copy 1 and ZNF678 for both biological replicates (clones 10 and 11). No expression of MER57E3 copy 1 was detected under any condition, including the non-targeting control (“NT-ctrl”). The inset focuses on MER57E3 copy 1, within the first intron of ZNF678. B Quantitative real-time PCR analysis confirms effective silencing of ZNF678 expression following CRISPRi-targeting of both MER57E3 copy 1 and ZNF678 promoter. Asterisks denote significant adjusted P-values: * < 0.05, ** < 0.01, *** < 0.001, and **** < 0.0001. C Genes in the neighborhood of predicted off-target sites were associated with binding-related molecular functions. Gene Ontology (GO) analysis was performed using GREAT, considering predicted off-target sites within 20 kbp windows, allowing for up to one (MM1), two (MM2), three (MM3), and four (MM4) mismatches in each single-guide RNA (sgRNA). Adjusted P-value threshold was set to 0.05. No significant GO terms were identified for off-target site predictions allowing for up to one mismatch (MM1). D Analysis of our RNA-seq data revealed that only 11 of the genes within 20 kbp of the predicted off-target sites were differentially expressed upon the perturbation of MER57E3 copy 1
An off-target prediction analysis (Methods) confirmed that the observed phenotype was most likely attributable to our intended target. Thus, the designed sgRNAs preferentially targeted other MER57E3 copies or closely related TE subfamilies, even under extremely lenient conditions (Additional file 1: Table S6; Additional file 2: Fig. S13). Moreover, the genes within a 20 kbp region of the predicted off-target sites were not associated with any relevant neural functions (Fig. 4C) and only 11 were differentially expressed (Fig. 4D). None of these genes or their known interaction partners were in the subnetwork containing PAX6, NEUROG2, and ID1 dysregulated upon CRISPRi-mediated silencing of MER57E3 copy 1.
To investigate whether the differentially expressed genes (DEGs) observed upon MER57E3 copy 1 perturbation were mediated by ZNF678 down-regulation, we performed independent CRISPRi experiments targeting the ZNF678 promoter. CRISPRi-mediated silencing of ZNF678 resulted in 39 DEGs relative to the NT control, comprising 19 (49%) down-regulated and 20 (51%) up-regulated genes (Additional file 1: Table S7, Additional file 2: Fig. S14). As anticipated, ZNF678 exhibited the greatest reduction in expression ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = -4.16$$\end{document} ). Intriguingly, only 7 out of 19 (37%) down-regulated genes were also down-regulated upon perturbation of MER57E3 copy 1 (Additional file 2: Fig. S15). Notably, PAX6 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = -0.87$$\end{document} ), NEUROG2 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = -0.65$$\end{document} ), and ID1 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = -0.57$$\end{document} ) were not significantly down-regulated upon ZNF678 CRISPRi-mediated silencing. However, a substantial overlap was observed among upregulated genes, with 17 out of 20 (85%) up-regulated genes shared between MER57E3 copy 1 and ZNF678 perturbations, including OLFM3 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC = 1.09$$\end{document} ). These data indicate that ZNF678 downregulation only partially mediates the gene expression changes observed following MER57E3 copy 1 perturbation, and that MER57E3 copy 1 likely influences gene expression through additional, ZNF678-independent mechanisms.
Finally, to assess whether gene expression changes following CRISPRi-mediated silencing of MER57E3 copy 1 were specific to its sequence and/or its genomic context, or indicative of a function common to other MER57E3 subfamily members, we performed additional CRISPRi experiments targeting three sets of multiple MER57E3 copies. These multi-copy assays generally resulted in relatively few differentially expressed genes (DEGs; 3 to 41), with a total of 56 (Additional file 1: Tables S8–11). Nevertheless, we observed a substantial overlap, with 68% of these DEGs also identified following CRISPRi-mediated silencing of MER57E3 copy 1. Furthermore, the overall direction of gene expression changes was highly consistent across all experiments, with no significant opposing changes detected (Fig. 3E; Additional file 2: Fig. S10B). Indeed, 81% of differentially expressed genes (DEGs) following CRISPRi-mediated silencing of MER57E3 copy 1 displayed the same directional changes in multi-copy CRISPRi perturbations (Methods), although statistical significance was not reached. For example, PAX6 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} ranging from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-0.41$$\end{document} to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-0.62$$\end{document} ), ID1 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} ranging from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-0.23$$\end{document} to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-0.65$$\end{document} ), and ANOS1 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} ranging from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-0.55$$\end{document} to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-0.75$$\end{document} ) decreased in expression upon all perturbations, while OLFM3 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} ranging from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.34$$\end{document} to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.32$$\end{document} ) and FOXD3 ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} ranging from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.22$$\end{document} to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.87$$\end{document} ) increased. The only two genes (LAMA4 and OLFML3) that were differentially expressed across all CRISPRi-mediated silencing experiments, including that involving MER57E3 copy 1, are extracellular matrix glycoproteins that were also up-regulated in wild-type NPCs. Six additional genes (FAM43A, DDX25, MEDAG, NTNG2, POLR3G, REM2) were up-regulated in three or more CRISPRi-mediated silencing experiments, but not in wild-type NPCs. With the exception of NTNG2, a netrin family member implicated in axon guidance, synapse formation, and neuronal migration [46], these genes possess general cellular functions, suggesting MER57E3 subfamily members may exert regulatory influence beyond neural development.
Together, these findings confirm a key role for MER57E3 in early neural differentiation, whereby perturbation of MER57E3 copies may prematurely induce NPCs to exit their progenitor state and accelerate commitment to neural differentiation before full maturation.
Discussion
Accumulating evidence underscores the role of TEs as a reservoir of functional regulatory elements, driving the evolution of mammalian regulatory networks, thereby impacting diverse cellular processes [47]. For instance, LINE L1 and L2 subfamilies have been shown to serve as evolutionary substrates for novel promoters and enhancers, both general (L2) and tissue-specific (L1) [48, 49]. However, a comprehensive understanding of the diverse functions, adaptive mechanisms, and evolutionary dynamics of TEs remains elusive, largely due to bioinformatics challenges arising from their repetitive nature. In this study, to investigate the potential regulatory functions of TEs during early neural development, we applied our recently introduced T3E strategy [32] to ChIP-seq datasets for four histone marks across six cultured human neural cell types generated by the ENCODE consortium. T3E addresses the challenges associated with sequencing read ambiguity by weighting read contributions inversely proportional to the number of potential genomic origins. Our computational findings were then verified using state-of-the-art genome editing and functional genomics techniques.
Safeguarding genomic integrity necessitates silencing mobile TEs, and this is done through the establishment and maintenance of repressive chromatin modifications. Indeed, the deposition of H3K9me3 has been shown to be critical for suppressing TE mobilization [50]. Our observation that a subset of TEs (113 subfamilies), encompassing SVAs, L1 elements, and specific LTR subfamilies of human endogenous retroviruses (HERVs) and MaLRs, exhibit enrichment for H3K9me3 or H3K27me3, hallmarks of heterochromatin formation and transcriptional repression, aligns with this paradigm. SVAs, as actively retrotransposing elements dependent on L1 element enzymatic machinery [51], are expected targets of such repressive control. Also, while the majority of L1 copies have been rendered inactive by accumulated mutations, a subset of them remains capable of retrotransposition [52] and need to be regulated. Finally, although most HERVs are immobile, their flanking LTRs can act as promoters and enhancers, causing the dysregulation of nearby genes [53, 54]. In contrast, our in silico analysis showed that the majority of TE subfamilies displayed only background levels of enrichment for the examined histone modifications, indicating that they likely possess minimal regulatory potential and pose little threat to the stability of the genome.
TE subfamilies with significant enrichment across the dataset were not exclusively associated with repressive histone modifications. Rather, we also detected a co-occurrence of both active and repressive histone modifications, suggesting that some TEs may function as promoters and enhancers in poised states. Specific TE subfamilies presented bivalent chromatin states. Thus, some UCONs were associated with H3K27me3, a hallmark of Polycomb-mediated gene repression, and H3K4me3, a histone modification typically associated with active promoters, in the same cell type. Meanwhile, certain young Alu Y elements presented a unique combinatorial enrichment pattern, involving simultaneously the repressive H3K27me3 and the active H3K27ac. Distinctly, this pattern was observed only for NSCs, which showed to be an exception when compared to the remaining neural cell types, presenting a more dynamic epigenetic profile which reflects its plasticity and ability to transition from multipotency to a range of neural lineages. While the association between the Polycomb repressive complex 2 (PRC2), which deposits H3K27me3, and TEs has been demonstrated for other eukaryotic species [55], there is no direct evidence showing the interaction of PRC2 and Alus. However, Alu elements have been shown to function as enhancers [56] and to act as self-cleaving ribozymes, with their activity being increased by EZH2, a core component of PRC2 [57]. This distinct chromatin pattern suggests that these human-specific Alu elements generally maintain a heterochromatic epigenetic profile, while also retaining a poised regulatory potential that may be activated depending on cellular context and developmental stage.
Several understudied ancient TE subfamilies, including Eulors, UCONs and MERs, were associated with active histone modifications, suggesting potential regulatory roles in human neural cell types. Some of these ancient TEs appear to frequently harbor active regions, which have been co-opted to regulate host gene expression [6, 58]. Eulor TEs comprise small subfamilies, ranging from 11 to 142 members, that exhibit high conservation across Euteleostomi (i.e., all bony chordates). Their structural similarities to DNA transposons, specifically the presence of terminal inverted repeats, has led to the hypothesis that Eulors are derived from this class of elements [59]. Research dedicated to Eulor TEs remains scarce, but their strong conservation, especially within noncoding regions, is consistent with a potential co-option as cis-regulatory elements [60]. Similar to Eulors, UCONs remain poorly characterized due to the absence of readily identifiable structural features [61]. Nevertheless, they have been implicated in the regulation of developmental genes [62–64] and placenta-specific enhancers [17, 65]. UCONs have been observed to co-occur with other TE subfamilies, frequently MERs, which raises the possibility of common ancestry [66]. Finally, MERs, remnants of ancient retroviral infections integrated into host genomes, have been subject to more extensive investigation than Eulors and UCONs. Their defining characteristics, including palindromic sequences and high GC content [67], facilitate the formation of secondary structures which have been implicated in microRNA (miRNA) activity [68]. Furthermore, accumulating evidence suggests that certain MERs function as cis-regulatory elements, notably enhancers, within mammalian genomes [69, 70]. In our cultured human neural cell dataset, MER57E3–an ancient ERV1 family LTR subfamily–stood out among all other TE subfamilies due to its exceptionally high overall enrichment for active histone modifications, specifically H3K4me3 and H3K27ac. This strong epigenetic profile suggests a potential active role in gene regulation within human neural cells. MER57E3 has already been identified as a relevant gene expression regulator in other biological contexts [71–74]. For example, MER57E3 overlaps with regions of high H3K27ac deposition in adult human testis tissue [71], suggesting an enhancer-like function for certain genes. Other studies have reported low DNA methylation levels for this TE subfamily in normal lung tissues [75], indicating chromatin accessibility. Additionally, MER57E3 has been shown to exhibit a promoter-like state across multiple human tissue and cell types [72], including a non-randomly H3K4me3 signal spreading from TSSs of certain protein coding genes, mostly ZNF genes, in human pluripotent stem cells [73, 74, 76]. Also, as previously reported by Zhang et al. [73], we noted that numerous MER57E3 copies are located downstream of the TSS of a ZNF gene, more specifically, within its first intron. This co-localization has prompted speculation that MER57E3 copies expanded alongside ZNF genes during primate evolution, approximately 80 to 100 million years ago (MYA), rather than retrotransposing independently as typical TEs [72, 73]. However, the functional significance of ZNF-associated MER57E3 copies, particularly their potential role as proximal cis-regulatory elements directly modulating ZNF gene transcription, and, hence, host gene expression, has yet to be definitely established.
Among all copies of the subfamily, MER57E3 copy 1 (chr1:227,564,046–227,564,512) stood out due to its strong co-enrichment for H3K4me3 and H3K27ac. The proximity (371 bp) of MER57E3 copy 1 to the TSS of ZNF678 and its enrichment for H3K4me3 raise the possibility that this element acts as an alternative promoter. However, this copy is also in immediate proximity of candidate pancreas, lung and neuron enhancers identified by the ENCODE consortium [77, 78], and harbors [C/T]AATTA sequence motifs, which are bound by brain-specific homeodomain transcription factors such as GBX1 and BSX [79, 80]. Homeodomain transcription factors are critical regulators of developmental and cell differentiation programs [81], and the conserved [C/T]AATTA homeodomain motif has been reported to be enriched within long-range enhancers and within a recently identified conserved class of cis-regulatory elements, range extenders (REX), which are essential for mediating long-range enhancer-promoter communication [82, 83]. Consequently, the presence of this motif within MER57E3 copy 1 would be consistent with it acting as either a long-range enhancer or a REX element.
To confirm that MER57E3 elements act as active regulators in neural development and to assess if they display functional redundancy, we conducted a series of CRISPRi experiments in NPCs. CRISPRi-mediated perturbation of MER57E3 copy 1 resulted in the down-regulation of transcription factors essential for stemness and NPC identity, including PAX6 and ID1 [84, 85]. In particular, PAX6 is robustly expressed in neural stem/progenitor cells throughout central nervous system (CNS) development and remains expressed in adult neurogenic niches, including the subventricular zone (SVZ) and the subgranular zone (SGZ) of the hippocampus [86]. Additionally, it has been recognized as a master gene regulator in brain development, controlling a large number of gene promoters in NPCs [87]. Consistently, the role of PAX6 has been related to neurodevelopmental conditions, such as autism spectrum disorder [88]. In addition, PAX6 interacts with other regulators of neurogenesis, including *NEUROG2 *[41, 89], which we also found down-regulated upon CRISPRi-mediated silencing of MER57E3 copy 1. In contrast, we observed up-regulation of genes related to neuronal differentiation (e.g., OLFM3 and FOXD3), suggesting that the perturbation of MER57E3 copy 1 pushes cells towards a defective neuronal differentiation, potentially disrupting neurogenesis and contributing to neurodevelopmental disorders.
MER57E3 copy 1 is located within the first intron of ZNF678 and its proximity to the TSS presents a technical challenge in distinguishing its precise regulatory impact from that of the host gene. To address this, we conducted an independent CRISPRi assay targeting the ZNF678 promoter. This assay led to transcriptional changes that mirrored those observed upon silencing MER57E3 copy 1, yet the effects were substantially less pronounced. Our findings therefore support the conclusion that MER57E3 copy 1 regulates ZNF678, but suggest a broader regulatory scope for MER57E3 copy 1.
Three additional CRISPRi experiments used sgRNAs to target distinct sets of multiple MER57E3 copies, and a final CRISPRi experiment combined the sgRNAs from these three sets to target a total of 34 MER57E3 copies. This non-overlapping design allowed us to draw more generalizable conclusions about the regulatory activity of the entire MER57E3 subfamily, rather than conclusions restricted to a single locus. In this setup, the limited availability of the dCas9-KRAB-MeCP2 complex, aggravated by the proliferative nature of NPCs, most likely resulted in a reduced repressive capacity at individual loci. Although the effects were generally more subtle, the gene expression trends were similar, confirming some degree of functional redundancy among MER57E3 copies.
It is worth noting that T3E was specifically designed to characterize the epigenetic profiles of TEs at subfamily level rather than for individual copies. This focus enables the identification of widespread, collective regulatory effects of TEs while addressing the inherent challenges posed by their repetitive sequences, thereby providing a more robust and biologically relevant understanding of their genomic impact. We acknowledge, however, that the method may be underpowered to detect enrichment for TE subfamilies where only a relatively small number of copies exhibit a distinct epigenetic signature. Despite this, our analyses revealed consistent epigenetic enrichments, even if subtle, across several large and highly abundant TE families. For instance, L1HS, L1PA2, and L1PA3 subfamilies showed borderline enrichment for H3K9me3 in NCs, while AluYa5 and AluYb8 displayed H3K27ac and H3K27me3 enrichment in NSCs. Furthermore, SVA A-F subfamilies were enriched for H3K9me3 in NCs, and neurons, with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${log}_{2}\;FC$$\end{document} ranging from 1.11 to 1.54. These findings suggest the presence of sequence features that facilitate distinct regulatory functions, explaining their pervasive role in shaping transcriptional networks.
Taken together, our study not only underscores the significance of TEs in shaping neural transcriptional networks but also specifically highlights elements of the MER57E3 subfamily as critical regulators of neural progenitor cell fate and differentiation.
Conclusions
Our study establishes a framework for investigating the cis-regulatory potential of TEs, highlighting cell-type specific epigenetic profiles across TE subfamilies and individual copies, which may vary depending on genomic context. Furthermore, CRISPRi coupled with genome-wide gene expression analysis revealed that a subset of MER57E3 elements function as regulators in neural development. This activity may hold implications for the pathogenesis of neurodevelopmental disorders.
Methods
Publicly available ChIP-seq neural cell datasets
Raw data (FASTQ files) for selected ChIP-seq libraries (Additional file 1: Table S12) were downloaded from the ENCODE Project data repository (https://www.encodeproject.org/about/data-access/, last accessed on March 2025, [77, 78]).
Repeat annotation
Repeat annotation was retrieved from the RepeatMasker track of the UCSC Genome Browser for the human (GRCh38/hg38) genome assembly. Non-TE repeats, low complexity sequences, simple repeats, satellite DNA, and RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA) were filtered out. TE subfamilies were defined based on the “repFamily” field of the RepeatMasker track [90, 91], totaling 1,159 TE subfamilies. Overlapping or adjacent annotations for the same TE subfamily were merged.
The oldest clade in which a TE subfamily was active was determined using the Dfam web browser [92] (“Clades” column).
Quality control and ChIP-seq read mapping
Quality of raw ChIP-seq sequencing libraries and their respective input controls was assessed using FASTQC (v0.11.9, [93]). Low quality reads were trimmed using Trimmomatic (v0.39, [94]). Bwa mem (v0.7.17, [95]) with the “-a” parameter was used to map reads against the human (GRCh38/hg38) genome assembly and report all mappings. Duplicate reads were filtered out using PICARD (v2.24.0, [96]). Unmapped reads and read mapping to the mitochondrial chromosome and non-chromosomal scaffolds were excluded from further analyses using SAMtools (v1.10, [97]) and BEDtools (v2.27.1, [98]).
TE subfamily enrichment analysis
Enrichments for TE subfamilies were computed with T3E (v1.1, https://github.com/michelleapaz/T3E, [32]). Additional tools were used in the background, such as SAMtools and BEDtools (same versions specified above) and BEDMAP (v2.4.37, [99]). Large sequencing libraries were randomly down-sampled to 20 million reads (default) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log_2$$\end{document} fold-changes \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC\right)$$\end{document} and empirical P-values for TE subfamilies were obtained using N = 20 iterations.
T3E accounts for mapping ambiguity by inversely weighting sequencing read contributions based on their potential genomic origins during enrichment calculation. For robustness, rather than quantifying enrichment for individual TE copies, T3E computes enrichment at subfamily or family level. Importantly, unlike consensus-based methods, which map reads solely to a consensus sequence and consequently perform suboptimally for older, more divergent TEs, T3E maintains genome-wide read mapping, ensuring that uniquely mapping reads also contribute to the final enrichment statistics.
Locus-level TE enrichment analysis
Enrichment analysis of TE at a locus-level was performed by using an adapted version of T3E [32], designed to tackle reads mapping to multiple genomic locations (multimappers). Briefly, similar to the approach taken by T3E, the contribution of a read mapping to the coverage of a genomic position was weighted based on the number of mappings for that read across the entire genome. Specifically, the contribution was inversely proportional to the total number of mappings of the read. To account for differences in sequencing depth, the multimapper-aware coverage was normalized by dividing by the total size of the corresponding ChIP-seq library. A multimapper-aware nucleotide-level coverage was computed for each genomic position \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document} as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$C(n) = \sum_{r_{i} \in R_{n}} \frac{1}{M_{r_{i}}},$$\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R_n$$\end{document} is the set of reads that map across the genomic coordinate of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M_{ri}$$\end{document} is the number of mappings for read \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r_i$$\end{document} . Coverage was normalized to the total number of reads mapping to the chromosome containing \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document} .
Next, the multimapper-aware coverage in BED format was used as input for MACS2 (v.2.2.7.1, [100]). Specifically, peak calling was carried out with the “callpeak” argument on the ChIP-seq library using the corresponding input control for background noise correction. The effective genome size was set to 2.7 Gb (–g hs) and the analysis was performed in the default “narrowPeak” mode. Peaks with a false discovery rate (FDR) equal to or lower than 0.05 were considered significant. Repeat annotations overlapping with significant peaks were considered enriched.
Epigenetic profile characterization
Enrichment levels for TE subfamilies were discretized as follows: i) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\left(\log_2\;FC\;<\;0.5\right)$$\end{document} : none; ii) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.5\;\leq\;\log_2\;FC\;<1$$\end{document} : weak; iii) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1\;\leq\log_2\;FC\;<2$$\end{document} : robust; and iv) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log_2\;FC\geq2$$\end{document} : extreme.
The discretized TE epigenetic profiles of the TE subfamilies exhibiting robust or extreme enrichment for at least one histone modification and cell type, or weak enrichment for at least two histone modifications or a single histone modification across two or more cell types were clustered using the som() function from the kohonen R package v3.0.12 [101] on a hexagonal 10 × 10 grid with \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$rlen\;=100\cdot302$$\end{document} . TEs with no discernible enrichment were excluded to ensure robust pattern detection and clear biological interpretation. Initial codebooks were defined using linear initialization, i.e., weight vectors were spanned by the eigenvectors corresponding to the largest two eigenvalues of the input data using the “somInit()” function of the aweSOM R package v1.3 [102] with parameter "method = pca".
SOM units were clustered based on both their codebook similarity and topological proximity using a hierarchical clustering approach. Both distance components were normalized by their respective maximum values to ensure comparability. The combined distance was calculated as a simple weighted sum of the two distances. The weight for codebook similarity was set to 0.7 and the weight for topological proximity was set to 0.3. Hierarchical clustering was then applied to the calculated combined distance matrix using Ward's D2 method. The number of clusters \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k$$\end{document} was decided based on the Silhouette method, testing values between 6 and 15.
Genomic annotation of TE copies
Genomic annotation of TE copies was performed using the “annotatePeak()” function of the ChIPseeker R package (v1.34.1, [103]). Genomic regions within 2 kbp upstream or downstream of the TSS of a gene were designated as promoter-associated. Annotations were derived from the TxDb.Hsapiens.UCSC.hg38.knownGene and org.Hs.eg.db databases.
Motif enrichment analysis
Transcription factor binding motif enrichment was assessed using the "calcBinnedMotifEnrR()" function of the monaLisa R package (v1.4.0, [104]) and position weight matrices (PWMs) from the JASPAR2020 motif database [105] for vertebrates. The entire human genome assembly (BSgenome.Hsapiens.UCSC.hg38, [106]) served as background.
Putative transcription factor binding sites within MER57E3 copy 1 were identified using the “matchPWM” function from the Biostrings R package v2.66.0 [107] with a minimum score of 90%.
Design of sgRNA sequences
Five independent CRISPRi experiments were designed to assess whether MER57E3 elements act as active regulators in neural development and to determine if they display functional redundancy. One experiment specifically targeted MER57E3 copy 1, using three single-guide RNAs (sgRNAs) to maximize silencing efficiency at this locus. In three additional experiments, referred to MER57E3 sets 1, 2, and 3, a single sgRNA was used to target multiple MER57E3 copies (18, 10, and 7 copies, respectively). A final, “combined” experiment incorporated all three sgRNAs from sets 1 to 3, collectively targeting a total of 34 MER57E3 copies, with two of the sgRNAs having a single overlapping target locus.
The sequences of the corresponding sgRNAs were designed using CRISPR-TE [108] on the human reference genome assembly (GRCh38/hg38) (Additional file 1: Table S4). The criteria for sgRNA selection were designed to ensure the high specificity and efficiency of the CRISPRi system. Thus, sequences were prioritized with no predicted off-target sites annotated as other TE subfamilies, no predicted or minimum number of off-target sites not annotated as TEs, and among them, those with the highest on-target scores. For the CRISPRi experiments targeting multiple MER57E3 copies, sgRNAs were designed to target a large number of copies enriched for active histone modifications in NPCs; among possible choices, we selected sgRNAs with high on-target scores and low predicted off-target activity.
NT control and ZNF678 promoter sgRNAs (Additional file 1: Table S4) were designed based on the work by Sanson et al. [109].
Sense and antisense sequences of the designed sgRNAs were extended with overhang sequences required for cloning.
Cloning of designed sgRNAs
The sgRNA sequences were cloned into a modified CROP-seq lentiviral backbone [110] via BsmBI restriction enzyme digestion and oligonucleotide annealing. For this, 1 µl of each sense and antisense oligonucleotide (100 µM) was mixed with 10 × T4 ligation buffer (NEB) and 5 units of T4 PNK (NEB). The reaction was incubated for 45 min at 37˚C, 3 min at 95˚C and then cooled down at a rate of 0.1˚C per second until reaching 22˚C to facilitate phosphorylation and annealing. For ligation, BsmBI digested plasmid was mixed with 1 µl of oligo duplex (1:200), 10 × T4 ligation buffer (NEB) and T4 ligase (NEB). The reaction was incubated for 16 h at 16˚C and then 10 min at 65˚C for the inactivation of the enzyme. After ligation, plasmid amplification was performed in chemically competent Stbl3 bacteria and constructs were verified by Sanger sequencing. The confirmed sgRNA plasmids were used for subsequent lentivirus production.
Lentivirus production
Lentivirus production was carried out using a second-generation lentiviral system and HEK-293 T cells for packaging. Upon 2 h incubation with Chloroquine-supplemented media (High-glucose DMEM (Gibco) supplemented with non-essential amino acids (NEAA, Gibco, 1x), 10% FBS (Gibco), GlutaMAX (2 mM), penicillin/streptomycin (100 U/ml) and 25 µM Chloroquine (Sigma-Aldrich)), 80%-confluent HEK-293 T cells in 150 mm plates were co-transfected with 37.5 µg gRNA lentiviral plasmids, 20 µg viral packaging proteins plasmid (Pax2, Addgene #12,260) and 9 µg VSV-G envelope proteins plasmid (Addgene #12,259,) using a calcium phosphate transfection kit (Takara Bio). Following 16 h of incubation at 37 °C with 5% CO2, the medium was replaced with collection media (High-glucose DMEM (Gibco) supplemented with non-essential amino acids (NEAA, Gibco, 1x), 10% FBS (Gibco), GlutaMAX (2 mM), penicillin/streptomycin (100 U/ml) and 1 mM sodium butyrate (Sigma-Aldrich)) and incubated for another 30 h. The viral particle-containing media was then collected, cleared of debris, filtered and concentrated using ultracentrifugation (107,000xg, 4 °C, 2 h). The viral particles were resuspended in cold phosphate-buffered saline (PBS) and stored at −80 °C until needed. All procedures were done in S2 biosafety conditions.
Lentivirus titer calculation
To estimate functional viral particle concentration, lentivirus solutions were titrated by serial dilution and used to infect hiPSCs seeded in 96-well Nunc™ culture plates (5000 cells/well). Two human induced pluripotent stem cells clones, 10 and 11, were derived from peripheral blood mononuclear cells (CESCG-295, kindly provided by Dr. Michael Snyder, Stanford University, reference numbers 29904, 30064). Infection was allowed to progress for 24 h, followed by selection with 1 µg/ml Puromycin for 24 h and 2.5 µg/ml Puromycin for an additional 24 h. CellTiter-Blue assay (CTB, Promega) was next used to assess cell viability, following manufacturer’s instructions. Fluorescence was acquired in a plate reader (Tecan, 560 nm excitation/590 nm emission). Lentivirus titer was calculated using the formula indicated below as Transducing Units per ml (TU/ml), normalizing signal to non-infected (0% viability) and no Puromycin (100% viability) controls:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathrm{Titer}\;}\left[\frac{TU}{ml}\right]\;=\;\frac{\%\;of\;relative\;reporter\;signal\;(normalized\;to\;controls)\;\times\;cell\;number\;at\;transduction\;}{total\;reaction\;volume\;\times dilution\;factor}.$$\end{document}CRISPRi-hiPSC line generation, selection and validation
For CRISPRi experiments, clonal hiPSC lines were employed that were engineered to constitutively express a nuclease-deficient Cas9 protein fused to a strong bipartite repressor (KRAB-MeCP2) [111] from the CLYBL safe-harbor locus. These hiPSCs were infected with gRNA-encoding lentiviruses at a multiplicity of infection (MOI) of 0.3 for 24 h. Transduced cells were next selected for 24 h with 1 µg/ml Puromycin and 100 µg/ml G418, followed by an additional 48 h of 2.5 µg/ml Puromycin and 100 µg/ml G418 treatment. Cells were next expanded and used for subsequent experiments or differentiations.
Differentiation of clonal hiPSC lines to neural progenitor cells (NPCs)
Human induced pluripotent stem cells expressing dCas9-KRAB-MeCP2 were maintained under serum-free conditions in E8 flex medium (Gibco, A2858501) on vitronectin-coated (Thermo Fisher Scientific, A14700) plates at 37 °C with 5% CO_2_. During passaging, cells were washed twice with PBS and incubated in a Versene solution (Gibco; 15,040,066; 3 min at 37 °C) to release colonies from the plate. One-tenth of the cell culture volume was transferred to a vitronectin-coated plate with E8 flex medium supplemented with 10 µM Rho kinase inhibitor Y-27632 (ROCKi; Abcam; Ab120129) and 100 units/ml penicillin/streptomycin (Gibco; P4458). The medium was replaced every other day with fresh E8 flex medium and cells were maintained until reaching \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 80\%$$\end{document} confluency. Following gRNA transduction, cells were selected with 1.3 μg/ml puromycin (Invivogen; Ant-pr-1) at day 7. For the differentiation of hiPSCs to NPCs, a monolayer differentiation protocol was employed following the SMADi Neural Induction Kit instructions (Stemcell Technologies; 0858). Briefly, transduced hiPSCs were washed twice with PBS and dissociated by Accutase treatment (37 °C, 5 min). For each differentiation, ~ 400,000 hiPSCs were plated onto poly-L-ornithine (15 µg/ml; Sigma-Aldrich; P4957) and laminin (10 µg/ml; Roche; 11,243,217,001) coated 24-well tissue culture plates in the differentiation medium supplemented with 10 μM ROCKi. Cells were evenly spread and maintained at 37 °C with 5% CO_2_. Medium was replaced every day with fresh differentiation induction medium. Detailed information can be found in the manufacturer’s protocol (https://cdn.stemcell.com/media/files/manual/10000005588-MAN_04.pdf).
mRNA library preparation and sequencing
Differentiated cells were harvested at day 7 of differentiation. Subsequently, total RNA was extracted using QIAwave RNA Mini kit (Qiagen). Then, 400 ng of total RNA was used to select mRNA and convert to cDNA using the NEBNext Poly(A) Magnetic Isolation Module (New England BioLabs, E7490) and sequencing libraries were prepared using the NEBNext Ultra II RNA Library Preparation Kit for Illumina (New England BioLabs, E7770). Each library was quantified on a Qubit and their quality verified on a Bioanalyzer using the Agilent High Sensitivity DNA Kit. Libraries were then pooled and sequenced in single end 1× 100 bp on a Illumina NextSeq 2000.
Analysis of mRNA-sequencing
The quality of raw mRNA-sequencing samples was assessed using FASTQ (v0.11.9, [93]). Reads shorter than 100 bp were removed, and bases with a Phred quality score below 30 were trimmed using Trimmomatic (v0.39, [94]). STAR (v2.7.11, [112] was used to map the reads against the human reference genome assembly (GRCh38/hg38) with parameters “–runMode genomeGenerate” and “–sjbdGTFfile”. The sequence of the human reference genome assembly was obtained from https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh38.primary_assembly.genome.fa.gz (last accessed on March 2025). GENCODE Release 46 [113] (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gff3.gz; last accessed on March 2025) served as the source for gene annotation. Gene expression quantification was carried out using Subread featureCounts (v2.0.3, [114]) with default parameters.
Differential expression analysis was performed with the DESeq2 (v1.38.3, [115]) R package. Specifically, the “DESeq()” function was used with the default options to compare all targeted CRISPRi perturbations (MER57E3 copy 1, ZNF678, sets 1, 2, 3 and combined for MER57E3 subfamilies) to the NT control. Gene symbols were retrieved from R Ensembl BioMart database (v2.54.0, [116]). P-values were corrected for false discovery rate (FDR) using the Benjamini–Hochberg method [117]. Genes with an FDR lower than 0.05 and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log_2\;FC$$\end{document} lower than \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-1$$\end{document} or greater than 1 were considered differentially expressed. Volcano plots and heatmaps were generated using the EnhancedVolcano (v1.16.0, [118]) pheatmap (v1.0.12, [119]) R packages, respectively.
Mapping BAM files were converted to bigWig files using the deepTools (v. 3.5.0, [120]) bamCoverage function with parameters “–normalizeUsing CPM and –binSize 1”.
Quantitative real-time PCR (qPCR) experiments
Extracted total RNA was reversibly transcribed into cDNA using the High-Capacity cDNA Reverse Transcription Kit (Thermo Fisher Scientific) following the manufacturer’s instructions. Quantitative real-time PCR was performed using the Power SYBR® Green PCR Master Mix (Thermo Fisher Scientific) on StepOnePlus Real-Time PCR system (Applied Biosystems). Primers for qPCR reactions targeting ZNF678 (forward pair 1: 5’-CACCCCGAAAGCCGGAAAA-3’; reverse pair 1: 5’-ATGCCAGTAGTCCCTGTAATCT-3’; forward pair 2: 5’-TTACAGGGACTACTGGCATTCA-3’; reverse pair 2: 5’-CAGCTTCTATGTCTTGTCAGTGT-3’) and PAX6 (forward: 5’-TGGGCAGGTATTACGAGACTG-3’; reverse: 5’-ACTCCCGCTTATACTGGGCTA-3’) were obtained from PrimerBank [121]. Cycle threshold (Ct) values were computed using StepOne software (v2.3). Relative gene expression was calculated using the ΔΔCt method [122] and normalized to housekeeping genes (ACTB and GAPDH) and the corresponding experimental control (NT-ctrl) across all CRISPRi perturbations. Technical triplicates from both clones (10 and 11) were treated as independent when applying two-sided Welch tests between each perturbation and NT-ctrl using the function t_test from the R package ggpubr (v.0.6.1, [123]). Multiple testing correction of adjusted P-values was applied using the Benjamini–Hochberg method.
Gene ontology (GO) enrichment analysis
Gene ontology (GO) enrichment analysis was performed using the “enrichGO()” function of the R clusterProfiler package (v4.6.2, [124]) and specifying the “molecular function” ontology category and the “org.Hs.eg.db” annotation database for the human genome as options. P-values were adjusted for multiple testing using the Benjamini–Hochberg method, with the significance threshold set to 0.05. All the remaining parameters were maintained at their default settings.
Protein–protein interaction (PPI) network analysis
Protein–protein interaction (PPI) networks were constructed by mapping the DEGs onto protein–protein interaction data from STRINGdb (v12.0, [125]), based on a medium-to-high interaction score (confidence threshold \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ge$$\end{document} 0.6).
Interaction partners of genes associated with predicted off-target sites were identified using the same approach.
Functional assessment of predicted off-target sites
Off-target prediction was performed using the “CRISPR-Cas9 guide RNA design checker” from Integrated DNA Technologies (IDT, https://eu.idtdna.com/site/order/designtool/index/CRISPR_SEQUENCE, last accessed on August 2025), allowing for up to four mismatches between each of the three sgRNAs targeting MER57E3 copy 1 (chr1:227,564,046–227,564,512) and the MER57E3 copy 1 sequence.
Predicted off-target sites were overlapped with repeat annotation from the UCSC Genome Browser RepeatMasker track. Genomic regions within 20 kbp-window of each of the predicted off-target sites were overlapped with gene annotation from GENCODE Release 46. Functional analysis of the 20 kbp-windows was performed with Genomic Regions Enrichment of Annotations Tool [126] using the default parameters of rGREAT R package (v.4.0.4, [127]).
Supplementary Information
Additional file 1.Additional file 2.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Hancks DC, Kazazian HH Jr. Roles for retrotransposon insertions in human disease. Mob DNA. 2016;7:9.10.1186/s 13100-016-0065-9PMC 485997027158268 · doi ↗ · pubmed ↗
- 2Platt RN 2nd, Vandewege MW, Ray DA. Mammalian transposable elements and their impacts on genome evolution. Chromosome Res. 2018;26:25–43.10.1007/s 10577-017-9570-z PMC 585728329392473 · doi ↗ · pubmed ↗
- 3Todd CD, Deniz Ö, Taylor D, Branco MR. Functional evaluation of transposable elements as enhancers in mouse embryonic and trophoblast stem cells. Elife. 2019;8:e 44344.10.7554/e Life.44344 PMC 654443631012843 · doi ↗ · pubmed ↗
- 4Ostertag EM, Goodier JL, Zhang Y, Kazazian HH Jr. SVA elements are nonautonomous retrotransposons that cause disease in humans. Am J Hum Genet. 2003;73:1444–51.10.1086/380207 PMC 118040714628287 · doi ↗ · pubmed ↗
- 5Uebbing S, Kocher AA, Baumgartner M, Ji Y, Bai S, Xing X, et al. Evolutionary Innovations in Conserved Regulatory Elements Associate With Developmental Genes in Mammals. Mol Biol Evol. 2024;41(10):msae 199.10.1093/molbev/msae 199PMC 1146537439302728 · doi ↗ · pubmed ↗
- 6Hecker N, Hiller M. A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers. Gigascience. 2020;9(1):giz 159.10.1093/gigascience/giz 159PMC 694171431899510 · doi ↗ · pubmed ↗
- 7Grocott T, Lozano-Velasco E, Mok GF, Münsterberg AE. The master control gene initiates spontaneous retinal development via a self-organising Turing network. Development. 2020;147(24):dev 185827.10.1242/dev.185827 PMC 777490433214222 · doi ↗ · pubmed ↗
- 8Andrews, S. (2010). Fast QC: A Quality Control Tool for High Throughput Sequence Data. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 21 Mar 2025.
