Positive Selection Targeted Primate Genes that Encode Transposable Element Repressors
Rachele Cagliani, Diego Forni, Alessandra Mozzi, Roudin Sarama, Uberto Pozzoli, Matteo Fumagalli, Manuela Sironi

TL;DR
The study shows that genes controlling transposable elements in primates have been shaped by positive selection, especially in disordered regions of proteins.
Contribution
The study identifies primate genes under positive selection that encode transposable element repressors, focusing on piRNA pathway and NuRD complex proteins.
Findings
Positive selection targets TE control genes in primates, particularly in intrinsically disordered regions of piRNA pathway proteins.
Three piRNA pathway genes (TEX15, GTSF1, GTSF1L) and four NuRD complex genes show strong evidence of positive selection in human populations.
Selection in disordered regions is linked to changes in sequence patterns and functional ensemble features.
Abstract
Transposable element (TE) mobilization poses a significant fitness challenge to host genomes. Consequently, a variety of systems have emerged to silence TE activity. Just like TEs, such systems are widespread and their evolution is expected to be shaped by intra-genomic conflicts. To test this hypothesis, we performed an evolutionary analysis of TE control systems across different timescales. We show that a substantial fraction of TE control genes were targets of positive selection during primate evolution, with several proteins of the piRNA-pathway showing a considerable number of positively selected sites. In these proteins, selection was strongest in intrinsically disordered regions (IDRs), particularly those with low conformational entropy. We suggest that positive selection modulates IDR properties by introducing changes in sequence patterns and ensemble features properties, which…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig. 1
Fig. 2
Fig. 3
Fig. 4- —Italian Ministry of Health10.13039/501100003196
- —Qaddumi Foundation
- —Al-Quds University
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsChromosomal and Genetic Variations · RNA Research and Splicing · RNA and protein synthesis mechanisms
Introduction
Transposable elements (TEs) are mobile genetic units that are able to move and amplify within host cells. TEs have successfully populated eukaryotic and prokaryotic genomes, and the typical mammal has ∼45% of its genome derived from TEs (Venter et al. 2001; De Koning et al. 2011; Haudiquet et al. 2022; Osmanski et al. 2023). TEs are highly diverse, but can be classified into two major groups depending on the mobilization mechanism (Finnegan 1989; Wicker et al. 2007). Class I elements, also known as retrotransposons, propagate through an RNA intermediate. These include short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), and long terminal repeat (LTR) retrotransposons (Wicker et al. 2007). Class II TEs, also known as DNA transposons, use a DNA intermediate to propagate (Kapitonov and Jurka 2006; Thomas and Pritham 2015). In mammalian genomes, the large majority of TEs belong to class I (Wells and Feschotte 2020).
Although increasing evidence indicates that they significantly contribute to adaptive genome evolution, TEs behave as selfish elements, and their replication at the genomic level can be detrimental (Doolittle and Sapienza 1980; Jangam et al. 2017; Bourque et al. 2018; Joly-Lopez and Bureau 2018; Schrader and Schmitz 2019; Almojil et al. 2021; Almeida et al. 2022). TEs can be mutagenic, either directly (e.g. by insertional disruption of coding or regulatory sequences) or indirectly (e.g. by mediating chromosomal rearrangements), and TE insertions can affect the regulatory environment of nearby genes (Almeida et al. 2022). For these reasons, cellular organisms have evolved several mechanisms to suppress TE activity and to limit its propagation. This is particularly true in the germline, where TE mobilization can lead to vertical transmission of new genomic copies. In mammals, a major TE control pathway hinges on P-element induced Wimpy testis (PIWI)-interacting RNAs (piRNAs). piRNAs form effector complexes with Piwi proteins (Piwis), the latter a germline-specific class of Argonaute proteins (Agos). These complexes guide recognition and silencing of TEs at the transcriptional or post-transcriptional level (Iwasaki et al. 2015; Ozata et al. 2019). For transcriptional silencing, complexes of piRNAs and Piwi use specific interactors (e.g. SPOCD1 and TEX15 in mammals) to promote the recruitment of DNA methyltransferases (DNMTs) and chromatin remodeling complexes to generate a repressive chromatin environment (Almeida et al. 2022). In the case of post-transcriptional control, piRNAs complexed with Piwi, together with co-factors, promote the degradation of TE transcripts in the cytoplasm (Almeida et al. 2022; Arif et al. 2022). One extremely relevant cofactor is GTSF1 (Gametocyte-specific factor 1), which binds the ternary complex of piRNA•PIWI•target to potentiate the cleavage activities of Piwi proteins (Arif et al. 2022). In mammals, the piRNA/Piwi pathway is particularly important in the germline. As a consequence, mutations in several components of the pathway result in the sterility of one or both sexes (Almeida et al. 2022).
Another mechanism of TE control is based on Krüppel-associated box (KRAB) domain-containing zinc finger proteins (KZFPs), which are encoded by an expanded gene family in vertebrates (Thomas and Schneider 2011). Through the KAP1/TRIM28 interactor, KZFPs recruit heterochromatin silencing factors, including DNMTs, heterochromatin protein 1 (CBX1/CBX3), SETDB1, and the NuRD complex (nucleosome remodeling and deacetylation complex). Moreover, additional systems exploit different TE features for recognition and silencing. For instance, the HUSH (human silencing hub) complex (which comprises TASOR, MPHOSPH8, and PPHLN1), together with its effectors (MORC2 and SETDB1), identifies DNA fragments with high adenine content in the sense strand and epigenetically represses them (Seczynska and Lehner 2023). Other mechanisms of TE suppression are instead based on RNA modifications, including N6-methyladenosine, 3′ uridylation, and 5-hydroxymethylcytosine (Almeida et al. 2022).
Because several proteins involved in TE control bind nucleic acids, they are expected to be enriched in intrinsically disordered regions (IDRs) (Liu et al. 2006; Zhao et al. 2021). IDRs do not adopt a stable three-dimensional structure under physiological conditions, but rather exist in conformational ensembles of energetically accessible, rapidly interconverting structures (Holehouse and Kragelund 2024). IDR amino acid composition, as well as the clustering and patterning of charged, polar, and hydrophobic residues, determine key properties of the conformational ensemble and, consequently, modulate functional properties (Holehouse and Kragelund 2024). IDRs are abundant in mammalian proteomes and, as a consequence of their limited structural constraints and tolerance to change, they are generally considered to evolve at a faster rate compared to structured regions (Holehouse and Kragelund 2024). This might be relevant in the context of TE control proteins, as the antagonistic effect exerted by cellular pathways against TE propagation is expected to result in intra-genomic conflicts (Hurst et al. 1996; Werren 2011). Indeed, TE propagation depends on evading cellular control systems and replication in the germline, while hosts need to silence TEs to ensure genome integrity and fertility. In line with these concepts, it was previously shown that KZFPs diverged and amplified in vertebrate lineages in response to the emergence of novel retroelements (Thomas and Schneider 2011; Jacobs et al. 2014) and de novo DNA methyltransferases show dynamic evolution in primates and rodents (Molaro et al. 2020). In Drosophila, a significant subset of piRNA pathway genes evolved under positive selection, and the same holds true for teleost fishes (Obbard et al. 2009; Lee and Langley 2012; Simkin et al. 2013; Yi et al. 2014; Parhad and Theurkauf 2019). Recently, signals of adaptive evolution were also reported in the HUSH complex effector MORC2 in primates (Lasserre et al. 2023).
Recent TE activity has been documented in the human genome and about 100 evolutionarily young LINE1 elements are still mobile in our genomes (Osmanski et al. 2023). These autonomous LINE1s provide the enzymatic machinery for their own integration, as well as for the insertion of the nonautonomous SINEs (Alu and SVA). Recent estimates indicated a retrotransposition rate for LINE1, Alu, and SVA ranging from one in 40 to one in 60 live births, depending on the element (Feusier et al. 2019). These new insertions have the potential to cause human genetic diseases, and elevated LINE1 expression and retrotransposition have been associated with many types of human cancers, as well as with male sterility (Hancks and Kazazian 2016; Yang and Wang 2016; Mendez-Dorantes and Burns 2023). Thus, TEs are likely to have exerted a selective pressure not only in the distant past but also during the more recent evolutionary history of human populations. A continuum in selective pressure acting on different timescales might therefore be expected, although distinct waves of TE expansion may exert pressure on distinct genes. In this study, we integrated molecular evolution analysis, population genetics, and structural biology to delve into the evolutionary history of TE control systems in primates and in human populations.
Results
Positive Selection Drives the Evolution of Several TE Control Genes
We sought to test whether genes encoding products that suppress TE mobilization have been targets of positive selection during primate evolution. We thus assembled a list of sixty coding genes involved in TE control (Fig. 1a, Table S1). We first sought to determine whether the protein products of these genes are engaged in a genetic conflict with TEs. If this were the case, evidence of positive selection might be expected as a signature of fast evolution in response to TE challenge. We thus focused on primates and retrieved sequence information of the coding sequences from public databases, with the aim to analyze their evolutionary patterns (Table S1). To test the hypothesis that positive selection has been driving the evolution of TE control genes, we applied likelihood ratio tests (LRTs) implemented in the PAML suite (Yang 1997; Yang 2007). All genes were screened for recombination and split into different regions if necessary. The neutral models (M7 and M8a) were rejected in favor of the positive selection model (M8) for 14 genes, corresponding to a considerably high fraction of 23.3% (Fig. 1a and b, Table S2). The positively selected genes included three Piwi proteins and other molecules that participate in the piRNA pathway (SPOCD1 and TEX15), as well as components of the HUSH and NuRD complexes. In line with previous findings, we detected evidence of positive selection in MORC2 (Lasserre et al. 2023) and in the N-terminal region of DNMT3A (Molaro et al. 2020) (Fig. 1b).
Evolution of TE control systems in primates. a) Protein-protein interaction network predicted for the 60 analyzed proteins. The network was generated using STRING (https://string-db.org/), which covers both physical interactions and functional associations between proteins. Proteins under positive selection in primates are colored in red. NuRD and HUSH complex components are shown in green and blue, respectively. b) Domain structures of the 14 proteins under positive selection in the primate phylogeny. Schematic domain structures of human proteins are drawn to scale. The protein domains were obtained from the InterPro website (https://www.ebi.ac.uk/interpro/) (Blum et al. 2025), with the exception of the Piwi proteins. The domains of PIWIL2 were obtained from Li et al. (2024), while the domains of PIWIL3 and PIWIL4 were defined by sequence homology from PIWIL2. IDRs identified by the Metapredict tool based on human proteins are shown in blue. Positively selected sites are shown below each domain structure as arrowheads.
We were also interested in identifying the positively selected sites in these genes. Using a conservative approach (see methods), a total of 149 selected sites were identified (Table S2). The average fraction of selected sites per protein resulted equal to 0.88%, with the highest proportion being observed for SPOCD1 (3.45%). In general, components of the piRNA pathway were found to display a high number of selection signals (Fig. 1b). Among these, we detected three sites that were independent targets of selection in the PAZ (PIWI-Argonaute-Zwille) domains of the two paralogous genes PIWIL3 (I310 and R311, residue numbering according to the human protein) and PIWIL4 (Q282). These sites define the same structural patch in the 3D structures of the corresponding Piwi proteins (Fig. 2a), suggesting that changes in this region are functionally relevant and adaptive. Although a number of positively selected sites were found to be located in the MID and PAZ domains, none of them is involved in the binding of guide RNAs (Fig. 2a).
a) Cartoon representation of the 3D structure of human PIWIL2 in complex with piRNA (PDB ID: 7yfx). Protein domains are color-coded as reported in the legend. piRNA is in dark gray. Residues involved in piRNA recognition and the catalytic tetrad are represented as balls and sticks, in orange and in magenta, respectively. Below is reported the structural alignment of the human PIWIL3 (dark gray) AlphaFold model (AlphaFold DB: AF-Q7Z3Z3-F1-v4) with the human PIWIL4 (light gray) AlphaFold model (AlphaFold DB:AF-Q7Z3Z4-F1-v4) with the same spatial orientation. Positively selected sites are reported as ball and sticks, and color-coded according to the domain they map to (see legend). Positively selected sites in PIWIL3 (red) and PIWIL4 (magenta) falling in the same structural region are shown in the enlargement. b) Structural model of the human PIWIL4-GTSF1-MAEL-piRNA-target duplex-TDRD9 complex (prediction score = 0.726). PIWIL4 positively selected sites are in red. The positively selected residue (R645) that makes contact with GTSF1 is reported in the enlargement.
Given the recent resolution of PIWI* complex assemblies—which comprise a PIWI protein, a piRNA-target duplex, a GTSF protein and the Maelstrom protein—and that serves as molecular hub for downstream effectors, we reconstruct by structural modeling the human PIWIL2/3/4* complexes with TDRD9, which was found to be a general interactor of PIWI* complexes (De et al. 2025; Portell-Montserrat et al. 2025).
Only for PIWIL2* and PIWIL4*—TDRD9 we obtained models reaching confident prediction scores (>0.7). Whereas the positively selected residues detected in PIWIL2 lie in a disordered region that is not involved in interaction with assembly components, the R645 site of PIWIL4, which was identified as positively selected in the PIWI domain, maps to the interaction surface with GTSF1 (Fig. 2b and Fig. S1). In particular, in the model, the R645 residue makes a polar interaction with S96 and W98 of GTSF1. W98 is a conserved aromatic residue that, together with the other two conserved tryptophan residues, was previously suggested to be responsible for the interaction of GTSF1 with PIWI proteins (Arif et al. 2022).
These observations indicate that positive selection may have acted to modulate inter-protein interactions within the piRNA pathway.
Positive Selection Signals in TE Control Genes are Enriched in Intrinsically Disordered Regions
The proteins encoded by these TE control genes have different domain architectures and, because several of them bind nucleic acids, they are expected to contain IDRs) (Liu et al. 2006; Zhao et al. 2021; Holehouse and Kragelund 2024). We thus sought to test this expectation and to determine whether, as shown for other proteins, IDRs represented preferential targets of positive selection (Cagliani et al. 2024a, 2024b). We thus predicted their location in the reference human proteome using Metapredict V2 (Emenecker et al. 2021, 2022). We found that IDRs are significantly more common in TE control proteins compared to the proteome average (IDR fraction in TE control proteins = 0.43, average IDR fraction in the human proteome = 0.29, Wilcoxon rank sum test P-value = 1.7 × 10^−5^), and some of them are very long. By analyzing the location of positively selected sites, we observed that a large number of them map to IDRs (Fig. 1b, Table S2). To determine whether this represents a statistically significant enrichment, we used a binomial test that accounts for the fraction of sites embedded in IDRs. We found that the proportion of sites in IDRs is indeed significantly higher than expected by chance (P-value = 1.40 × 10^−12^).
To test whether these results were robust to a different IDR prediction method, we obtained consensus disorder annotations provided by MobiDB (Piovesan et al. 2025) for all genes we analyzed. We found a very good correlation (Pearson's correlation coefficient = 0.855, P = 3.6 × 10^−18^) between the disordered fractions from Metapredict and MobiDB, with no evidence that either method over- or under-predicts the IDR fraction (Fig. S2). Consistently, the proportion of selected sites in IDRs obtained from MobiDB resulted to be significantly enriched (P = 5.92 × 10^−8^).
Sequence Patterns and Ensemble Features Differ in IDRs Depending on the Selective Regime
We next hypothesized that the conformational properties of the IDRs (e.g. the range of conformations that are accessible to an IDR or its chain compactness) may impact selective constraint or be influenced by the positively selected residues. Although the structural properties of IDRs cannot be predicted, some ensemble properties are quantifiable and provide information on three-dimensional features and IDR function (Tesei et al. 2024). These include two length-independent parameters: the conformational entropy per residue (Sconf/N) and the Flory scaling exponent (ν), a measure of chain compactness. These two features are nonindependent, as compact IDRs tend to have low Sconf/N (Tesei et al. 2024). We thus used a predictor based on support vector regression (SVR) models to calculate Sconf/N and ν for all orthologous IDRs in the primate TE control proteins (Tesei et al. 2024) (Table S3). To assess whether the conformational properties influence selective constraints, for all IDRs, we also calculated the fraction of sites targeted by purifying selection. No correlation was observed between the fraction of negatively selected sites and either Sconf/N or ν (Kendall's correlation coefficients: 0.081 and 0.027, both P-values > 0.05). This indicates that the selective constraint acting on IDRs is not related to their ensemble conformational features. We next compared ensemble features between IDRs that were or were not targeted by positive selection (i.e. that display at least one positively selected site or none). Interestingly, we found that positively selected IDRs have significantly lower Sconf/N and tend to be more compact, although the difference in ν was not statistically significant (Fig. 3a).
Conformational features across primate orthologous IDRs. Violin wrapping box and whiskers plots of conformational features calculated for all IDRs in TE control proteins. Plots show the mean values a) or the standard deviation b) among orthologs and colors represent IDRs that are (red) or are not (gray) targeted by positive selection. Wilcoxon rank sum test P-values are also reported.
Previous studies showed that the amino acid composition and patterning of residues are important determinants of sequence-ensemble relationships in IDRs (Das et al. 2015; Holehouse et al. 2017; Sherry et al. 2017; Zarin et al. 2017; Beveridge et al. 2019; Zarin et al. 2019). We thus used the SVR predictor to calculate the following parameters: sequence hydropathy decoration (SHD, a measure of the patterning of hydrophobic residues) (Zheng et al. 2020), fraction of charged residues (FCR), and net charge per residue (NCPR) (Fig. 3a, Table S3). We found that positively selected IDRs have significantly higher SHD compared to the nonselected ones. They also tend to have fewer charged residues and more negative charge, although the differences are not significant (Fig. 3a). Overall, this suggests that compaction and low entropy in selected IDRs are due to the patterning of hydrophobic/aromatic residues.
To investigate how positive selection may impact IDR compositional patterns and ensemble properties, we calculated the standard deviations among primate orthologs of conformational parameters (Sconf/N and ν) and sequence features (SHD, FCR, and NCPR). Using this approach, we obtained a single measure of feature conservation for each IDR. A comparison between positively selected and nonpositively selected IDRs indicated that the former have more variability in both ensemble features and sequence parameters than the latter, although statistical significance was only reached for SHD and NCPR (Fig. 3b). These results suggest that positive selection modulates IDR properties by introducing changes in sequence patterns and ensemble features properties, which are in turn related to function (Tesei et al. 2024).
TE Control Genes as Targets of Positive Selection During the Recent Evolution of Human Populations
We next hypothesized that selection acting on TE control genes may have been widespread in recent human evolutionary history. Accordingly, we examined deviations from neutrality across 54 globally diverse populations represented in the Human Genome Diversity Project (HGDP) (Bergström et al. 2020) (Table S4, Fig. 4a). In order to assess whether TE-control genes bear signatures of selection, we compared nucleotide diversity (π) across windows overlapping candidate genes against a randomly sampled subset of background genes. π reflects the average number of nucleotide differences per site between two sequences and is sensitive to both historical and recent selective pressures. Our analysis revealed a significantly lower median π in candidate genes compared to background genes (Wilcoxon rank-sum test, P-value = 0.00538) (Fig. 4b). This indicates that, across the genome, TE-control genes tend to exhibit reduced genetic diversity, a pattern consistent with the action of purifying or recurrent positive selection.
Human population genetics of TE control genes. a) Global map showing the sampling locations of the Human Genome Diversity Project populations, with the number of individuals per population indicated in parentheses. b) Nucleotide diversity (π) comparison of values for TE control genes and background genes using 1,000 subsampling. c) Distribution of TE control genes with extremely low summary statistics values across populations. The y axis shows genes of interest, while the x axis indicates the number of populations where each gene exhibits extremely low values (below the 5% threshold) for Tajima's D or Fay and Wu's H. Bars are color-coded by statistic, with transparency reduced for counts below the threshold of 10 populations (indicated by the dashed red line marks). Gene names in bold reflect those with extreme values for both D and H statistics, in at least 10 populations.
We next examined the distribution of summary statistics (Tajima's D and Fay & Wu's H) (Tajima 1989; Fay and Wu 2000) for candidate genes across the 54 populations. Tajima's D and Fay and Wu's H assess evidence of long-term, nonrecent positive selection. Specifically, extreme values of Tajima's D identify deviations from genetic neutrality, while low values of Fay and Wu's H capture signals of high-frequency derived alleles (Nielsen et al. 2007; Ramos-Onsins et al. 2023). To account for the effect of demographic effects (Voight et al. 2006), for each test, we examined the distribution of genes with extremely ranked values (low 5%) of both statistical tests across the populations. We identified 22 and 17 genes with extremely low values of D and H, respectively, in at least 10 human populations (Fig. 4c, Table S5). Notably, 13 genes exhibited concurrent extreme values for both D and H tests in 10 or more populations, compatible with signals of positive selection of these genes across a diverse group of populations (Fig. 4c).
Discussion
TE mobilization poses a significant fitness challenge to host genomes. As a consequence, a variety of systems have emerged to curb and silence TE activity. Just like TEs, such systems are widespread across the tree of life, and their evolution is expected to be shaped by recurrent arms races, as new TE families emerge and start amplifying. In addition to TEs, viruses can also exert a selective pressure on some of these genes, as the pathways that control retrotransposition are often also involved in the restriction of exogenous viruses (Bobadilla Ugarte et al. 2023). For instance, the HUSH complex is a crucial repressor of HIV, and it is counteracted by the Vpr and Vpx viral proteins (Chougui et al. 2018; Yurkovetskiy et al. 2018).
Here, we explored the evolution of proteins that control TEs over different timescales. Our data indicate that, in primates, a considerable fraction of TE control genes evolved under positive selection, an observation consistent with a genetic conflict scenario (Sironi et al. 2015). In particular, proteins that participate in the piRNA pathway were found to have experienced very strong selection, resulting in the identification of many sites showing fast amino acid replacement. These proteins play a central role in the silencing of retrotransposon activity in the germline, where epigenetic reprogramming results in the derepression of TEs (Almeida et al. 2022). The germline also represents the stage where the genomic conflict plays out, as new TE insertions can be transmitted through vertical inheritance only if they occur in the germline or before its specification (i.e. in the early stages of embryonic development). Indeed, the expression of many TEs seems to be restricted to various stages of gametogenesis or early embryogenesis, whereas activity in somatic tissues holds no apparent benefit to TEs (Chuong et al. 2017). That primates (like most cellular organisms) have evolved additional layers of TE control in the germline, and that these layers experience the strongest selective pressure, are thus in line with expectations under an arms race scenario. Among PIWI proteins, we found that PIWIL2, PIWIL3, and PIWIL4, but not PIWIL1, represent selection targets. Despite sequence and structural similarities, the four PIWI proteins display distinctive functional features and expression patterns. Data from rodent models indicated that PIWIL1, PIWIL2, and PIWIL4 are expressed during male germ cell development and all three genes are essential for spermatogenesis (Iwasaki et al. 2015). PIWIL2 and PIWIL4 mainly function in transposon silencing in embryonic germ cells by associating with fetal piRNAs (Aravin et al. 2008; De Fazio et al. 2011). PIWIL1 instead has a major function in post-natal germ cells, where it associates with pachytene piRNAs and uses its slicing activity to silence TEs (Reuter et al. 2011; Wei et al. 2023). PIWIL1 and PIWIL3 are also expressed in oocytes and, in a hamster model, deletion of both genes determines reduced fertility, although only PIWIL1 contributes to TE repression (Hasuwa et al. 2021; Zhang et al. 2021). It is thus somewhat surprising that PIWIL1 did not show evidence of positive selection. PIWIL1 is known to load pachytene piRNAs, which are highly abundant in the adult mouse testis and largely devoid of TE sequences (Ozata et al. 2019). The precise function(s) of pachytene piRNAs are still a matter of debate. Previous works suggested that they regulate protein-coding mRNAs by promoting their deadenylation and decay (Gou et al. 2014) and that, through interaction with other proteins, the piRNA/PIWIL1 activates the translation of spermiogenic mRNAs (Dai et al. 2019). However, these pieces of evidence were not supported by more recent findings (Wu et al. 2020; Choi et al. 2021; Cecchini et al. 2026). Instead, Cecchini and coworkers showed that a small minority of pachytene piRNAs direct the endonucleolytic cleavage of few partially complementary targets, a function essential for spermatogenesis (Cecchini et al. 2026). Whereas further analyses will be required to elucidate the role of pachytene piRNAs and PIWIL1 in male fertility, it is possible that functional constraints or the necessity to interact with protein partners limit the ability of PIWIL1 to evolve in response to TE-derived selective pressures.
We should also mention that positive selection is not always determined by genetic conflicts (with either endogenous or exogenous elements), but may derive from other forces. For instance, selfish spermatogonial selection (which favors variants that increase the rate of cell division or decrease apoptosis in a given germline cell) was suggested to account for the positive selection signals in some testis-expressed genes (Nielsen et al. 2005).
Recent TE activity has also occurred in humans and a few elements, all of them retrotransposons, are still mobile in our genomes, as exemplified by polymorphic insertions (Beck et al. 2010; Osmanski et al. 2023). For instance, SVAs show evidence of relatively high recent activity (Osmanski et al. 2023), and about one hundred transposition-competent LINE1s are present in our genomes (Beck et al. 2010). We thus screened the TE control genes for evidence of positive selection in human populations, which may result from the ongoing pressure exerted by mobile elements. We detected several signals of strong positive selection across populations. In line with findings in primates, we identified as selection targets three genes (TEX15, GTSF1, and GTSF1L) that participate in the piRNA pathway (Almeida et al. 2022). In rodents, GTSF1 and GTSF1L are expressed in the male germline, where they associate with PIWI proteins to potentiate target cleavage (Arif et al. 2022). Deletion of Gtsf1 results in male sterility, whereas knockout of Gtsf1l does not impair spermatogenesis or retrotransposon repression in mice, although more subtle phenotypic changes cannot be excluded (Takemoto et al. 2016; Yoshimura et al. 2018). Whereas this may suggest functional redundancy, more recent evidence indicates that GTSF1 and GTSF1L (as well as a third paralog, GTSF2) may have specialized roles in enhancing target cleavage by specific PIWI proteins (Arif et al. 2022). If a similar specialization has occurred in primates, the signatures of selection we detected may reflect fine-tuning of protein–protein interactions or expression patterns to match those of distinct PIWI proteins. Alternatively, these genes may have acquired divergent functions or expression profiles in primates. Experimental studies will be required to distinguish between these possibilities. The third component of the piRNA pathway, TEX15, is also associated with male sterility in both mice and humans (Yang et al. 2008; He et al. 2013). In the male germline, TEX15 interacts with PIWIL1 and PIWIL2 and is required for piRNA-directed de novo DNA methylation (Schöpp et al. 2020; Yang et al. 2020). TEX15 has been a target of positive selection across different evolutionary timescales—during primate evolution and more recently in human populations. It will be particularly informative to determine, through population-level analyses, whether selection acting on TEX15 or other components of the piRNA pathway is linked to male reproductive fitness.
With the exclusion of TEX15, however, the genes that experienced the strongest selective pressure during the recent evolution of human populations did not overlap with those undergoing fast evolution in primates. Interestingly, four of the 13 genes with strong evidence of selection encode components of the NuRD complex. In the context of TE suppression, this complex interacts with KAP1 and mediates de novo heterochromatin formation, thus transcriptionally suppressing LTR retrotransposons (Rowe et al. 2010). However, NuRD plays additional functions in the cell and its activity is essential for the regulation of gene expression, cell fate specification, and cell cycle progression (Hoffmann and Spengler 2019). As a consequence, deregulation of NuRD has been associated with cancer and neurodevelopmental disorders (Lai and Wade 2011; Hoffmann and Spengler 2019). Moreover, NuRD is the target of several infectious viruses, most notably herpesviruses (Savaryn et al. 2013; Salamun et al. 2019; Naik et al. 2020), that hijack transcriptional regulation. It is thus possible that the selective pressures acting in human populations on NuRD components are not directly linked to its TE-silencing activity. It is also worth noting that variants in some of the genes (MTA1, RBBP4, AGO1, SAFB2, and TET1) showing strong evidence of positive selection in human populations are associated to anthropometric traits (waist-to-hip ratio adjusted for BMI, height) (https://www.ebi.ac.uk/gwas/home). In turn, such traits were shown to carry evidence of polygenic adaptation in humans (Turchin et al. 2012; Robinson et al. 2015; Polimanti et al. 2016; Guo et al. 2018), raising the possibility that the signals we detected are related to the pleiotropic functions of these genes.
Given that their biological functions often involve nucleic acid binding, the proteins encoded by the genes we analyzed are particularly rich in IDRs (Liu et al. 2006; Zhao et al. 2021), which pose challenges in their analysis: they are, in general, poorly conserved and, because they lack stable secondary or tertiary structures, the biological and biochemical functions of IDRs are difficult to predict (Holehouse and Kragelund 2024). As a consequence, IDRs and their evolutionary trajectories remain under-investigated (Tesei et al. 2024). We found that signals of positive selection are significantly enriched within the disordered portions of proteins involved in TE control. Whereas this finding is in line with a previous study of human IDRs, it opens the question of how changes in IDRs can promote adaptation and which biological functions are affected by such changes (Afanasyeva et al. 2018). We thus applied recently developed methods to investigate IDR sequence-ensemble properties in relation to evolutionary parameters. As growing evidence indicates that chain compaction and conformational entropy are relevant descriptors of IDR function, we first asked whether these parameters correlate with the level of selective constraint, measured as the fraction of sites experiencing purifying selection. We found no evidence of correlation, which was somewhat unexpected because previous data showed that human pathogenic variants are more likely to be located in IDRs with low conformational entropy compared to benign variants (Tesei et al. 2024), suggesting that such IDRs are less tolerant to change. It is however possible that human IDRs are heterogeneous with respect to the relationship between tolerance and entropy, which may depend on protein function. Analysis of positive selection signals, however, showed that they are preferentially located in regions of low conformational entropy. We thus hypothesized that, by altering the sequence composition or patterning of the IDRs, the positively selected sites might modulate the conformational ensemble features and, hence, IDR functions. By calculating across-orthologs variance, we verified this hypothesis, although statistical significance was often borderline, most likely due to the small number of comparisons. Experimental investigation or molecular dynamic simulations will be required to determine the effect of positive selection on IDR function in these proteins. Nonetheless, our data show that the evolutionary trajectories of IDRs may be linked to their conformational characteristics, exactly in the same way as structured domains often evolve by changes that affect the solvent-exposed surface rather than the inner hydrophobic cores. This conclusion is reminiscent of findings on IDRs in proteins that derived from retrotransposon domestication, suggesting that it may represent a general feature of IDR evolution in mammals (Cagliani et al. 2024b). Unfortunately, the specific functions of IDRs in these TE control proteins are mostly unknown and, because they cannot be investigated in classical structural analyses (IDRs typically fail to be solved in crystals), their relative contribution to the formation of protein–RNA complexes is unexplored. In general, IDRs are known to often mediate multivalent inter-molecular interactions (Holehouse and Kragelund 2024). Thus, an interesting possibility is that selected changes in IDRs modulate the interactions among protein–protein and protein–RNA complexes, eventually contributing to the regulation of silencing activities. IDRs are also known to represent common targets of post-translational modifications (PTMs) and to often host eukaryotic linear motifs (ELMs, short linear sequences important for protein regulation) (Holehouse and Kragelund 2024). Both PTMs and ELMs confer an additional layer of protein regulation and may be affected by the selected sites. Experimental analyses using site-directed mutagenesis of the positively selected sites will be required to shed light on their role in protein regulation, complex formation, and TE silencing.
In summary, we performed an evolutionary analysis of TE control systems across different timescales. Our working hypothesis was that the evolution of such systems has been dynamic, as a result of genetic conflicts with TEs. In line with this scenario, we show that positive selection drove the evolution of TE control proteins in primates and human populations. This dual-timescale approach reveals that different genes and pathways, such as the piRNA pathway in primates and the NuRD complex in humans, have been under distinct selective pressures over time. This provides a more complete and dynamic picture of the genomic conflict with transposable elements. Thus, our data add up to similar observations in other organisms, including Drosophila and fish, indicating that TE control proteins are fast evolving (Obbard et al. 2009; Lee and Langley 2012; Simkin et al. 2013; Yi et al. 2014; Parhad and Theurkauf 2019). As noted elsewhere, though, a major open question is to determine the mechanistic details driving the conflict (Blumenstiel et al. 2016). TEs are not known to encode antagonists of proteins involved in their repression and the capacity of TEs to evolve evasion mechanisms of the silencing machinery may be limited (Blumenstiel et al. 2016). TE diversity has been regarded as a driver of adaptation of the piRNA pathway (Lee and Langley 2012; Yi et al. 2014). However, an analysis in Drosophila showed that the rates of evolution in species with high TE burden and diversity are slower than in species with more abundant and diverse TEs (Castillo et al. 2011). It was thus suggested that evolution of the piRNA pathway (and possibly other proteins involved in TE control) results from a trade-off between genome defense and the costs of off-target gene silencing—that is a balance between sensitivity and specificity (Blumenstiel et al. 2016). To test this hypothesis, it would be useful to determine whether amino acid changes at the positively selected sites modulate sensitivity or off-target effects. In this respect, it is also worth mentioning that, whereas in the analysis of primate genes we focused on protein sequence variation, the selection signatures that we detected in human populations are likely to be primarily driven by regulatory changes (Grossman et al. 2013). Again based on the Drosophila model, it was previously suggested that adaptation to high TE burden can also be obtained by increased expression of the piRNA machinery (Castillo et al. 2011). These observations might also explain why we detected limited overlap in the gene sets that were selected in primates and during the evolution of human populations. Possibly, different adaptive mechanisms drive the evolution of distinct genes or the response to specific TE lineages. It will thus be extremely important to determine whether the selected variants modulate the level or timing of expression of TE control genes in humans. Also, the fine mapping of the selection signals in humans will provide a list of candidate variants for functional analyses.
Materials and Methods
Gene Selection
The list of 60 coding genes involved in TE control was derived from previous reviews and individual studies (Aktaş et al. 2017; Almeida et al. 2022; Wang et al. 2023; Zhao et al. 2023; Ilık et al. 2024) (Table S1). KZFPs were not included because (i) they were previously analyzed and (ii) their fast evolution in terms of gene copies makes it difficult to reconstruct orthology relationships (Thomas and Schneider 2011; Jacobs et al. 2014; Kosuge et al. 2024).
Primate Sequence Retrieval and Positive Selection Analysis
We obtained primate orthologous coding sequences from the NCBI database (http://www.ncbi.nlm.gov). Specifically, we exploited data from the NCBI's Eukaryotic Genome Annotation Pipeline to derive primate one-to-one orthologs (NCBI Orthologs) (Oh et al. 2025). Primate genes carrying premature stop codons or with a low sequence coverage were excluded. A list of primate species analyzed for each gene is reported in Table S1. To evaluate the quality and the completeness of the analyzed genomes, different assembly statistics were retrieved from NCBI, confirming an overall high quality (Table S1).
The RevTrans 2.0 utility was used to generate multiple sequence alignments (MSAs) using MAFFT v6.240 as an aligner (Wernersson 2003; Katoh and Standley 2013). The resulting alignments were manually trimmed and adjusted to exclude segments with gaps in most species and to include only well-aligning regions. MSAs were analyzed for the presence of recombination signals using Genetic Algorithm Recombination Detection (GARD, parameters: general discrete model, three rate classes) (Pond et al. 2006). GARD is a genetic algorithm based on phylogenetic incongruence among alignment segments, which detects the best-fit number and location of recombination breakpoints. When evidence of recombination was detected, the coding alignment was split based on the recombination breakpoints and sub-regions (of at least 50 nucleotides) were used as input for subsequent molecular evolution analyses. Recombination breakpoints were identified for MOV10L1, PIWIL3, and TNRC18 alignments (Table S2).
Phylogenetic trees were reconstructed using the phyML program (version 3.1) with a General Time Reversible (GTR) model plus gamma-distributed rates and four substitution rate categories with a fixed proportion of invariable sites (Guindon et al. 2009). The MSAs and phyML trees were used as inputs to perform evolutionary analysis. To detect positive selection, the codon-based codeml program implemented in the PAML (Phylogenetic Analysis by Maximum Likelihood) suite was applied (Yang 2007), using the F3x4 codon frequencies model (codon frequencies estimated from the nucleotide frequencies in the data at each codon site) (Yang 1997; Yang 2007). A model (M8, positive selection model) that allows a class of sites to evolve with dN/dS > 1 was compared to two models (M7 and M8a, neutral models) that do not allow dN/dS > 1. To assess statistical significance, twice the difference of the likelihood (ΔlnL) for each model (M8a vs. M8 and M7 vs. M8) is compared to a χ^2^ distribution (1 and 2 degrees of freedom for M8a vs. M8 and M7 vs. M8 comparisons, respectively). To be conservative and to obtain robust results, we called a gene as a target of positive selection only if it was detected by both M7 versus M8 and M8a versus M8 comparisons.
In order to identify specific sites subject to positive selection, we applied four different methods: (i) the Bayes Empirical Bayes (BEB) analysis (with a cutoff of 0.90), which calculates the posterior probability that each codon is from the positive selection site class (under M8 model) (Anisimova et al. 2002); (ii) Fast Unbiased Bayesian AppRoximation (FUBAR) (with a cutoff of 0.90), an approximate hierarchical Bayesian method that generates an unconstrained distribution of selection parameters to estimate the posterior probability of positive diversifying selection at each site in a given alignment (Murrell et al. 2013); (iii) Mixed Effects Model of Evolution (MEME) (with a P-value cutoff < 0.1), which allows the distribution of dN/dS to vary from site to site and from branch to branch at a site (Murrell et al. 2012); (iv) Fixed Effects Likelihood (FEL) (with a P-value cutoff < 0.1), a maximum-likelihood (ML) approach to infer dN/dS on a per-site basis, assuming that the selection pressure for each site is constant along the entire phylogeny (Kosakovsky Pond and Frost 2005). To be conservative and to limit false positives, only sites detected using at least two methods were considered as positive selection targets.
The single-likelihood ancestor counting (SLAC) method was applied to identify sites under negative/purifying selection (Kosakovsky Pond and Frost 2005). GARD, FUBAR, MEME, FEL, and SLAC analyses were run locally through the HyPhy suite V2.5.29 (Pond et al. 2005).
Analysis and Prediction of Protein Three-Dimensional Structures
The structure of PIWIL2 in complex with piRNA was derived from PDB (https://www.wwpdb.org/, ID:7yfx), whereas the structural models of human PIWIL3 and PIWIL4 were obtained from the AlphaFold database. TM-align (https://zhanggroup.org/TM-align/) (Zhang et al. 2005) was run for pairwise structure alignments.
AlphaFold3 (Abramsom et al. 2024), through the AlphaFold Server (https://alphafoldserver.com/), was used to model the structure of the human PIWIL4-GTSF1-MAEL-piRNA-target RNA-TDRD9 complex. 0.8×ipTM + 0.2×pTM > 0.7 was applied as a confidence metric for quality estimation (Shor et al. 2024).
3D structures were rendered using PyMOL (The PyMOL Molecular Graphics System, Version 1.8.4.0; Schrödinger, LLC).
Prediction of Intrinsically Disordered Regions
The sequences of reviewed (Swiss-Prot) canonical proteins (20,420) of the reference human proteome (UP000005640) were retrieved from Uniprot (https://www.uniprot.org/). Intrinsically disordered regions (IDRs) were estimated using the Metapredict V2 tool (Emenecker et al. 2021, 2022), that defines which residues from a protein sequence are disordered by applying a deep-learning algorithm based on a consensus score calculated from eight different predictors (Emenecker et al. 2021). Metapredict V2 was run using default parameters and consecutive disordered stretches equal to or longer than 30 residues were labeled as IDRs. For primate genes, IDRs were kept only if the same orthologous region was predicted in at least 50% of the species analyzed. Disorder annotations were also derived from MobiDB (Piovesan et al. 2025). This database aggregates disorder information from literature, experimental data, and predictions for all known proteins recorded in Uniprot. The consensus disorder annotations were used in our analysis.
Analysis of IDR Conformational Properties and Sequence Patterns
The conformational entropy per residue (Sconf/N) and the Flory scaling exponent (ν) were calculated for all analyzed IDRs. Sconf/N is a measure of the landscape of different structures accessible to an IDR. ν derives from the scaling laws of polymers that describe how chain dimensions vary as a function of chain length (Flory and Volkenstein 1969). Both parameters were estimated using a Colab notebook (https://colab.research.google.com/github/KULL-Centre/_2023_Tesei_IDRome/blob/main/IDR_SVR_predictor.ipynb) (Tesei and Lindorff-Larsen 2022; Tesei et al. 2024), which uses a support vector regression model trained on simulations performed using the CALVADOS model (Tesei et al. 2021; Tesei and Lindorff-Larsen 2022). The same SVR predictor was used to derive other sequence measures: SHD (sequence hydropathy decoration) (Zheng et al. 2020), FCR (the fraction of charged residue), and NCPR (net charge per residue).
Population Genetics Analyses
We conducted our analysis on a dataset of variant call format (VCF) files comprising genetic variation data of autosomal chromosomes from 54 distinct human populations and 828 individuals from the Human Genome Diversity Panel (Bergström et al. 2020) (Table S4). VCF files were further indexed and filtered for minimum quality parameters (-minQ/-minMapQ < 20) using BCFtools and SAMtools (Li 2011; Danecek et al. 2021).
To assess potential signals of selection, we compared nucleotide diversity (π) between candidate TE-control genes and background genes. The latter comprised 18,161 annotated protein-coding genes in the human reference genome (GRCh38), providing a genome-wide baseline for comparison. This approach allowed us to evaluate whether TE-control genes exhibit distinct patterns of genetic diversity relative to the broader set of coding regions. Genetic diversity values were calculated per genomic window. We then performed a Wilcoxon rank-sum test on the median π values between candidate genes and a randomly sampled subset of equal size from the background genes, repeated over 1,000 iterations to ensure robustness. For each population, we also calculated two summary statistics of genetic diversity: Tajima's D and Fay's H. Tajima's D, Fay and Wu's H and nucleotide diversity (π) were all calculated in sliding windows, with a window size of 50 kilobases (kb) with a step size of 10 kb, using the software ANGSD (Korneliussen et al. 2014). We ranked the values for each summary statistics, and each population separately, then assigned the minimum rank value to each gene, considering all the windows encompassing its genomic coordinates.
We found no significant difference in lengths between candidate genes and the rest of the human genes. To test for signals of selection, we ranked the calculated summary statistics within each population. For each gene, we identified windows that fell into the extreme 5% of ranked values for Tajima's D and Fay and Wu's H (Table S5). We then tagged the TE control genes that fell in those lower values and visualized the frequency with which those genes are found across the multiple populations. The analysis was conducted using RStudio and R version 4.2.2 and utilized the ggplot2 package (Wickham 2016) for visualization.
Statistical Analysis
To evaluate whether positively selected sites are enriched in IDRs, we applied a binomial test, using the counts of sites falling in IDRs as the number of successes, the number of total selected sites as trials, and the ratio between all IDR lengths divided by the length of all analyzed regions as the probability of success. Wilcoxon rank sum and Kendall's correlation tests were performed in the R v.4.0.5 environment.
Supplementary Material
evag059_Supplementary_Data
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abramson J, et al Accurate structure prediction of biomolecular interactions with Alpha Fold 3. Nature. 2024:630:493–500. 10.1038/s 41586-024-07487-w.38718835 PMC 11168924 · doi ↗ · pubmed ↗
- 2Afanasyeva A, Bockwoldt M, Cooney CR, Heiland I, Gossmann TI. Human long intrinsically disordered protein regions are frequent targets of positive selection. Genome Res. 2018:28:975–982. 10.1101/gr.232645.117.29858274 PMC 6028134 · doi ↗ · pubmed ↗
- 3Aktaş T, et al DHX 9 suppresses RNA processing defects originating from the Alu invasion of the human genome. Nature. 2017:544:115–119. 10.1038/nature 21715.28355180 · doi ↗ · pubmed ↗
- 4Almeida MV, Vernaz G, Putman ALK, Miska EA. Taming transposable elements in vertebrates: from epigenetic silencing to domestication. Trends Genet. 2022:38:529–553. 10.1016/j.tig.2022.02.009.35307201 · doi ↗ · pubmed ↗
- 5Almojil D, et al The structural, functional and evolutionary impact of transposable elements in eukaryotes. Genes (Basel). 2021:12:918. 10.3390/genes 12060918.34203645 PMC 8232201 · doi ↗ · pubmed ↗
- 6Anisimova M, Bielawski JP, Yang Z. Accuracy and power of Bayes prediction of amino acid sites under positive selection. Mol Biol Evol. 2002:19:950–958. 10.1093/oxfordjournals.molbev.a 004152.12032251 · doi ↗ · pubmed ↗
- 7Aravin AA, et al A pi RNA pathway primed by individual transposons is linked to de novo DNA methylation in mice. Mol Cell. 2008:31:785–799. 10.1016/j.molcel.2008.09.003.18922463 PMC 2730041 · doi ↗ · pubmed ↗
- 8Arif A, et al GTSF 1 accelerates target RNA cleavage by PIWI-clade Argonaute proteins. Nature. 2022:608:618–625. 10.1038/s 41586-022-05009-0.35772669 PMC 9385479 · doi ↗ · pubmed ↗
