Parallel Evolution at the Regulatory Base-Pair Level Contributes to Mammalian Interspecific Differences in Polygenic Traits
Alexander S Okamoto, Terence D Capellini

TL;DR
This study shows that parallel evolution at the DNA level contributes to differences in complex traits like height and blood cell count across mammals.
Contribution
The study demonstrates that parallel evolution at the regulatory base-pair level influences mammalian interspecific variation in polygenic traits.
Findings
Alleles at SNPs linked to human height and RBC count are associated with interspecific variation in these traits across mammals.
Primate body size variation is linked to primate-specific genomic elements, while RBC count involves both ancient and recent regions.
SNP positions are flanked by conserved sequences, maintain synteny, and overlap transcription factor binding sites.
Abstract
Parallel evolution occurs when distinct lineages with similar ancestral states converge on a new phenotype. Parallel evolution has been well documented at the organ, gene pathway, and amino acid sequence level but in theory, it can also occur at individual nucleotides within noncoding regions. To examine the role of parallel evolution in shaping the biology of mammalian complex traits, we used data on single-nucleotide polymorphisms (SNPs) influencing human intraspecific variation to predict trait values in other species for 11 complex traits. We found that the alleles at SNP positions associated with human intraspecific height and red blood cell (RBC) count variation are associated with interspecific variation in the corresponding traits across mammals. These associations hold for deeper branches of mammalian evolution as well as between strains of collaborative cross mice. While…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig. 1
Fig. 2| Mammals | Primates | Primates (unique) | Monophyletic | CC mice | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trait | Species ( | SNP positions ( | Sign | PGLS | Permulations | Species ( | SNP positions ( | Sign | PGLS | Permulations | SNP positions ( | Sign | PGLS | Permulations | Wilcoxon | Spearman |
| Alanine aminotransferase | 115 | 7,439 | + | 0.398 | 0.426 | 35 | 12,499 | + | 0.147 | 0.135 | 5,670 | + | 0.014 | 0.080 | 0.112 | 0.229 |
| Aspartate aminotransferase | 115 | 12,882 | − | 0.663 | 0.448 | 33 | 22,312 | + | 0.395 | 0.297 | 10,569 | + | 0.403 | 0.340 | 0.189 | 0.319 |
| Body size | 236 | 54,451 | + |
| 0.142 | 42 | 87,855 | − | 0.472 | 0.290 | 37,623 | + | 0.015 | 0.057 | 0.046 | 0.008 |
| Calcium | 117 | 6,899 | − |
| 0.072 | 35 | 11,009 | + | 0.187 | 0.389 | 4,647 | + | 0.063 | 0.330 | 0.019 | 0.534 |
| Creatinine | 116 | 16,365 | − |
| 0.049 | 35 | 26,007 | + | 0.452 | 0.357 | 10,891 | − | 0.370 | 0.350 | 0.484 | 0.703 |
| Eosinophil | 119 | 18,877 | + | 0.618 | 0.457 | 34 | 32,844 | − | 0.163 | 0.164 | 15,556 | − | 0.021 | 0.162 | 0.555 | 0.581 |
| Hemoglobin | 120 | 14,394 | − | 0.035 | 0.333 | 36 | 25,420 | + | 0.574 | 0.369 | 12,061 | + | 0.806 | 0.457 | 0.859 | 0.531 |
| Lymphocyte | 121 | 12,904 | − | 0.036 | 0.192 | 35 | 21,808 | − |
| 0.088 | 9,938 | − | 0.010 | 0.012 | 0.081 | 0.221 |
| Red blood cell | 118 | 18,836 | + |
| 0.062 | 36 | 30,397 | + | 0.015 | 0.063 | 12,948 | + |
| 0.005 | 0.001 | 0.013 |
| Urea | 118 | 6,119 | + | 0.915 | 0.487 | 35 | 9,361 | + | 0.002 | 0.061 | 3,684 | + | 0.009 | 0.113 | 0.044 | 0.424 |
| White blood cell | 122 | 15,176 | − | 0.113 | 0.203 | 36 | 27,356 | − | 0.164 | 0.236 | 13,449 | − | 0.100 | 0.205 | 0.439 | 0.953 |
| Trait | RSID | Position (hg38) | Alleles | Alt effect on trait | Wilcoxon | PhyloP | Variant type | Syntenic genes |
| Human/mouse TF motifs |
|---|---|---|---|---|---|---|---|---|---|---|
| BodySize | rs57744001 | chr6:35602129 | C/T | Decreasing | 0.0233 | 5.98319 | intron_variant | TCP11, SCUBE3, DEF6, PPARD, FANCE, RPL10A, TEAD3, TULP1, FKBP5, ARMC12, CLPSL2, CLPS, LHFPL5, SRPK1, SLC26A8, MAPK14 | UHRF1BP1, DEF6, C6orf106, ZNF76, SRPK1, SNRPC, RP1-179N16.6, RPL10A | ANDR, GCR, MLXPL, NR1I3, NR6A1 |
| BodySize | rs9463087 | chr6:45433971 | G/A | Decreasing | 0.0174 | 5.732524 | intron_variant | RUNX2, CLIC5 | RUNX2, SUPT3H | None |
| BodySize | rs1044977 | chr17:45149857 | T/C | Decreasing | 0.000000229 | 5.314619 | synonymous_variant | MEIOC, CCDC43, ADAM11, GJC1, EFTUD2, HIGD1B, CCDC103, GFAP, KIF18B, C1QL1, DCAKD, NMT1, ACBD4, PLCD3, HEXIM1, HEXIM2, FMNL1, MAP3K14, SPATA32, ARHGAP27 | DCAKD, HEXIM2, NMT1, CRHR1-IT1, LRRC37A4P, RP13-890H12.2, RP11-707O23.5, CTD-2020K17.3, DND1P1, ARHGAP27, SPATA32, RP11-798G7.5, FZD2, ACBD4 | None |
| BodySize | rs4636879 | chr15:76337972 | T/C | Decreasing | 0.00419 | 5.137571 | synonymous_variant | UBE2Q2, FBXO22, TMEM266, ETFA, ISL2, SCAPER | ETFA, TSPAN3, PSTPIP1, RP11-797A18.3, SCAPER, UBE2Q2, RP11-797A18.4, RP11-797A18.6, COMMD4 | BHE41, TFE3 |
| BodySize | rs36036508 | chr10:63267858 | A/G | Decreasing | 0.00762 | 4.698048 | intron_variant | ADO, EGR2, NRBF2, JMJD1C, REEP3 | NRBF2, REEP3 | None |
| RBC | rs2535923 | chr14:73058567 | A/T | Decreasing | 0.0405 | 3.875571 | 5_prime_UTR_variant | DCAF4, ZFYVE1, RBM25, PSEN1, PAPLN, NUMB, HEATR4 | DCAF4, RP4-647C14.2, RP11-109N23.6, ACOT4, ACOT1, RBM25 | None |
| RBC | rs78686363 | chr12:120720190 | G/A | Increasing | 0.00539 | 2.015857 | 3_prime_UTR_variant | SIRT4, PLA2G1B, MSI1, TRIAP1, GATC, SRSF9, DYNLL1, COQ5, RNF10, POP5, CABP1, MLEC, UNC119B, ACADS, SPPL3, HNF1A | MLEC, UNC119B | None |
| RBC | rs11248850 | chr16:113609 | G/A | Increasing | 0.0211 | 1.287619 | intron_variant | SNRNP25, RHBDF1, MPG, NPRL3 | NPRL3, DDX11L10, HBZ, HBM, ITFG3, LA16c-OS12.2, POLR3K, AXIN1 | TAL1 |
| RBC | rs143306381 | chr19:12785987 | A/G | Increasing | 0.00209 | 1.045095 | intron_variant | MAN2B1, DHPS, WDR83, WDR83OS, FBXW9, TNPO2, BEST2, HOOK2, JUNB, PRDX2, RNASEH2A, RTBDN, MAST1, KLF1, GCDH, SYCE2, FARSA, CALR, GADD45GIP1, RAD23A, NFIX,LYL1, NACC1, IER2, CACNA1A | DNASE2, GCDH, MAN2B1, CTD-2192J16.21, PRDX2 | PURA |
| RBC | rs12208785 | chr6:109381834 | A/G | Increasing | 0.0201 | 0.897905 | non_coding_transcript_exon_variant | None | SMPD2, FIG4, AKD1, ZBTB24, MICAL1 | None |
- —National Science Foundation Graduate Research Fellowship
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRadio, Podcasts, and Digital Media
Introduction
Parallel evolution occurs when two lineages with a similar ancestral phenotype evolve similar adaptive solutions (Cerca 2023). Parallel evolution has been shown to occur in natural systems and in experimental evolution experiments at the level of organs, gene regulatory networks, and the amino acid sequences of individual genes (Tenaillon et al. 2012; Foote et al. 2015; Marcovitz et al. 2019; Farré et al. 2021; Yuan et al. 2021). These studies show that the molecular underpinnings of parallel phenotypes are most commonly observed when comparing gene pathways but can also occur at specific amino acid positions (Tenaillon et al. 2012). Parallel evolution in amino acid sequences has been shown for numerous phenotypes in mammals, including echolocation (Marcovitz et al. 2019), longevity (Farré et al. 2021), and marine habitat (Foote et al. 2015; Marcovitz et al. 2019; Yuan et al. 2021). Parallel evolution in noncoding regions has been less well documented, but has been shown to influence coat coloration in beach mice (Wooldridge et al. 2022), pelvic spine evolution in sticklebacks (Shapiro et al. 2004), and the evolution of flightlessness in birds (Sackton et al. 2019). These examples have focused on species with similar selective pressures, but in principle, parallel evolution can occur at orthologous nucleotide positions in noncoding genomic regions of divergent species in response to disparate selective pressures toward similar phenotypic changes (Orr 2005b).
While the probability of parallel evolution at any given nucleotide site across the genome is low, complex traits being often highly polygenic—i.e. controlled by thousands of functional genetic variants—therefore, present the potential for many sites to be subject to parallel evolution (Boyle et al. 2017). This is due in part to the findings that the effect sizes of individual variants underlying complex traits are generally low (i.e. they are not large effect loci) and, therefore, such functional sites have the possibility to evolve in parallel without drastic effects on phenotypic variation (unlike protein-coding variants). For example, human height is controlled by over 12,000 independent genomic regions, with nearly all individually impacting phenotypic variation at best minimally, yet how such variants impact size variation in other species (i.e. in parallel) remains unclear (Yengo et al. 2022). In these contexts, selective sweeps may fix the same variant in divergent lineages so long as that regulatory variant has a similar and advantageous effect on phenotype (see supplementary notes 1 and 2, Supplementary Material online).
Though regulatory regions generally evolve more rapidly than protein-coding regions and are not limited by the codon-code, promoters can evolve slowly and hundreds of enhancers are conserved in adult mammalian tissues such as the liver (Villar et al. 2015). This suggests that despite substantial regulatory turnover (Odom et al. 2007), thousands of regulatory elements are conserved when all tissues and development timepoints are considered. Within regulatory elements, transcription factor (TF) binding preferences are highly conserved in bilaterians (Nitta et al. 2015), limiting the potential mutational landscape of regulatory elements at the nucleotide level. This implies that some mutations will have the same phenotypic effect in divergent species. For example, SNP rs12821256 has been shown to alter lymphoid enhancer-binding factor 1 (LEF1) binding at a KIT ligand (KITLG) gene, contributing to a blonde hair phenotype in humans and a lighter fur coloration when this base pair is experimentally altered in mouse (Guenther et al. 2014). Other examples include rs4911178 and rs6060369, which modify hip and knee morphology in humans and predispose individuals to joint disease, and similarly alter the morphology of the joint in single base pair replacement mice (Capellini et al. 2017). Differential binding of TFs also can drive gene expression differences between closely related species and deeply divergent lineages. For example, binding of the TF HNF4A is correlated with gene expression in the liver across mouse species (Wong et al. 2015) and some highly conserved enhancers between zebrafish and mice can drive heterochronic shifts in Hox gene expression in transgenic animals leading to dramatic phenotypic outcomes (Gérard et al. 1997).
The recent explosion in the number of high-quality human and more generally, mammalian genomes and their use in alignment algorithms allows for genome-wide exploration of parallel evolution at the nucleotide level. In this study, we investigated the potential for parallel evolution in the regulation of 11 polygenic traits across mammals using the Zoonoomia alignment of 241 mammalian genomes (Zoonomia Consortium 2020) and common single-nucleotide polymorphism (SNP) positions associated with those traits in Genome-Wide Association Studies (GWAS) in humans (see supplementary note S2, Supplementary Material online). We found that the human alleles at SNP positions associated with height and red blood cell (RBC) count variation are associated with body size and RBC variation between mammalian species. This association holds for deeper branches of mammalian evolution as well as between strains of collaborative cross (CC) mice. Furthermore, we provide evidence that the SNP positions driving this signal are flanked by conserved sequences, are in synteny with target genes, and may impact TF binding.
Results
To evaluate the potential for SNP positions underlying variation in human traits to have evolved in parallel in other mammalian species, we calculated a genotype score for 11 independent complex traits for which we had access to consistent, extensive data across a large number of mammalian clades and could be directly compared across species (see Materials and methods, Table 1). These traits include body size as well as ten blood chemistry traits, such as RBC and urea concentration. For each trait, we calculated a genomic score for >100 species based on their alignment to positions in the human genome significantly associated with variance in that trait by GWAS. Only genomic positions which were likely to be informative were considered (see Materials and methods for detailed filtering steps). Briefly, the genotype score was calculated for each species for each trait as the proportion of a positions matching the human trait increasing allele to those matching either human allele (only biallelic human SNP positions were considered) across the genomic positions associated with that trait. The ranks of these genotype scores were then tested for a positive relationship with the phenotypic rank of each species for that trait using a phylogenetic generalized least-squares (PGLS) regression. Two traits had a significant positive relationship across mammals: body size (P < 1.3e−5; Fig. 1a) and RBC (P < 3.2e−6; Fig. 2a; see Table 1, supplementary figs. S1 to S9, Supplementary Material online). Out of the 11 traits investigated, these two traits have the highest heritability (h^2^) in the UK BioBank data. In order to account for bias introduced by the underlying tree structure of the data, the phenotype ranks were randomly assigned to each node using a Brownian model of evolution 1,000 times to calculate an empirical P-value for each phenotype using the permulation framework (Saputra et al. 2021; see Materials and methods). This resulted in PGLS for both body size (P = 0.142) and RBC (P = 0.074) falling above the standard significance threshold. This suggests that there is a strong phylogenetic component to these associations, i.e. many instances of parallel evolution occurred deeper in the phylogeny rather than between tip species (see below).
PGLS regression of body size rank against the genomic score computed using mammal conserved SNP positions a), primate conserved SNP positions b), or primate unique SNP positions c). In a), squares represent primate species, open circles represent non primate mammals. Silhouettes of the common pipistrelle bat and common minke whale represent small and large mammals, respectively (not to scale). Silhouettes of the gorilla and mouse lemur represent the smallest and largest nonhuman primates, respectively (not to scale). d) Comparison of the absolute magnitude of body size change for internal branches with a positive residual of matches (allele effect direction matches phenotypic change) to mismatches compared to those with a negative residual. Comparison of the distance to TSS e) and PhyloP scores f) for candidate causative SNP positions which overlap a conserved candidate cis-regulatory element versus those which do not. g) Linear regression of body size rank against genomic score computed for CC mouse lines. All silhouettes from PhyloPic. Colors of a), b), and c) as in supplementary fig. S10, Supplementary Material online.
PGLS regression of RBC rank against the genomic score computed using mammal conserved SNP positions a) or primate conserved SNP positions b). In a), closed circles represent primate species, open circles represent nonprimate mammals. Colors as in Fig. 1. c) Comparison of the absolute magnitude of RBC change for internal branches with a positive residual of matches (allele effect direction matches phenotypic change) to mismatches compared to those with a negative residual. Comparison of the distance to TSS d) and PhyloP scores e) for candidate causative SNP positions which overlap a conserved candidate cis-regulatory element versus those which do not. f) Linear regression of RBC rank against genomic score computed for CC mouse lines.
We next repeated this analysis using just primates since a much larger portion of the human genome aligns to and is conserved only within primates (supplementary fig. S10, Supplementary Material online). This approach allowed for more SNPs to be used in computing genotype scores. RBC (P < 0.015, Ppermulations = 0.063, Fig. 2b) and urea concentration (P < 0.003, Ppermulations = 0.061, supplementary fig. S8B, Supplementary Material online) were found to have significant, positive relationships across primates while body size showed no relationship (P = 0.47, Ppermulations = 0.31; Fig. 1b). While mammals overall had a correlation between genotype and phenotype for body size, primates exhibited a strongly negative correlation within this analysis (Fig. 1a). When analyzed separately using the larger set of SNPs which aligned within primates, however, primate species showed no relationship between genomic score and body size (Fig. 1b), suggesting the larger SNP set has different properties than the smaller set conserved across all mammals. To explore this further, we extracted the SNPs which were alignable in primates but not mammals (henceforth “primate unique SNPs”) and reran the PGLS using this reduced set. The primate unique set showed a significant positive relationship for body size (P = 0.015, Ppermulations = 0.057, Fig. 1c), suggesting that sequences specific to the primate lineage might be more relevant to primate body size evolution than more deeply conserved sequences. The primate unique sets PGLS for RBC (P = 5.0 e−5, Ppermulations = 0.005, Table 1) and alanine aminotransferase (P = 0.014, Ppermulations = 0.080) also exhibited significant positive relationships.
Enrichment of Signal in Internal Branches
Since closely related species are often phenotypically similar, many of the genetic changes underlying that shared trait likely became fixed on the branch leading to their last common ancestor. Therefore, if parallel evolution occurs between species as shown in the previous analyses, internal branches with high trait change should be enriched for nucleotide substitutions which drive a corresponding trait change in humans. To address this hypothesis, we identified all the human SNP positions with a fixed substitution in a single mammalian lineage (and not present in any other species) and then tested for an association between an excess of fixed substitutions with the predicted direction of effect and the absolute amount of predicted trait change for each internal tree branch. Using this approach, we found significant associations for both body size (P = 0.046, one-sided Wilcoxon rank-sum, Fig. 1d) and RBC (P < 0.001, one-sided Wilcoxon rank-sum, Fig. 2c), as well as for calcium (P = 0.019, one-sided Wilcoxon rank-sum, supplementary fig. S3d, Supplementary Material online) and urea (P = 0.044, one-sided Wilcoxon rank-sum, supplementary fig. S8d, Supplementary Material online).
Identification of Putatively Causative Positions
To identify potential causative variants underlying parallel evolution, we sought to determine which SNP positions underly the correlation between species traits and genome-predicted trait values using PGLS. To do this, we used a Wilcoxon rank-sum test to identify positions where species with a given base pair have different phenotype ranks and then tested whether the sign of these positions matches the effect direction as predicted by GWAS for the corresponding trait in humans (see Materials and methods). This resulted in 12,255 and 992 putatively causative variants for body size and RBC, respectively. For body size, there is a slight enrichment (19 positions) in the number of putative causative variants compared to the number of positions with alleles significantly associated with the trait but in the opposite direction to that seen in humans. RBC shows a slightly reduced set of putatively causative variants compared to mismatches (32 less positions). A high number of mismatches in both cases are expected due to the strong phylogenetic signal present in the dataset so many positions will be associated with trait values by chance. To restrict our results to the positions most likely to be biologically meaningful, we filtered the putative causative position sets by overlapping them with candidate cis-regulatory regions (cCREs) identified in both human and mouse (see Materials and methods). This filtering step produced candidate causative sets of 323 positions for body size (supplementary table S2, Supplementary Material online) and nine positions for RBC (supplementary table S3, Supplementary Material online). Compared to the removed positions, these refined sets are significantly closer to transcription start sites (TSSs; body size, P < 2.2 × 10^−16^, RBC, P < 0.005, one-sided Wilcoxon rank-sum, Figs. 1e and 2d) and have higher local PhyloP scores (body size P < 2.2 × 10^−16^, RBC, P < 0.04, one-sided Wilcoxon rank-sum) for both body size and RBC (Figs. 1f and 2e). The mean proximity to the nearest TSS for both refined sets is smaller than the average for all cCRE elements (body size mean = 4,461 base pairs versus ENCODE cCRE mean = 6,873, RBC mean = 2,773, VISION cCRE mean = 3,071), suggesting this enrichment is not fully explained by the cCRE filtering step. If causative positions fall within cis-regulatory elements, the positions should remain proximate to the target gene in all species. To test this, we identified all genes with a TSS within 500 kb of the candidate positions in five species with well-annotated genomes: human, mouse, dog, domestic cattle, and the crab-eating macaque. We found that many of our candidate positions remain in synteny with a least one gene across these five species (RBC 4/9 positions; body size 143/310 positions; Table 2, supplementary tables S2 and S3, Supplementary Material online). These SNPs were generally closely associated with a gene, sitting either in an intron, or directly up or downstream (supplementary fig. S11, Supplementary Material online). Of these positions, many have been previously identified as cis-eQTLs in human blood (Võsa et al. 2021) for at least one of the syntenic genes (RBC 4/4 positions; body size 94/143 positions). While few SNPs have been functionally tested, one of the identified SNPs, rs11248850, sits within a known enhancer element regulating α-globin genes and has been identified as one of two potential causative alleles for a number of hematic traits including RBC count using a massively parallel reporter assay (Raffield et al. 2018).
Switching between the allelic variants at each of these positions should result in differential binding of TF homologs in all species. Since high-quality TF binding motifs are only available for human and mouse, we tested the variants and 10 base pairs of flanking sequence on either side in human and mouse genomic backgrounds for predicted differential TF binding. We found that while some positions were predicted to be bound by a least one TF in both species, our variants were only rarely predicted to substantially disrupt TF binding (supplementary tables S2 and S3, Supplementary Material online). Subtle alterations to TF binding are consistent with the small effect sizes of these variants in humans and can still be biologically relevant.
Validation Using Mouse Models
If the polymorphisms underlying trait variation in humans can partially explain differences in average trait values across species, then these same polymorphisms should also help contribute to variation within other species. To test this hypothesis, we utilized data from the mouse CC, a project designed to create a population of mice with levels of genetic variation comparable to natural populations, such as humans (Philip et al. 2011). Phenotypic and genotypic data for 18 CC strains were obtained (Mao et al. 2015; The Jackson Laboratory n.d.) and these SNPs were filtered by position against the mammal SNP dataset used in this study (see Materials and methods). The ranked genotype score for each CC line was then calculated as described above for all mammals using this reduced set and correlated with the phenotypic rank of each strain using a one-tailed Spearman's test. The two genotype–phenotypes pairs which were correlated across mammals—body size (P = 0.0086, R^2^ = 0.56, Fig. 1g), and RBC (P = 0.013, R^2^ = 0.52, Fig. 2f)—were the only two traits to show a significant, positive relationship in the CC mice. Using this approach, we identified seven putative causative variants for body size and five for RBC in mice. Two body size (rs3743673 and rs6060539) and one RBC (rs2009094) variant were identified as significantly associated with the trait both across mammals and in CC mice. Of the body size positions identified in CC mice, two fall within quantitative trait loci (QTL) intervals for body weight in the large Gough Island mouse (Gray et al. 2015).
One identified body size variant (rs6060369, C/T) resides in a highly conserved hind limb enhancer element (termed R4) in the Growth Differentiation Factor Five (GDF5) cis-regulatory locus. R4 enhancer loss-of-function in mice significantly decreases tibial length at 1 year by markedly decreasing GDF5 expression in vivo as well as in vitro in human chondrocytes (Richard et al. 2020). In the same direction of effect we find here, the shorter height variant (T allele) in this enhancer significantly decreases GDF5 expression in vivo in mouse chondrocytes from the hind limb using allele-specific expression studies, and significantly decreases reporter gene activation in mouse and human chondrocytes compared to the tall height variant (C allele; Richard et al. 2020; Muthuirulan et al. 2021). Moreover, the shorter height variant significantly decreases functional in vivo binding of a key hind limb TF PITX1 (Richard et al. 2020; Muthuirulan et al. 2021), highly conserved in functionality across vertebrates and critical for both body size and hind limb length (Lanctôt et al. 1999; Shapiro et al. 2004).
Discussion
Evolutionary adaptation can follow myriad genetic paths, but these possibilities are limited by existing genetic architecture (i.e. regulatory elements and transcriptions factors) and variation therein. While large effect variants may initially exhibit the strongest response to selection—as is often the case in domestication (Plassais et al. 2022)—selective sweeps can act on many loci of smaller effect to reach the phenotypic optimum (Orr 2005a). Though similar selective pressures will rarely lead to identical molecular changes (Thomas et al. 2017), for highly polygenic traits, whose many causative loci blanket the genome, and sufficiently large numbers of species, the signature of deeply shared genetic architecture can be identified.
In this work, we show that some of the alleles which contribute to human intraspecific differences in RBC and body size are correlated with the corresponding traits across mammals. This agrees with previous research showing that while relatively rare against a backdrop of high regulatory turnover, some regulatory elements maintain deeply conserved function within vertebrates (Schmidt et al. 2010). It makes sense that while some ultra-conserved elements seem intolerant of any variation (Christmas et al. 2023), other conserved elements are potentially tunable and can be used over and over again by evolution to drive similar trait changes. The SNP positions we identify as potentially causative fall into this category, exhibiting high local conservation across the alignment, target genes in synteny, and the potential to alter TF binding relevant to the trait. Our findings also highlight the various paths evolution takes to achieve the same phenotype in different lineages. For example, body size evolution across mammals in general seems to utilize different genomic regions than within the primate clade. That is not to say that many of the variants associated with body size in mammals would not be expected to function similarly in other primates including humans but rather that the variants evolved in parallel in mammalian evolution do not seem to have evolved in parallel within the primate clade. This could be a product of chance, as the mammalian-conserved variants were not available for selection to act upon during major body size transitions within the primate lineage, potentially due to the substantial transposable element-mediated rewiring of the ancestral primate genome and other clade-specific evolutionary constraints and pressures that have manifested since the deep divergence time between primates and other mammalian orders (Andrews et al. 2023). It could also relate to effect size, since the range of body sizes observed in mammals is much larger than in primates, or differences in the biological processes driving body size evolution in different clades (e.g. rate of growth versus duration of growth; Atchley and Hall 1991). In contrast, RBC evolution in primates has used both more recently evolved elements as well as deeply conserved ones.
We failed to identify parallel regulatory evolution in the nine other investigated traits; one explanation for this result is that the methods employed in this study may only have sufficient power to detect parallel evolution in traits which are both highly polygenic and highly heritable, such as body size and RBC. Additionally, the underlying biology of some traits, such as those related to the immune system, may be less likely to evolve in parallel across mammals due to the coevolution of each species with its specific pathogens facilitating greater divergence in regulatory architecture. As new mammalian genomes become available, future research may be able to identify additional parallelly evolving regions for a greater range of traits.
Evolution has produced thousands of extant mammals, each with a unique solution to creating a viable organism using a similar set of genetic tools. Conservation scores have proven to be an effective tool for identifying functional genomic elements (Christmas et al. 2023; Sullivan et al. 2023; Xue et al. 2023) but variable parts of the genome can be similarly informative (Mangan et al. 2022; Keough et al. 2023). Our study highlights the use of interspecific patterns of variation to understand the shared biology of complex traits across Mammalia. Since linkage disequilibrium patterns are poorly conserved even between humans and chimpanzees (Auton et al. 2012), an important implication of these results is that interspecific variation in human SNP positions can potentially be used to identify causative variants in humans, such as rs6060369 for body size and rs11248850 for RBC.
Materials and Methods
Genotype–Phenotype Pair Identification
GWAS traits used in the UK BioBank were filtered for high heritability (h^2^ > 0.1), large sample size (neff > 300,000), inclusion of both sexes, and high confidence (supplementary tables S1 and S4, Supplementary Material online). Traits were manually filtered to avoid highly redundant traits (e.g. lymphocyte percentage vs. lymphocyte count), traits calculated based off other traits (e.g. trunk predicted mass), and traits which are challenging or impossible to evaluate comparatively between species (e.g. tense/high strung).
The resulting GWAS trait list was further reduced to those for which phenotypic data were available for at least 100 species. Phenotypic data for body size primarily came from Pantheria (Jones et al. 2009) but were supplemented by additional sources (see supplementary Materials and Methods, Supplementary Material online). All other phenotypic data came from the Species360 Zoological Information Management System Expected Test Results Database (“ZIMS Expected Test Results” 2022), which includes expected blood chemistry values based on hundreds of tests of healthy zoo animals. When multiple options were available for a phenotype (such as automated versus manual counting), the method with the largest species sample size was selected. A final round of manual curation involved a literature review to identify traits with a highly variable underlying composition across species, such as total cholesterol (Kaabia et al. 2018) and total protein (Hattingh et al. 1983; Windberger et al. 2003), which were excluded from the analysis.
This filtering approach resulted in 11 distinct traits with high-quality human GWAS data and a large set of mammalian phenotypic data (Table 1). Standing height GWAS in humans was used as a proxy for overall body size when compared to other species. Adult body mass data were used instead of head-to-body lengths because the two are highly correlated (Spearman correlation, R = 0.987) within the Pantheria database, and body weight data are available for more species, increasing the total sample size for this trait.
Initial SNP-by-Species Matrix Construction
The positions of the 13,791,467 SNP variants used in the UK BioBank were extracted for the 240 mammal species from the Zoonomia whole-genome alignment (Zoonomia Consortium 2020) using custom Python scripts to produce a SNP-by-species matrix containing the base pair calls. This dataset was then filtered to remove any SNP positions falling within ENCODE blacklist regions since these regions are known to be problematic across various animal genomes (Amemiya et al. 2019). Furthermore, only regions on the human autosomes were retained due to the more complicated inheritance patterns and sex-specific differences in utilization of the genetic information on the sex chromosomes. Finally, only positions with an unambiguous single base pair call were retained (>99% of all alignments were unambiguous), since many mammal genomes are based on a single individual and no data exist on SNP frequencies within these populations. For the purposes of this study, we assume that all base pair calls are fixed in that species.
To further reduce the size of the matrix, only biallelic human SNPs were retained. Since base pairs in other species which do not match either the human reference or alternate allele cannot be interpreted based on human GWAS summary statistics, all base pairs in the alignment which do not match a human allele were removed. In addition, positions with a fixed base pair across all species were also removed since perfectly conserved positions cannot evolve in parallel. Lastly, because any potentially causative SNPs must be distinguishable from randomly fixed base pairs in a single lineage, we removed any positions with a single substitution which distinguished a single clade (i.e. all species with that allele were more closely related to each other than any other species). These SNP positions were used for a separate analysis (see below).
The resulting matrix was filtered to produce two submatrices with all SNPs that have an alignment of 85% across all primate species or 75% across all mammal species. These thresholds were chosen to balance the need to ensure similar alignment percentages across major taxonomic groups while not discarding SNPs that may have failed to align in some species for artifactual reasons. This allowed for large numbers of SNPs to be included in the analysis even if they were missing in some species, either due to technical artifacts or a real evolutionary loss of alignable sequence. A stricter threshold was chosen when only primates were considered because at 75%, many SNPs were alignable in all haplorrhines but not strepsirrhines, which would have added a strong phylogenetic confounder to the results. This filtering step produced matrices with 2,598,368 SNPs for all mammals (240 species) and 4,337,809 SNPs for all primates (42 species; supplementary fig. S10, Supplementary Material online).
Genomic Score Calculation and Trait Association Testing
Due to the substantial variation in prediction accuracy of GWAS polygenic risk scores between (Martin et al. 2019) and within human ancestry groups (Mostafavi et al. 2020), the estimated effect size of individual loci (beta scores) is likely meaningless when applied to other species. Therefore, only the direction of the effect for each allele was considered and fixed at 1 or −1 because signage is highly consistent (>90%) across human ancestry groups (Sakaue et al. 2021). The genomic score for each species was calculated for all SNP positions in the matrix which have been identified at a genome-wide significance threshold of P < 5 × 10^−8^ as the proportion of trait increasing allele matches to all matches to either the trait increasing or decreasing human allele. The genomic scores were then converted into ranks and regressed against the trait ranks using a PGLS approach (Symonds and Blomberg 2014). Rank conversion was used to standardize the scale of the variables, preventing outliers from disproportionately influencing the regression and reducing the impact of small inaccuracies in trait values.
The phylogenetic relationships between the species in the alignment were obtained using the TimeTree database (Kumar et al. 2017). Because many traits are associated with phylogeny, we tested the associations identified using standard PGLS by utilizing a permulations approach which maintains the tree topology but redistributes the phenotypes using a Brownian motion model of evolution to ensure similar clustering of traits and generate a more accurate null P-value distribution (Saputra et al. 2021). This approach was repeated for the mammalian-conserved SNP positions, the primate-conserved SNP positions, and the primate unique SNP positions (those conserved in primates but not mammals).
To calculate genomic scores in the CC mice, a new matrix was generated by subsetting the mammalian-conserved SNP matrix to those positions which have the same alleles in the corresponding positions of the mouse genome. These positions were identified by lifting over the human SNP positions to mm10 (Kent et al. 2002) and extracting the corresponding mouse SNP data from the Mouse Phenome Database (Bogue et al. 2023). Phenotypic data for 18 lines of CC mice were downloaded from the same source and averaged between the sexes (Mao et al. 2015; Bogue et al. 2023; The Jackson Laboratory n.d.). The cross strategy used to make the CC lines prevents the lines from having a clear phylogenetic relationship so a standard linear regression was used instead of PGLS to compare the genomic score and trait ranks for the CC lines. Significance was assessed using a one-tailed Spearman's test.
Investigation of Evolution in Internal Branches
To test whether clade-specific substitutions are associated with major trait changes on the ancestral branch for that clade, we identified all SNPs which had a base pair fixed in a particular clade and not present in any other species (“monophyletic SNPs,” excluded in the previous analysis). For each genotype–phenotype pair, the degree of phenotypic change was calculated by subtracting the estimated trait value of the direct ancestor from that of the descendent node at each internal node of the tree. The estimated trait values were generated using the phytools fastANC function (Revell 2012) on the ranked phenotype data. For each internal branch, the monophyletic SNPs were intersected with the corresponding GWAS significant SNP sets (P < 5 × 10^−8^) and the SNPs with a fixed allele on the branch which matched a human allele associated with the direction of change observed on that branch were counted. Since the total number of monophyletic SNPs was positively associated with longer branches which tended to have large trait change values, we also counted the number of mismatches where the fixed monophyletic SNP at a GWAS significant position matched the human allele associated with the opposite direction of the trait change. A linear correlation between the number of matching and mismatched positions for each internal node was calculated and the relationship between a positive or negative residual and absolute amount of trait change on each branch was evaluated using a one-sided Wilcoxon rank-sum test.
Identification of Putatively Causative Positions
Since only body size and RBC showed a consistent signal, only these two traits were investigated further. To identify the SNP positions which are potentially causative for each trait, we used a Wilcoxon rank-sum test to identify positions where species with the human reference allele have different phenotype ranks compared to those with the human alternative allele and then filtered out the positions where the trait-allele association matches the effect direction as predicted by GWAS for the trait in humans.
The potentially causative SNP position sets were filtered for overlap with candidate cis-regulatory elements (cCREs) which are present in both human and mouse. For body size, cCRE sets for human and mouse were obtained from ENCODE (The ENCODE Project Consortium 2020) and the intersection was performed using bedtools intersect. Since cCRE data related to hematopoiesis have been specifically generated in human and mouse (Xiang et al. 2020), the VISION cCRE set was used instead for RBC count, following the same overlap approach. For each trait, the PhyloP scores for each SNP position padded by 10 base pairs on either set were calculated using the Zoonomia PhyloP scores (Sullivan et al. 2023) and an elevation in average PhyloP score for the positions overlapping a cCRE element compared to those without an overlap was calculated using a one-sided Wilcoxon rank-sum test. Similarly, the minimum distance of each position to the nearest TSS as defined in the refTSS database (Abugessaisa et al. 2019) was calculated and the positions overlapping a cCRE element were compared to those without an overlap using a one-sided Wilcoxon rank-sum test.
To determine possible target genes regulated by the identified cCREs, each SNP position was expanded by 500 kb in both directions and this window was overlapped with gene positions from the NCBI refSeq database for hg38 using bedtools intersect. If the positions in conserved cCREs are cis-regulatory elements for the same gene in all mammals, the positions should be in synteny with their target gene in all mammalian species. The candidate causative positions were, therefore, lifted over to the mouse (mm10), crab-eating macaque (macFas5), dog (canFam6), and domestic cattle (bosTau9) genomes using the appropriate UCSC liftover files. The refSeq list of gene positions was downloaded for each species and those associated with each position were determined as described in human above. The resulting gene lists in synteny with a given SNP position in each species were intersected in R. Finally, *cis-*expression quantitative trait locus (*cis-*eQTL) data from human blood samples (Võsa et al. 2021) were used to further tie candidate SNP positions to target genes.
To determine whether the variants at these positions have the potential to disrupt TF binding across mammals in these positions in a similar manner, we first used the motifbreakR (version 2.14.2) snps.from.rsid function (Coetzee et al. 2015) with information from the packages SNPlocs.Hsapiens.dbSNP155.GRCh38 and BSgenome.Hsapiens.UCSC.hg38 to get the human sequence surrounding each SNP. Motif enrichments for each sequence variant were identified using the motifbreakR function with the following parameters: filterp = TRUE, threshold = 1e−4, method = “ic,” bkg = c(A = 0.25, C = 0.25, G = 0.25, T = 0.25), BPPARAM = BiocParallel::bpparam(). This step was repeated with the added parameter show.neutral = True to identify all TFs binding to the target region, regardless of any difference in binding between alleles. To identify motif enrichments in mouse, the regions were lifted over from hg38 to mm10 using the R liftOver package version 1.22.0 (Bioconductor Package Maintainer 2023). Mouse reference sequences were extracted using the BSgenome.Mmusculus.UCSC.mm10 package (version 1.4.3). If the mouse reference base pair at a given position matched either human allele, the other human allele was used as the alternate allele. If the mouse reference allele does not match either human allele, both alleles were listed as alternative alleles. The position weight matrices for either human or mouse came from HOCOMOCOv10 (Kulakovskiy et al. 2018). The TFs binding to the variants at a given position in both species were intersected to find candidates with potentially conserved function in both species.
Supplementary Material
msae157_Supplementary_Data
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abugessaisa I , Noguchi S, Hasegawa A, Kondo A, Kawaji H, Carninci P, Kasukawa T. ref TSS: a reference data set for human and mouse transcription start sites.J Mol Biol. 2019:431(13):2407–2422. 10.1016/j.jmb.2019.04.045.31075273 · doi ↗ · pubmed ↗
- 2Amemiya HM , Kundaje A, Boyle AP. The ENCODE blacklist: identification of problematic regions of the genome.Sci Rep. 2019:9(1):9354. 10.1038/s 41598-019-45839-z.31249361 PMC 6597582 · doi ↗ · pubmed ↗
- 3Andrews G , Fan K, Pratt HE, Phalke N, Karlsson EK, Lindblad-Toh K, Gazal S, Moore JE, Weng Z, Andrews G, et al Mammalian evolution of human cis-regulatory elements and transcription factor binding sites. Science. 2023:380(6643):eabn 7930. 10.1126/science.abn 7930.37104580 · doi ↗ · pubmed ↗
- 4Atchley WR , Hall BK. A model for development and evolution of complex morphological structures.Biol Rev Camb Philos Soc. 1991:66(2):101–157. 10.1111/j.1469-185X.1991.tb 01138.x.1863686 · doi ↗ · pubmed ↗
- 5Auton A , Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J, et al A fine-scale chimpanzee genetic map from population sequencing. Science. 2012:336(6078):193–198. 10.1126/science.1216872.22422862 PMC 3532813 · doi ↗ · pubmed ↗
- 6Bioconductor Package Maintainer . 2023. Lift Over: changing genomic coordinate systems with Rtracklayer::Lift Over. https://www.bioconductor.org/help/workflows/lift Over/.
- 7Bogue MA , Ball RL, Philip VM, Walton DO, Dunn MH, Kolishovski G, Lamoureux A, Gerring M, Liang H, Emerson J, et al Mouse phenome database: towards a more FAIR-compliant and TRUST-worthy data repository and tool suite for phenotypes and genotypes. Nucleic Acids Res. 2023:51(D 1):D 1067–D 1074. 10.1093/nar/gkac 1007.36330959 PMC 9825561 · doi ↗ · pubmed ↗
- 8Boyle EA , Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic.Cell. 2017:169(7):1177–1186. 10.1016/j.cell.2017.05.038.28622505 PMC 5536862 · doi ↗ · pubmed ↗
