Natural selection exerted by historical coronavirus epidemic(s): comparative genetic analysis in China Kadoorie Biobank and UK Biobank

Sam Morris; Kuang Lin; Iona Y. Millwood; Canqing Yu; Jun Lv; Pei Pei; Liming Li; Dianjianyi Sun; George Davey Smith; Zhengming Chen; Robin G. Walters

PMC · DOI:10.1186/s12864-025-11876-4·October 21, 2025

Natural selection exerted by historical coronavirus epidemic(s): comparative genetic analysis in China Kadoorie Biobank and UK Biobank

Sam Morris, Kuang Lin, Iona Y. Millwood, Canqing Yu, Jun Lv, Pei Pei, Liming Li, Dianjianyi Sun, George Davey Smith, Zhengming Chen, Robin G. Walters

PDF

Open Access

TL;DR

The study finds evidence that ancient coronavirus epidemics in East Asia led to genetic changes that can still be seen in modern populations.

Contribution

The study uses large-scale genetic data from East Asia and the UK to show evidence of historical coronavirus-driven selection in East Asians but not in Europeans.

Findings

01

Regions of long-range linkage disequilibrium were enriched for coronavirus VIPs in the China Kadoorie Biobank but not in the UK Biobank.

02

Selection-scan methods like saltiLASSI confirmed the enrichment of VIPs in East Asian populations.

03

SARS-CoV-2 GWAS signals showed no population-specific selection patterns, consistent with being a novel virus.

Abstract

Pathogens have been one of the primary sources of natural selection affecting modern humans. The footprints of historical selection events – “selective sweeps”– can be detected in the genomes of present-day individuals. Previous analyses of 629 samples from the 1000 Genomes Project suggested that an ancient coronavirus epidemic ~ 20,000 years ago drove multiple selective sweeps in the ancestors of present-day East Asians, but not in other worldwide populations. Using a much larger genetic dataset of 76,719 unrelated individuals from each of the China Kadoorie Biobank (CKB) and UK Biobank (UKB) to identify regions of long-range linkage disequilibrium, we further investigated signatures of past selective sweeps and how they reflect previous viral epidemics. Using independently-curated lists of human host proteins which interact physically or functionally with viruses (virus-interacting…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

VIP

Proteins2

Species3

Homo sapiens(human · species)Gammacoronavirus(genus)Severe acute respiratory syndrome coronavirus 2(no rank)

Diseases4

SARS-CoV-2 critical illness infection infectious disease

Figures2

Click any figure to enlarge with its caption.

Schematic to estimate the enrichment of different VIP classes for regions of selection. We began by using two different methods to identify regions of as candidate regions of selection, PCA HMM and saltiLASSI. a The first step of the PCA HMM is to perform a PCA on the subset of unrelated individuals and then identify regions of the genome which show distortions in PC loadings using an HMM-based algorithm (b). Panel (b) shows a spike in the loadings of one principal component, caused by a region of long-range LD. The other method, saltiLASSI, identifies regions of the genome which show a strong

Funding7

—National Natural Science Foundation of China
—National Key Research and Development Program of China
—Medical Research Council
—Wellcome Trust
—UK Medical Research Council
—Cancer Research UK
—British Heart Foundation

Keywords

GenomicsSelectionPathogensHumansComputational biology

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsEvolution and Genetic Dynamics · Zoonotic diseases and public health · Genetic Associations and Epidemiology

Full text

Background

Pathogens and their associated diseases have been widespread across human history [1]. In particular, the transition from sparsely populated groups of hunter-gatherers to densely-packed farming communities in close vicinity to domesticated animals likely facilitated the spread of many novel pathogens from animals to humans, and then within and between human populations [2, 3]. Despite widespread and substantial improvements in sanitation and treatment of infectious diseases, pathogens were still responsible for about a quarter of global deaths in 2019 [4]. Thus, they are expected to have exerted substantial selective pressure on human populations throughout history; indeed, analysis of genetic data has suggested that pathogens represent the strongest selective effect on modern humans [5].

The impact of such past natural selection on the ancestors of modern humans can be observed in the genomes of present-day populations using a variety of statistical methods (e.g. Extended Haplotype Homozygosity [6], Population Branch Statistic [7], reviewed in [8]). These techniques have identified many immune-related loci inferred to have been targets of natural selection [9–12], supporting the hypothesis that pathogens play an important role in shaping patterns of human genetic variation. One such footprint of selection is known as a ‘selective sweep’: as an allele under positive selection rapidly increases in frequency within a population across generations, neighbouring alleles which are in linkage disequilibrium (LD) with the selected allele also rise to high frequency, erasing genetic diversity around the locus under selection [13, 14]. Such selective sweeps can be detected by scanning the genome to identify e.g. long-range homozygous haplotypes [6, 9] or significant distortions of the haplotype frequency spectrum [15].

One set of likely pathogen-related targets of selection are virus interacting proteins (VIPs), which are classes of human proteins known to physically interact with or provide functions essential for replication of particular viruses. Previous analyses using sequencing data from the 1000 genomes project [16] identified an enrichment of selective sweep signals at genes encoding coronavirus (a type of RNA virus) VIPs in East Asian (EAS) but not European-ancestry (EUR) populations. Conversely, no evidence for enrichment at genes for DNA-virus VIPs was found [12]. Together, these results imply one or more historic coronavirus epidemics, either localised to East Asia or with signature(s) not detectable in EUR (e.g. due to different demographic histories or higher levels of post-selection genetic drift).

In the past few decades, there have been multiple epi/pandemics related to novel coronaviruses (i.e. COVID-19, MERS and SARS), which likely arose from zoonotic transmission. However, there also exist several ‘seasonal’ coronaviruses which are endemic in human populations, such as HCoV-229E and HCoV-NL63 [17]. It is possible that these current seasonal coronaviruses originated as epidemics similar to the more recent epi/pandemics. Accordingly, the signals of selection identified by Souilmi et al. [12] may reflect selection events related to the ancestors of these endemic viruses, and that there were multiple different sweeps related to several different endemic viruses. The COVID-19 pandemic predominantly affected older individuals in terms of mortality, suggesting that its selective impact at the population level through reproductive fitness may be limited. However, recent studies have indicated that long COVID, potentially among younger as well as older individuals, can lead to pathologies in various physical systems, including cardiovascular, neurological, cognitive and immune [18]. Consequently, there may be a selective effect of long COVID mediated through its effect on these systems.

The finding that historical coronavirus epidemic(s) may have occurred in the ancestors of present-day EAS populations has important consequences for future studies on the effect of population-wide prior pathogen exposure on the risk of infection from novel diseases. Whilst the methodology employed by Souilmi, et al. gave statistically robust conclusions, their study was conducted on relatively small sample sizes (~ 500 individuals across 5 EAS populations). To better characterise these historical selective sweeps, larger scale studies are needed, in both EAS and other populations; increasing sample size is known to improve precision when detecting weaker/more ancient sweeps [19, 20].

We have sought to replicate and extend the reported findings using a much larger genetic dataset comprising sets of 76,719 unrelated individuals from each of the China Kadoorie Biobank (CKB) and UK Biobank (UKB). We identified regions of long-range linkage disequilibrium (LRLD) in each population, and found that VIPs for coronaviruses, but not DNA viruses, were enriched for overlap with LRLD in CKB. By contrast, we found no clear evidence for any VIP enrichment in UKB. These findings were supported by concordant results for VIP enrichment at genomic regions identified by a selection scan using a different approach, in which distortion of the haplotype frequency spectrum was used to detect signals of selection. Together, these results provide further strong supporting evidence that one or more historical coronavirus epidemics occurred specifically in East Asia.

Results

Virus-interacting protein classification

VIPs are proteins expressed in humans that have been shown to interact with viruses, either physically or by providing functions essential for viral propagation. Genes encoding these proteins may be subject to selection driven by viral epidemics. We used a set of proteins grouped into VIP categories, as previously defined by Souilmi et al. [12], based on low-throughput molecular methods and high-throughput mass-spectrometry (Fig. 1f, Supplementary Table S1). VIPs were classified based on i) whether they primarily interact with DNA or RNA viruses; ii) whether or not RNA virus VIPs interact with coronaviruses; and iii) whether or not coronavirus VIPs interact with SARS-CoV-2 viruses. In addition, we defined a separate subset of 42 SARS-VIPs previously identified as being potential sites of selection in past coronavirus epidemics [12] and which would be expected to be similarly identified in our analyses and which, therefore, could be used as a positive control to test the effectiveness of our analytical approach.Fig. 1. Schematic to estimate the enrichment of different VIP classes for regions of selection. We began by using two different methods to identify regions of as candidate regions of selection, PCA HMM and saltiLASSI. a The first step of the PCA HMM is to perform a PCA on the subset of unrelated individuals and then identify regions of the genome which show distortions in PC loadings using an HMM-based algorithm (b). Panel (b) shows a spike in the loadings of one principal component, caused by a region of long-range LD. The other method, saltiLASSI, identifies regions of the genome which show a strong distortion in the haplotype frequency spectrum in a sliding window approach (c) to calculate a selection scan test statistic. We then identified peaks of this test statistic (d). We then estimated the enrichment of each set of VIPs (e) with regions of selection (f). We calculated the empirical overlap between the regions of selection and different classes of VIPs (g) and then calculated whether this overlap is greater than expected by chance by permuting/bootstrapping the regions of selection across the genome to generate a null distribution (h)

PCA-based identification of long-range LD regions

Natural selection and other demographic processes can result in regions of LRLD in the genome. In previous work, to facilitate genotype PCA analysis of the CKB cohort, we identified such regions of LRLD using an approach similar to one previously used in UKB [21], by applying an iterative hidden Markov-model based algorithm to principal components (PCs) derived from genotypes of 76,719 unrelated CKB individuals (see Methods) [22]. Excluding the extended region of LD at the chromosome 6 HLA region (chr6:20-40Mbp), we identified 229 unique regions of LRLD (median length = 593.1Kbp, total length = 218.1Mbp) on the basis of distortions in the variant loadings of the top 11 PCs (i.e. those previously identified as being informative for geographic population structure in CKB) (Supplementary Table S2). Using the same approach for analysis of genotypes from a similar number of 76,719 randomly-selected unrelated white British individuals from UKB, applied to the top 5 (geographically-informative) PCs [23], we identified 326 LRLD regions (median length = 1070.0Kbp, total length = 518.77 Mbp) (Supplementary Table S3). Further sets of LRLD regions were defined based on splitting the CKB LRLD regions according to whether they were uniquely identified in CKB (n = 128) or they overlapped with UKB LRLD regions (n = 104) (Supplementary Table S4-5).

Enrichment of long-range LD at VIP genes

We hypothesised that if a particular class of VIPs were the target of natural selection, then the genes encoding those VIPs would overlap with regions of LRLD more often than expected by chance. To test this, we compared the observed overlap of VIPs with LRLD regions with empirical null distributions, derived using sets of “decoy” LRLD regions generated by repeatedly redistributing the LRLD regions randomly across each chromosome while retaining their size characteristics, as illustrated in Fig. 1. Table 1 shows the results of this analysis for different classes of VIP and different sets of LRLD regions. Consistent results were found for 3 different methods for scoring LRLD—VIP overlap– i) any overlap, ii) > 50% overlap, iii) total base-pair overlap (Supplementary Tables S6-7).Table 1. Enrichment of regions of Long-range Linkage Disequilibrium (LRLD) at Virus Interacting Protein (VIP) genesLRLD Selection Region SetCKBUKBCKB-onlyCKB + UKBVIPsenrichmentPenrichmentPenrichmentPenrichmentPAll (4768)1.13 (0.98–1.31)0.0470.98 (0.88–1.08)0.6731.15 (0.95–1.42)0.0761.11 (0.89–1.42)0.178RNA viruses (3495)1.19 (1.01–1.41)0.0181.01 (0.90–1.14)0.4291.25 (1.02–1.59)0.0161.12 (0.87–1.48)0.183DNA viruses (1273)1.03 (0.83–1.28)0.4190.92 (0.80–1.08)0.8520.93 (0.70–1.31)0.6871.12 (0.82–1.62)0.242CoV (394)1.50 (1.10–2.16)0.0041.04 (0.84–1.32)0.3741.89 (1.21–3.40)0.0011.11 (0.74–2.00)0.316non-CoV (3101)1.16 (0.99–1.38)0.0381.03 (0.91–1.16)0.3371.20 (0.97–1.54)0.0461.12 (0.86–1.48)0.195non-SARS (77)1.86 (1.00–4.33)0.0211.07 (0.71–1.88)0.3722.00 (1.00–8.00)0.0231.67 (0.62-Inf)0.208SARS (317)1.41 (1.02–2.16)0.021.04 (0.82–1.34)0.411.86 (1.13–3.71)0.0051.07 (0.65–2.14)0.424under-selection (42)2.50 (1.25–10.00)0.0051.50 (0.86–4.00)0.0832.00 (0.80-Inf)0.1093.00 (1.20-Inf)0.011Enrichment and P-values (one-tailed test) were determined for the frequency with which the genes encoding different classes of VIPs overlap with sets of LRLD regions, compared with random permutation of LRLD locations across the genome. Number in brackets denotes the number of VIPs in that class included in the analysis. Enrichment and 95% CIs were derived from the median and 2.5%/97.5% centiles of the null and P from the empirical 1-tailed test against the null. The HLA region (chr6:20-40Mbp) and VIP genes lying within it were excluded from analysis. Asterisks next to P-values denote significance after multiple testing adjustment (P < 0.05/(2*2.56), see Methods)

Compared to the null distribution, there was strong evidence in CKB for LRLD enrichment at loci encoding the subset of SARS-VIPs (n = 40 after exclusion of the HLA region) previously identified as likely sites of selection (enrichment ratio ER = 2.50; 95% CI 1.25–10.00; P = 0.005). This finding provides further population genetic evidence in support of the previous finding that one or more ancient coronavirus epidemics occurred in East Asia approximately 25,000 years ago, and indicates that, as expected, the identified regions of LRLD are enriched for signals of selection.

Using the same approach to test the other classes of VIPs, there was a strong signal in CKB that CoV-VIP genes (n = 394) are enriched for regions of LRLD (enrichment ratio, ER = 1.50; 95% CI 1.10–2.16; P = 0.004) relative to the null. This LRLD enrichment was further investigated by classifying VIPs according to whether they have been found to be related to SARS viruses, or only related to other types of coronavirus (i.e. endemic coronaviruses). Both classes of VIPs displayed LRLD enrichment in CKB, with somewhat greater enrichment in non-SARS CoV-VIPs (ER = 1.86, 95% CI 1.00–4.33; P = 0.021) compared to SARS CoV-VIPs (ER = 1.41, 95% CI 1.02–2.16; P = 0.020). There was also suggestive evidence for more moderate enrichment of LRLD at genes encoding non-CoV-VIPs (ER = 1.16; 95% CI 0.99–1.38; P = 0.038), and for RNA-VIPs overall (ER = 1.19; 95% CI 1.01–1.41; P = 0.018). Conversely, DNA-VIPs (n = 1,273) showed no enrichment for regions of LRLD (ER = 1.03; 95% CI 0.83–1.28; P = 0.419), again consistent with findings from the previous study by Souilmi et al. [12].

By contrast with CKB, in UKB we found no evidence for enrichment of LRLD near to genes encoding CoV-VIPs (P = 0.316), or for any other kind of VIPs (all P > 0.05) (Table 1). Furthermore, the LRLD enrichment observed for CKB was predominantly due to LRLD regions found only in CKB. For almost all classes of RNA VIP, enrichment for overlap with CKB-only LRLD was greater than for the main analysis while, conversely, overlap with LRLD regions identified in both CKB and UKB displayed less enrichment and/or was not statistically significant. The one exception was a threefold enrichment for the “under selection” SARS-CoV-VIP genes, although this was based on only 6 overlapping genes.

Detecting signals of selection using saltiLASSI

In addition to selective sweeps, regions of LRLD may also arise from demographic processes such as population bottlenecks [24, 25], potentially confounding the above analysis. To address this issue, we aimed to replicate our findings using putative signals of natural selection identified using an unrelated method, saltiLASSI [15], which detects regions of the genome that display substantial distortions of the haplotype-frequency spectrum (illustrated in Fig. 1c). Importantly, saltiLASSI is robust to false positives driven by e.g. demographic events. Applying saltiLASSI to the same sets of unrelated individuals from CKB and UKB, across the autosomes (again excluding the extended HLA region on chromosome 6), we identified 117 non-overlapping regions in CKB showing strong evidence of selection (median length = 175.2Kbp, total length = 35.3Mbp), and 118 regions in UKB (median length = 134.3Kbp, total length = 25.2Mbp) (Supplementary Tables S8-9). A total of 42 regions of selection overlapped between CKB and UKB, comprising 6.20Mbp of overlapping DNA.

We used the regions of selection identified by saltiLASSI as a ‘truth-set’ to determine the validity of the previously identified LRLD regions. Scoring LRLD regions in the order in which they were identified according to their overlap with saltiLASSI regions, ROC curves confirmed that the LRLD approach successfully identified regions of selection (Supplementary Fig. 1).

Enrichment of saltiLASSI regions at VIP genes

We again assessed the proximity of VIP structural genes to signals of selection, scoring for each VIP class the proportion lying within 10Kbp of a saltiLASSI-identified region (Table 2). For each of CKB and UKB, only 2.7% of DNA virus VIPs were close to at least one signal of selection, substantially fewer than for any other VIP class, consistent with the findings from our LRLD analysis and the previous finding of no evidence of selective sweep enrichment near DNA VIP genes [12]. By comparison, in CKB the proportion of CoV-VIP genes in close proximity to a saltiLASSI signal of selection was substantially larger (27/394, 6.9%; P = 5.6 × 10^–5^), with similar proportions for both SARS (6.9%; P = 1.1 × 10^–4^) and non-SARS (6.5%; P = 0.026) VIPs, in each case representing an enrichment of ~ 2.5-fold relative to DNA virus VIPs. Further, for the SARS-VIPs previously identified as under selection, the proportion of genes close to a saltiLASSI signal of selection was even larger (7/40, 17.5%; P = 5.5 × 10^–8^), a 6.6-fold enrichment. By contrast, in UKB any enrichment of genes in proximity to signals of selection relative to DNA virus VIPs was much more limited. Nevertheless, there was near twofold enrichment for CoV-VIPs (20/394, 5.1%; 1.9-fold; P = 0.012) and for SARS-VIPs (17/317, 5.4%; 2.0-fold; P = 0.0096). However, we note that in these analyses the counts of overlapping VIPs are relatively small and therefore small changes in the number of DNA VIP overlaps may have a large effect on the resulting P-value, i.e. the analyses may be underpowered and vulnerable to inflation of P-values.Table 2. Overlaps of genes encoding different classes of Virus Interacting Protein (VIP) with saltiLASSI selected regions in UKB and CKB participantssaltiLASSI Selection Region SetCKBUKBCombinedVIPOverlapEnrichmentPOverlapEnrichmentPPDNA_ref34/1273 (2.7%)1.00ref35/1273 (2.7%)1.00ref1.000RNA144/3495 (4.1%)1.540.010*131/3495 (3.7%)1.360.0480.460non-COV117/3101 (3.8%)1.410.035111/3101 (3.6%)1.300.0830.736CoV27/394 (6.9%)2.575.6 × 10^–5^*20/394 (5.1%)1.850.0120.367non-SARS5/77 (6.5%)2.430.0263/77 (3.9%)1.420.2770.719SARS22/317 (6.9%)2.601.1 × 10^–4^17/317 (5.4%)1.950.00960.509under-selection7/40 (17.5%)6.555.5 × 10^–8^2/40 (5.0%)1.820.1990.154Genes encoding different classes of VIPs were scored according to whether they lay within 10Kbp of saltiLASSI-identified regions of selection (“overlap”), Enrichment and P-values were calculated relative to the proportional overlap for DNA virus VIPs using a two-sample proportion test, with the DNA VIP overlap as the null success rate. The HLA region (chr6:20-40Mbp) and VIP genes lying within it were excluded from analysis. Asterisks next to P-values denote significance after multiple testing adjustment (P < 0.05/(22.36)). "Combined" is a P-value from a Fisher exact test comparing for differences in overlap between UKB and CKB

To provide a more rigorous test for these observed enrichments of signals of selection near to VIPs, we adopted a bootstrapping approach similar to that used for LRLD. For each set of VIP structural genes, we scored the number that were in close proximity (within 10Kbp) to one or more saltiLASSI-identified regions, and compared this with a null distribution derived by redistributing the saltiLASSI-selected regions 10,000 times across the genome. In order to retain large-scale patterns of GC and gene-content, permutation of the location of selection regions was restricted to blocks of genome on the same chromosome and with similar characteristics; since the permutation blocks were in the region of 300Kbp, it was necessary to exclude from analysis large saltiLASSI selection regions exceeding the block size (those > 500Kbp) and the VIPs in close proximity to them. Nevertheless, despite the resulting reduction in statistical power, in CKB both CoV-VIPs (enrichment ratio, ER = 2.12; 95% CI 1.13–5.67; P = 0.004) and SARS-VIPs (ER = 2.17; 95% CI 1.08—6.50; P = 0.009) were once again strongly enriched for proximity to saltiLASSI-selected regions compared to the empirical null (Table 3). By contrast, we found no evidence for appreciable enrichment in any class of VIPs in UKB, consistent with the similar analysis of regions LRLD. In each case, DNA-VIPs gave enrichment very close to 1 compared to the null, suggesting that the method is well-calibrated. These findings were robust to variations in the distance between the VIP structural gene and selection region used to define ‘proximity’, and to different sensitivity thresholds for detection of selection regions, with different sets of parameters giving qualitatively the same results (Supplementary Tables S10a-d).Table 3. Overlaps between genes encoding different classes of Virus Interacting Protein (VIPs) and saltiLASSI selected regions in UKB and CKB participantssaltiLASSI Selection Region SetCKBUKBVIPOverlapEnrichmentPOverlapEnrichmentPAll114/4663 (2.4%)1.06 (0.78—1.54)0.362105/4667 (2.2%)1.05 (0.77—1.50)0.381DNA viruses21/1254 (1.7%)0.72 (0.48—1.24)0.88923/1250 (1.8%)0.85 (0.56—1.44)0.719RNA viruses93/3409 (2.7%)1.18 (0.85—1.75)0.15782/3417 (2.4%)1.12 (0.81—1.67)0.240non-CoV76/3031 (2.5%)1.07 (0.78—1.62)0.32072/3033 (2.4%)1.11 (0.79—1.64)0.283CoV17/378 (4.5%)2.12 (1.13—5.67)0.00410/384 (2.6%)1.43 (0.71—3.33)0.152non-SARS4/74 (5.4%)4.00 (0.80—Inf)0.0301/76 (1.3%)1.00 (0.20—Inf)0.464SARS13/304 (4.3%)2.17 (1.08—6.50)0.0099/308 (2.9%)1.50 (0.82—4.50)0.088under-selection1/39 (2.6%)Inf (0.33—Inf)0.1422/38 (5.3%)2.00 (0.67—Inf)0.062Enrichment and P-values (one-tailed test) were determined for the frequency with which the genes encoding different classes of VIPs lie within 10Kbp of saltiLASSI-identified regions of selection, compared with 10,000 bootstrap iterations of randomly distributing regions of selection across the genome, controlling for local gene density. Enrichment and 95% CIs are derived from the median and 2.5%/97.5% centiles of the null distribution. The HLA region (chr6:20-40Mbp) and other excludable regions (see methods), and VIP genes lying within them, were excluded from analysis. Asterisks next to P-values denote significance after multiple testing adjustment (P < 0.05/(2*2.36))

Since selective sweeps can be mimicked by regions of low recombination which was not controlled for in the above permutation analyses, we assessed whether there was any variation in recombination rates between different classes of VIPs which might influence our findings. A permutation test based on the chi-squared F-statistic revealed there was no evidence of differences in log transformed mean recombination rates across different VIP categories (P = 0.624) (Supplementary Table S12). Thus, there are no differences in recombination rate which might drive apparent enrichment of selective signals in particular VIP categories.

Regional analysis

Since CKB participants were recruited in 10 geographically diverse regions across China [26], we conducted further analyses to explore whether there were differences in selective signals between regions which might narrow down the geographical origins of the putative historical epidemic(s) which gave rise to the LRLD and saltiLASSI signals. Using phased CKB genetic data, we identified haplotypes which spanned the regions of LRLD that overlapped with CoV VIPs (n = 36) and determined the frequencies of a random subset of 2000 of the most common haplotypes in the different CKB recruitment regions (Supplementary Table S11). No consistent pattern was discernible from this analysis, with no region showing strong evidence of having higher frequencies of these long-range haplotypes (Fig. 2a). Similarly, we repeated saltiLASSI analyses for equal numbers of individuals from each CKB recruitment region (n = 10) and, for the 28 saltiLASSI selection signals which overlapped with COV-VIPs in the main analysis, scored the frequency with which these selection signals were identified when restricting analysis to individuals from a single recruitment centre (Fig. 2b. Again, no clear difference was observed between regions, with 78–92% (mean 87%) of the signals being replicated across each region.Fig. 2a Frequency, in each recruitment region (n = 10), of the most common haplotypes which span the regions of long-range LD which overlap with the COV-VIPs (n = 36). We took the phased CKB data and extracted the haplotypes which covered the regions inferred to be under selection by the LRLD selection method and calculated the frequency of the most common haplotype (across the entire dataset), within each region. Each region was sub-sampled to 2000 randomly selected individuals to ensure comparability across regions. b Replication of saltiLASSI selection hits in different recruitment centres (RC) in China (n = 10). We analysed each recruitment region using saltiLASSI separately, after subsampling the number of individuals to match the RC with the fewest number of individuals and determined whether selection hits obtained from analysing the full cohort were replicated in each RC. The colour of each tile represents the proportion of the selection hit inferred in the full cohort which was covered by the selection hit inferred a given CKB region

Overlap with SARS-CoV-2 susceptibility GWAS associations

To explore whether signals of selection overlapped with recently published GWAS associations for COVID susceptibility [27], we selected all 51 lead SNPs which reached genome-wide significance (i.e. 5 × 10^–8^) "Critical illness", "Hospitalized" and "Reported infection" and counted the overlap between these loci (defined by the region within 200Kbp of the lead variant) and either the LRLD or the saltiLASSI regions identified in CKB and UKB, again using permutation across the genome to provide an empirical null. Of the 51 lead variants for COVID-19 susceptibility, 6 overlapped with regions of LRLD in CKB (ER = 1.67, 95% CI 0.71– Inf, P = 0.257), and 5 overlapped with UKB LRLD regions (ER = 0.71, 95% CI 2–12, P = 0.836). The corresponding analysis using saltiLASSI regions, and the bootstrapped regions from the previous section, yielded similar results for both CKB (7/51; ER = 1.75, 95% CI 0.78–7.00, P = 0.09) and UKB (5/51; ER = 1.25, 95% CI 0.56–5.00 P = 0.27). The saltiLASSI results were robust to variation in the size of the window surrounding the GWAS lead variants (50Kbp-500Kbp) (Supplementary Table S12).

Discussion

Viral epidemics are expected to exert relatively fast-acting selection on the human genome. Such events can leave footprints in the form of ‘selective sweeps’, in which linked neutral variants ‘hitchhike’ to higher frequency, thereby reducing genetic variation around a selected locus and generating regions of LRLD [13, 28]. Previous work provided evidence that an ancient coronavirus epidemic(s) more than 20,000 years ago drove selective sweep(s) in the genomes of EAS individuals [12]. However, this study was based on a relatively small size of ~ 100 individuals from each of 5 populations, and simulations have shown progressive increases in sweep detection accuracy with increasing sample sizes; for instance, for a sweep occurring 1000 generations ago, increasing the number of samples from 500 to 1000 yields between a 1.07 to 1.23 fold increase in the probability of accurately identifying the sweep location to within 50-200Kbp of the true location [20]. We sought to further investigate these potential signatures of historic viral epidemics in a substantially larger dataset comprising ~ 70 k individuals from each of CKB and UKB, and again found that genes encoding proteins which interact with coronaviruses are significantly more likely to be near regions of selection, in Chinese but not British individuals.

We first assessed enrichment of VIP genes in regions of LRLD, relative to a null distribution in which regions of the same size were repeatedly randomly distributed around the genome, an approach widely used to assess the significance of associations of one genomic feature with another (e.g. [29, 30]). DNA virus VIP genes showed no enrichment of LRLD, consistent with the previous finding of no signals of selection at these VIPs. By contrast, genes encoding a set of SARS-VIPs that were previously identified as potentially under selection showed a 2.5-fold enrichment of nearby LRLD regions. Together, these findings indicate that this approach is both well-calibrated and capable of detecting signatures of selection.

However, LRLD does not arise exclusively from selective sweeps and can arise due to various non-selective processes, such as restrictions on genetic recombination due to genomic structural variants such as inversions [31–33]. Furthermore, demographic events such as population bottlenecks can produce regions of LRLD [24] and other signatures which mimic positive selection and can thus confound selection scans. Such processes could influence both the previous analyses [12] and our analysis based on LRLD. To address these potential issues, therefore, we conducted a separate analysis in which signals of selection were instead identified using saltiLASSI, which accounts for the spatial distribution of the sweep test statistic across the genome and is thereby more robust to non-selection demographic events. Once again, there was no enrichment of signals of selection near to DNA virus VIPs, while the previously identified set of SARS-VIPs potentially under selection showed substantial enrichment relative to other VIP classes.

These two analyses, based on methodologically distinct approaches to identification of regions of selection, gave consistent results when applied to different classes of VIP. Both identified strong enrichment of selection regions in CKB near to CoV-VIPs and SARS-VIPs, but not near non-CoV-VIPs. Both methods also identified enrichment at non-SARS-VIPs (comparable to that at SARS-VIPs), although this was based on a smaller number of VIPs in this class, giving wide confidence intervals. Conversely, we found no evidence of any significant enrichment of any classes of VIPs in UKB. Furthermore, the observed enrichment in CKB was almost entirely derived from regions of selection identified only in CKB and not in UKB. These findings, that there is enrichment of signatures of selection at genes encoding CoV-VIPs in CKB but not UKB, are entirely consistent with the hypothesis that one or more historical epidemics of coronaviruses (or other viruses which interact similarly with cellular processes) occurred in the ancestors of modern-day EAS populations.

Given the geographical restriction of the putative selective sweep(s), it was of interest to explore whether the much larger sample size in our analyses enabled any greater geographical resolution of the origins of such sweep(s). However, we found no clear evidence that enabled localisation of the enrichment of selection signals to one or more particular regions in China. This is perhaps not surprising, as many population migrations and population mixing have taken place across China in the past 20,000 years which are likely to have obscured any region-specific signals [34]. Alternatively, any epidemic may have been widespread across East Asia, which would be consistent with the results of Souilmi et al. who found signals in other East/South East Asian countries.

It is known that environmental pressures in the history of a population may confer lasting adaptations to humans [35–37]. Therefore, it is plausible that widespread, and potentially repeated, historical coronavirus epidemics in East Asia may have provided a degree of resistance for modern-day East Asians to the recent COVID-19 pandemic. Evidence has shown that different populations have different mortality risks from severe COVID-19 [38, 39]. Whilst it is clear that sociodemographic factors and provision of appropriate health care play a substantial part in these differences, there is also the possibility that variants with protective effects against COVID-19 may be distributed differentially across populations. For instance, African-American ancestry has been reported as an independent risk factor for hospitalisation from COVID-19 [40],

If ancient and current coronavirus epidemics have VIPs in common, the putative historic epidemic in East Asia may have driven selective sweeps in regions of the genome which are currently under selection by COVID-19. This could be manifested in a higher number of overlaps between GWAS hits and regions of selection in Chinese compared to British individuals. However, we found no discernible differences between the populations. Whilst this may be due to a low number of GWAS hits, or may reflect insufficient EAS individuals in GWAS of COVID-19 susceptibility, this may also point to differences in the proteins relevant to present-day and ancient coronaviruses. A further consideration is that signatures of selection from the putative historical epidemic(s) will include contributions from the long-term effects of the virus, analogous to the long-term “Long COVID” effects of SARS-CoV-2, which are not included in GWAS of COVID-19 susceptibility.

Nevertheless, observational evidence suggests that countries in East Asia have a substantially lower acute case-fatality rate than comparable countries in Western Europe [41]. While it is very likely a substantial part of the observed differences in severity and case-fatality rates between different populations are due to non-biological factors, e.g. public policy and differences in social behaviour, these data, alongside VIP enrichment results from this and previous studies, suggest that EAS populations may have a higher frequency of alleles protective against severe COVID-19, that may be one cause of the reduced case-fatality rates in these populations. For instance, non-synonymous mutations in TMPRSS2 that confer decreased COVID-19 susceptibility are found at higher frequencies in EAS (36%) than EUR (23%) [42].

Variation in genetic susceptibility to disease across different ancestries, driven by differing natural selection environments, is well documented; for instance, alleles which provide a protective effect against Malaria are found at substantially higher frequencies in West Africa than in Europe [43, 44]. Further, in vivo studies have shown that the transcriptional response of primary macrophages to live bacterial pathogens varies between ethnicities, and that genetic effects in the immune response are strongly enriched for recent, population-specific signatures of adaptation [45]. It is also known that epi/pandemics drive population-level immunity which protects against future outbreaks; historical evidence cites that during the initial outbreak of the plague across Europe, during 1347 to 1353, not a single town was re-infected two or more years running [46]. Thus, it is plausible that past coronavirus selective sweeps in EAS populations have provided a degree of resistance to COVID-19. For instance, the ACE2 locus on the X chromosome (not included in our analysis, which was restricted to autosomes) has reduced haplotype diversity in EAS, one of the two major ACE2 haplotypes has been suggested to be under positive selection, with possible implications for COVID-19 susceptibility [47]. Furthermore, single-cell transcription analyses have shown that natural selection has driven population-specific differences in the immune response to SARS-CoV-2 [48].

The key strength of our study is the use of biobank-scale data to provide greater power to identify signals of selection. Apart from the large sample sizes, the genetic data in CKB and UKB were generated using similar Axiom arrays with 50% of the genotyped variants being the same, reducing the chance of possible array-specific confounding. Moreover, we used a method, saltiLASSI, which is more robust to false-positives than the approach used by Souilmi et al. [15], and the involvement of 10 geographically diverse regions in CKB enabled us to explore possible regional variations within China.

However, the study also has limitations, most notably that, unlike the study by Souilmi et al. which was based on sequence data, we only had access to genotype array data. This means we could not explore more detailed parameters of putative sweeps, such as their strength and age, and we were unable to seek to replicate the inferred dates of previous epidemics and associated selective sweeps. In addition, the methods we use did not specifically account for the fact that variable recombination rates across the genome may affect selective sweep detection; for example, we would expect that recombination coldspots might mimic selective sweeps or make detecting them easier. However, we found no evidence for strong differential differences in recombination between VIP classes. In addition, we suggest that any potential bias introduced by the inclusion of additional false-positive signals of selection in our analysis would be towards the null.

Conclusions

This study provides further evidence that epidemics have shaped the genetic landscape of East Asian populations, as observed through significant enrichment of coronavirus interacting protein (CoV-VIP) genes in regions undergoing selection. Our findings, leveraging large biobank-scale datasets, reinforce the important role of pathogen epidemics in human evolutionary history but also underscore the potential influence of ancestral viral exposures on population-specific disease susceptibility as a research avenue.

Supplementary Information

Supplementary Material 1. Supplementary Material 2. Supplementary Material 3. Supplementary Figure 1. Receiver Operating Characteristic (ROC) curves comparing the performance of Long Range-LD selection regions, using saltiLASSI selection regions as the truth set. The left panel (labelled "BP overlap") evaluates performance based on summing the total number of overlapping base pairs, while the right panel (labelled "Counting overlap") evaluates performance based on counting overlapping regions. The True Positive Rate (TPR) is plotted against the False Positive Rate (FPR) for different thresholds, with a dashed diagonal line representing the performance of a random classifier for comparison.

Bibliography1

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Pan YW, Liu PH, Wang F, Wu P, Cheng FJ, Jin X, et al. Lineage-specific positive selection on ACE 2 contributes to the genetic susceptibility of COVID-19. Natl Sci Rev. 2022;9(9).10.1093/nsr/nwac 118PMC 952936036199986 · doi ↗ · pubmed ↗