Calculating genetic risk scores directly from summary statistics with an application to type 1 diabetes
Steven Squires, Michael N Weedon, Richard A Oram

TL;DR
This paper introduces a method to calculate genetic risk scores using only summary statistics, demonstrated with an application to type 1 diabetes.
Contribution
A novel method to compute genetic risk scores directly from summary statistics, enabling broader comparisons without raw genetic data.
Findings
The mean T1D GRS calculated from genetic data and summary statistics were nearly identical (10.31 vs. 10.38).
The AUC for case-control discrimination was 0.917 for genetic data and 0.914 for summary statistics, showing strong performance.
Abstract
Genetic risk scores (GRS) summarise genetic data into a single number and allow for discrimination between cases and controls. Many applications of GRSs would benefit from comparisons with multiple datasets to assess quality of the GRS across different groups. However, genetic data is often unavailable. If summary statistics of the genetic data could be used to calculate GRSs more comparisons could be made, potentially leading to improved research. We present a methodology that utilises only summary statistics of genetic data to calculate GRSs with an example of a type 1 diabetes (T1D) GRS. An example on European populations of the mean T1D GRS for those calculated from genetic data and from summary statistics (our method) was 10.31 (10.12–10.48) and 10.38 (10.24–10.53), respectively. An example of a case-control set for T1D has an area under the receiver operating characteristic curve…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9| Mean (s) | Minimum | Maximum | |
|---|---|---|---|
|
| 60.4 | 45.0 | 90.8 |
|
| 124.0 | 106.8 | 139.5 |
|
| 81.3 | 68.0 | 119.0 |
|
| 112.9 | 98.4 | 145.0 |
|
| 108.2 | 87.7 | 143.9 |
| Mean | Standard deviation | |
|---|---|---|
| SAS | 9.05 (8.86–9.25) | 2.29 (2.13–2.43) |
| SAS | 9.03 (8.87–9.19) | 2.43 (2.31–2.55) |
| AFR | 8.42 (8.28–8.56) | 1.88 (1.78–1.97) |
| AFR | 8.40 (8.28–8.52) | 1.86 (1.76–1.95) |
| EAS | 10.51 (10.34–10.69) | 2.05 (1.94–2.17) |
| EAS | 10.48 (10.36–10.6) | 1.96 (1.86–2.04) |
| EUR | 10.31 (10.12–10.48) | 2.11 (1.96–2.25) |
| EUR | 10.38 (10.24–10.53) | 2.26 (2.15–2.36) |
| AMR | 10.72 (10.5–10.97) | 2.17 (2.02–2.33) |
| AMR | 10.83 (10.7–10.98) | 2.15 (2.06–2.24) |
| HLA | Non-HLA | Interaction | |
|---|---|---|---|
| AFR | 5.62 (5.45–5.78) | 2.24 (2.19–2.28) | 0.56 (0.47–0.64) |
| AFR | 5.62 (5.49–5.75) | 2.23 (2.19–2.26) | 0.55 (0.49–0.61) |
| SAS | 6.01 (5.82–6.21) | 3.01 (2.95–3.07) | 0.03 (0.00–0.07) |
| SAS | 5.96 (5.82–6.11) | 3.03 (2.98–3.07) | 0.04 (0.01–0.06) |
| EAS | 6.98 (6.79–7.18) | 3.40 (3.36–3.45) | 0.12 (0.07–0.18) |
| EAS | 6.95 (6.82–7.07) | 3.40 (3.37–3.43) | 0.14 (0.10–0.18) |
| AMR | 7.30 (7.04–7.55) | 3.22 (3.14–3.30) | 0.21 (0.11–0.29) |
| AMR | 7.34 (7.20–7.47) | 3.24 (3.19–3.28) | 0.22 (0.17–0.27) |
| EUR | 6.43 (6.21–6.62) | 3.54 (3.47–3.60) | 0.34 (0.26–0.42) |
| EUR | 6.45 (6.30–6.61) | 3.52 (3.47–3.57) | 0.32 (0.27–0.38) |
| Correlated | Correlated | Non-correlated | |
|---|---|---|---|
| number | mean score | mean score | |
| EUR | 23 | 6.96 (6.76–7.15) | 3.01 (2.9–3.12) |
| EUR | 23 | 6.88 (6.69–7.06) | 3.06 (2.94–3.17) |
| SAS | 15 | 6.11 (5.93–6.3) | 2.91 (2.8–3.02) |
| SAS | 15 | 6.07 (5.91–6.26) | 3.03 (2.91–3.17) |
| AFR | 15 | 5.91 (5.74–6.08) | 1.95 (1.85–2.04) |
| AFR | 15 | 5.9 (5.74–6.04) | 1.94 (1.84–2.04) |
| AMR | 22 | 7.84 (7.6–8.08) | 2.68 (2.55–2.8) |
| AMR | 22 | 7.9 (7.67–8.15) | 2.72 (2.6–2.85) |
| EAS | 21 | 4.75 (4.58–4.91) | 5.64 (5.54–5.73) |
| EAS | 21 | 4.81 (4.67–4.96) | 5.67 (5.58–5.77) |
- —National Institute for Health
- —Care Research Exeter Biomedical Research Centre
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenetic Associations and Epidemiology · Diabetes and associated disorders · Liver Disease Diagnosis and Treatment
1 Introduction
A genetic risk score (GRS) provides a summary of variation in the genome that are related to some trait of interest (Lewis and Vassos 2020). Generally these are produced by application of genome wide association studies (GWASs) (Uffelmann et al. 2021) that search through genetic variants looking for parts of the genome that are associated with that trait and provides both the statistical significance and the strength of the association.
A GRS is produced by taking the genetic variants that show a statistically significant association with a trait. The associated variant weights (strength of the association) are summed with the number of alleles of that variant (out of the two chromosomes). There are also more complex GRSs, e.g. including interaction terms between variants, which turn the standard linear GRS into a more complicated non-linear GRS (Oram et al. 2016). A useful GRS will provide good discrimination between the groups of samples with the trait from those without.
The primary motivation for this work is to make the calculation of GRSs more accessible for research. While methods exist for simulating haplotype data (Montana 2005) to our knowledge there are no approaches to directly simulate GRSs. Our work is focussed on providing and demonstrating the practical use of an approach to calculation of a GRS in populations with available summary statistics. A common research problem might be about how a specific GRS works on some particular dataset, or on multiple datasets. As the purpose of a GRS is the capacity to discriminate between samples with and without a trait, making multiple comparisons can be essential. For example, if an aim is to test an existing or new GRS it is likely that it will require multiple datasets to provide a more confident judgement about its performance. Datasets of interest might include: different population ancestry, variants of a disease, geographic locations, or socio-economic status.
There are two major issues with calculation of GRSs in multiple datasets: one is access to the data and the second is the expertise required to correctly calculate the GRS. The first issue can be difficult because of, for example, ethical (Takashima et al. 2018), legal (Clayton et al. 2019), or computational (Qin et al. 2015) reasons. The second issue is that genomic data tends to be held in specific file formats and requires particular software to be effectively run, requiring people with specific skill-sets. This is a barrier for groups without this expertise to utilise GRSs.
Our aim is to provide a method to calculate a GRS from summary statistics without the need for the genetic data. Ethical and legal requirements can be significantly reduced as only a small amount of data about the dataset needs to be released to enable the production of simulated data. A range of scores on multiple datasets can be produced to enable much wider comparisons. Significantly less specific technical skills are required as the summary statistics are all that is needed to calculate the GRSs along with the provided Python code. If summary statistics could be collected and stored in one place then it would be possible to automatically produce datasets of GRSs from just those summary statistics.
We provide a methodology to take easily accessible summary statistics and generate simulated complex GRSs without the need for access to the original genotype array data. Our method requires: single nucleotide polymorphism (SNP) frequencies, correlations between SNPs, and any SNPs that deviate from Hardy-Weinberg equilibrium (HWE) for each sub-group of interest (such as ancestry, cases or controls). Both simulated SNP arrays (for the SNPs in the GRS) and the GRS are generated from these summary statistics. We demonstrate that our approach produces similar results to GRS calculated on genetic data, relevant sub-parts of the GRS, and distributions of the SNP values.
2 Methodology
2.1 Overview
Our approach to calculating GRSs from summary statistics comprises three overall steps: (1) collection of necessary summary statistics; (2) generation of the simulated SNP array; (3) production of the final GRS. We briefly describe the overall method and show the steps in Algorithm 1 before going into detail subsequently. Algorithm 1.Method overview1) Collect inputs:• Allele frequencies.• Correlations between SNPs.• Fractions for SNPs out of HWE.2) Simulate SNP array:• Simulate SNPs from allele frequencies if in HWE.• Simulate SNPs out of HWE from fractions.• Calculate networks of correlated SNPs.• For each group of networked SNPs produce loss function.• Iterate through each correlated group swapping SNP-sample pairs if reduces loss function.• Continue iterating through until loss tolerance is reached or time/iteration limit reached.3) Generate GRS:• Linear scores generated using number of effect alleles with SNP score.• Interaction term generated and ranked then added to linear score.There are three summary statistics required: the allele frequencies of each of the SNPs; the correlations between all the SNPs; the proportions of homozygous and heterozygous alleles if any of the SNPs deviate from Hardy-Weinberg equilibrium (HWE). These can either be accessed by application to a server producing these summary statistics or by calculation from the original SNP arrays.
The aim of simulating SNP arrays is to produce individual samples that are indistinguishable from real data. Our approach is to simulate SNP arrays that have both the same allele frequencies and the same correlations between SNPs as the real data.
We perform this in several stages. First, N samples are generated from the allele frequencies with any SNPs that deviate from Hardy-Weinberg equilibrium generated separately based on the proportions of homozygous and heterozygous alleles. Second, the correlated SNPs are collected into G groups of correlated SNPs. Within each group the correlations between the SNPs are calculated and a loss function is defined. Pairs of each correlated SNP in turn (across two samples) are compared and switched if they reduce the loss function. This approach is continued until the loss function falls below a chosen tolerance. The same procedure is applied for all G groups, which results in the final simulated SNP array.
As an example of GRS production, we use a type 1 diabetes (T1D) GRS (Sharp et al. 2019) that includes both a linear term and a term related to interacting SNPs. These two parts are generated separately from the simulated SNP array and combined together. Our model for calculating a GRS from summary statistics would work on any linear model (the interaction terms are excluded) or any GRS with both a linear and interaction part that is similar to this T1D GRS.
2.2 Datasets and GRS
To develop our method and generate results, we need the relevant summary statistics. To test that the simulation of the SNP arrays works effectively, we need to also have real SNP array data so we can make direct comparisons.
We use two publicly available datasets: the 1000G version 3 data (Auton et al. 2015) and UK Biobank (Sudlow et al. 2015) SNP array data. The 1000G dataset consists of 2,503 samples from five super-populations: Europeans (EUR), Africans (AFR), Americans (AMR), East Asians (EAS), and South Asians (SAS). These five super-populations are made up of a total of 26 populations with 4 to 7 populations making up each super-population. We consider five subsets of the UKBB data. Four of these are defined by applying principal component analysis (PCA) to the SNP arrays and defining the subsets by similar locations in the first two principal components to four of the 1000G data (EUR, AFR, EAS, SAS). The fifth subset is European samples with T1D.
As a demonstration of the effectiveness of the method we consider a T1D GRS (Sharp et al. 2019) which consists of 67-SNPs. The GRS involves both a set of linear weights on the number of effect alleles and an additional pair-wise interaction term based upon which of 18 pairs of 14 SNPs each sample has. Importantly, the GRS contains multiple SNPs from part of the genome where there are known to be substantial correlations between the SNPs (within the human leukocyte antigens, [HLA], region). Therefore, it is important when generating simulated data to fully include these correlations between SNPs to ensure we are generating a set of SNP array data which is similar to the real data.
Due to missing SNP data in the 1000G dataset we exclude two SNPs leaving 65 in the GRS. For the UKBB data we use all 67 SNPs.
2.3 Collection of summary statistics
We collect summary statistics from two datasets: the 1000G and the UK Biobank. For 1000G we use the NIH LDlink suite of applications (Machiela and Chanock 2015) to extract variant frequencies and linkage disequilibrium (LD) between SNPs. We also use the 1000G SNP array data to extract the SNPs that deviate from Hardy-Weinberg equilibrium.
For the UK Biobank, we use imputed array data, extracting the required SNPs and taking the frequencies and correlations from the arrays. We use similar positions in 2-dimensional PCA space as the 1000G super-populations.
2.4 Generation of simulated SNP arrays
We apply two constraints to the SNP arrays, the first is that the frequencies of the alleles need to match the summary statistics from the real data. Secondly, the correlations between the SNPs need to also match the known correlations. There may be a third constraint if any of the SNPs are out of Hardy-Weinberg equilibrium, for example due to population stratification. If that is the case, then we need to impose an additional constraint to set the correct proportions of homozygous and heterozygous alleles for these SNPs.
The key aspect of our method is that the allele frequencies (and any deviations in homozygosity and heterozygosity) remain constant under permutations of the SNPs between samples. So, if we set up a matrix of sample and SNP values that has the correct SNP frequencies (and homozygosity deviations) then if we permute individual SNP-samples (i.e. switch over SNP values between samples) in the matrix we do not alter the frequencies but can change the correlations between SNPs.
We define the simulated SNP array as a matrix where p is the number of SNPs in the GRS and N is the number of desired simulated samples. We consider each SNP in turn and generate the simulated data in one of two ways. If the SNP holds to HWE then we generate it from the frequency by the generation of two binary N-dimensional vectors (one for each chromosome) with probabilities set by the allele frequency, the SNP values are then the sum of the two. If the SNP deviates then we have the fractions of the homozygous and heterozygous values, and we generate fractions of 0 s, 1 s, and 2 s as required.
On generation of the SNPs we have a matrix of size of 0 s, 1 s, and 2 s but with no correlation between the SNPs. The SNPs with no expected correlation are left alone. For the rest of the SNPs we group them into those that are correlated together. In Figure 1 we show an example of correlated SNP groups. The nodes (labelled A-G) represent SNPs and the lines represent those SNPs which are correlated together. So are correlated together but neither A or B are correlated with D or E. E is only correlated with D. D is correlated with both C and E. F and G are correlated together but neither is correlated with any of the other SNPs. As the pair of groups and are not correlated together we can optimise them separately.
Examples of correlated SNPs. The two groups A-E and F-G are not correlated together. Within group A-E: A, B and C are all correlated together; C is additionally correlated with D; D is correlated with C and E; E is just correlated with D. F and G are correlated together and neither is correlated with any other SNPs.
To find the correlated groups we define a correlation threshold between SNPs above which the SNPs are defined as correlated (with number of SNPs, ) and below which are defined as uncorrelated. This correlation threshold is set to exclude spurious correlations caused by chance; for the experiments we predominantly set it to 0.1, the same as the default used by LDlink (Machiela and Chanock 2015).
We build an adjacency matrix for all SNPs and use the Python network analysis package NetworkX (Hagberg et al. 2008) to extract the set of groups of correlated SNPs, . Each group of SNPs is connected by correlations between then, e.g. the SNPs or from Fig. 1.
We now have a set of groups of SNPs. We perform the same procedure separately for each, , group. We set up a loss function which is defined by
where the sum is over the correlated pairs of SNPs (from 1 to l), is the correlation between the pair of simulated SNPs and is the real correlation between the SNPs. For example, the loss function for the left side set A-E of SNPs in Fig. 1 would be given by
For a group of correlated SNPs, , the procedure is as follows. We consider all possible SNP pairs in and randomise the order we consider the SNPs. We then take a random permutation of the N samples and compare the neighbouring (randomised pairs) of samples, for that pair of SNPs. We swap the SNP values across the pair of samples and check if the loss function has reduced, if it has we make the change permanent, if not then we revert to the previous state. The algorithm then moves onto the next pair of samples and performs the same procedure. If at any point the loss falls below the defined tolerance the iterations are stopped and the SNP values are set. Once all the pairs of randomised samples have been iterated through (and if the loss is still above the tolerance level) the same procedure is applied to the next pair of SNPs (if there are more than two SNPs in the group). Once the procedure has been applied to all pairs of SNPs, the order of SNPs is randomised so different pairs of SNPs are compared in order. A new random permutation of samples is set up and the iteration through pairs of samples continues. There is also a set iteration or time limit that can be imposed.
The iteration limit applies to each group of correlated SNPs; if it is reached the algorithm stops updating the SNPs in that group. This will be before the desired tolerance is reached; those SNPs will not have reached the desired level of pairwise correlation. The tolerance set is the acceptable level of the loss given by Equation 1. When the loss is below the set tolerance level the algorithm stops iterating and returns the results. The final convergence of the approach is achieved when all the groups of correlated SNPs have had their loss reduced below the tolerance level. All results in this paper are from SNP arrays which reached the final convergence.
Part of this process is shown, as a toy example, in Fig. 2. There are two correlated SNPs (the two columns) with four samples (the four rows). The algorithm first swaps and (step 1). It then calculates the new loss function and it has not fallen therefore in step 2 the pair of SNPs are returned to the original positions. The algorithm then (step 3) swaps and and checks the loss function. This time the loss function is lower so it leaves the SNP array in the new permutation.
A toy example showing the algorithm being applied to two correlated SNPs with four samples. In step 1 x11 and x21 are swapped but the loss function does not fall. So in step 2 the algorithm reverts the array back to the previous state. Then in step 3 x11 and x31 are swapped and this time the loss function is lower, so the new state is kept.
The overall procedure for simulation of a SNP array is shown in Algorithm 2.
Algorithm 2.Simulating SNP arrays Inputs: Frequencies. SNP fractions. Correlations. Correlation threshold. Loss tolerance. Iteration or time limit. Steps:Generate SNP array from allele frequencies or SNP fractions.Build adjacency matrix from real correlations above the correlation threshold.Separate SNPs into groups of correlated SNPS for i = 1 to m do ** **Generate loss function ** Within group with h correlated SNPs: ** while iteration below iteration limit and loss above tolerance do
** Randomise order of SNP comparisons ** for i = 1 to h-1 do ** for** randomised, non-identical sample pairs do ** **Temporarily swap values ** **Check loss function and if smaller than previous lowest loss function make swap permanent end for end for end while end for
Output: Simulated SNP array,
2.5 Production of the GRS from simulated SNP arrays
The simulated SNP array is a matrix with values of with each row representing a sample and each column a SNP. We demonstrate the production of a GRS that is a combination of a linear weighted sum of the alleles with an interaction term between pairs of SNPs. Any GRS of this form, or with just the linear weights could be simulated with this method.
The linear weighted sum requires an effect allele and a weight. The SNP value is switched if the effect allele and the allele representing 2 are not matched. For all p-SNPs the scores are weighted and summed.
For this GRS there is also an interaction term. Here if combinations of pairs (18 in total) of 14 of the SNPs are present together then there is an addition term. Only one interaction term (or none) is permitted per sample and there is a ranking of terms if more than one is present which gives the final interaction term. The interaction term and the linear term are summed together to give the final score.
2.6 Analysis of the simulated data
This method produces both a SNP array and a final GRS. To assess how well the approach performs, we compare the data from genetic arrays to the data from summary statistics for both 1000G and UKBB datasets. We investigate the final GRS values, relevant sub-parts of the GRS, and comparisons of the SNP arrays themselves.
2.7 Comparison to final GRS
The 1000G data is a combination of data from different populations around the world. These are combined into 5 super-populations from 26 populations. The simulation method requires the allele frequencies and SNP correlations but these are dependent on which grouping is considered, i.e. whether we use summary statistics from the super-population or population level. We investigate simulations at both levels.
The UKBB data is drawn from the UK so is of people of predominantly European descent but there are also substantial numbers of people of other ancestries. We therefore define four different super-populations using PCA and simulate data from those groups. As a fifth group we also use 387 people with T1D from UKBB. This group of European T1D samples (together with the European UKBB super-population) lets us also compare the receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) for the real and simulated data.
2.8 Comparison to Sub-parts of the GRS
We also explore how the model performs at simulating sub-parts of the GRS. Risk of T1D is dominated by the HLA region so we split the GRS into three parts: SNPs from the HLA region, SNPs not from the HLA region, and the interaction term, to assess whether the simulation is correctly estimating each part. In addition, we separately compare the parts of the GRS generated from SNPs that are uncorrelated with any other SNPs in the GRS and those SNPs which are correlated with at least one other SNP in the GRS.
2.9 Comparison of the SNP arrays
The GRS we consider has 67 SNPs (65 for the 1000G due to missing SNPs). The frequencies of the simulated SNP arrays are defined by the algorithm to match the real frequencies and it (unless it reaches an iteration/time limit before the loss function has fallen below the tolerance limit) adjusts the SNPs with known correlations to other SNPs to be correlated equivalently to the real data (within a tolerance). However, there may be differences between the simulated and real SNP arrays that are not clear from either these summary statistics or the sub-parts of the GRS.
We therefore explore whether the overall structure of the simulated SNP arrays is similar to real SNP arrays by performing principal component analysis (PCA) on the 67 (or 65) SNPs and projecting down onto the first two principal components (PCs). We then investigate similarities and differences between the simulated and real SNP arrays at the level of the first two PCs.
2.10 Experiment details and assessment of algorithm performance
For the experiments conducted using the 67 (65) SNP GRS we predominantly utilised a minimum required correlation between SNPs of 0.1, which is the default on LDlink below which the correlations are not considered meaningful. We set a tolerance of 0.03; all results shown in this paper met the convergence criteria and none were cut short by the iteration limit.
Indicative time to run the algorithm that generates a simulated SNP array and then calculates the GRS is shown in Table 1. The algorithm was run on a laptop (11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00 GHz 3.00 GHz with 32GB RAM) 100 times for the five 1000G super-populations to give an estimate of the average time to run the algorithm and the range of likely times. The iteration limit was set at 10000, the number of samples at 5000 and the tolerance at 0.03; none of the runs reached the iteration limit. The mean, minimum, and maximum time in seconds (s) for the algorithm are shown in Table 1
3 Results
3.1 Comparison to final GRS
In the top left plot of Fig. 3, we show the mean simulated GRS against the mean real GRS for the 1000G super-populations and populations. The super-populations are shown with a larger marker. In the top right plot, we show plots of the mean simulated GRS against mean real GRS for the UKBB super-populations and the T1D population. The real and simulated mean scores for all super-populations and populations are similar. We also show distributions of the 1000G super-populations for real and simulated data (bottom plot).
(Top left) Mean simulated GRS compared to mean real GRS for the 1000G super-populations and populations. The super-populations are marked with a larger marker size. The populations are labelled with their super-population labels, rather that population labels, for clarity. (Top right) Mean simulated GRS compared to mean real GRS for the UKBB super-populations and the T1D population. (Bottom) Distributions of the 1000G super-populations for real (left of pair and in blue) and simulated data (right of pair and in orange).
In Fig. 4, to the left of the dashed line, we show the distributions of real (left of pair and coloured blue) and simulated (right of pair and coloured orange) GRS for the five populations within the European super-population. Each population has approximately 100 samples within. To the right of the dashed line (violin plots coloured green) we show repeated simulations (labelled S2–S6) of the CEU population (with 99 samples) to demonstrate that there is some variation in the distributions simulated.
Left of the dashed line are the GRS distributions of the European populations with real GRS (left of each pair and in blue) and simulated GRS (right of each pair and in orange) shown side-by-side. TSI (Tuscan from Italy), FIN (Finnish), IBS (Iberians from Spain), GBR (British), and CEU (northern and western European ancestry from Utah, USA) are shown. To the right of the dashed line, and in green, are repeated simulations of the CEU data (labelled S2-S6).
In Table 2, we show the mean and standard deviation of the real (with subscript R) and simulated (with subscript S) GRSs for the super-populations from 1000G. For the mean and standard deviation (St. Dev.) the real and simulated metrics are mostly similar.
In the left plot of Fig. 5, we show ROC curves and AUC scores for separation of the European T1D and non-T1D data from UKBB for real and simulated data. There are no significant differences between AUC scores for real (denoted in the legend as R. AUC) and simulated (denoted in the legend as S. AUC) data. In the right plot of Fig. 5 we show ROC curves and AUC scores of three simulations of the T1D data (leaving the non-T1D simulated data the same) to show variability in ROC curves from repeated simulations.
(Left) ROC curves for separation of the non-T1D and T1D European populations of UKBB by the real (blue dashed line) and simulated (red dotted line) GRSs. Associated AUC scores are shown for the real (R. AUC) and simulated (S. AUC) data in the legend. (Right) ROC curves for three simulations of the T1D data (with the same simulated non-T1D scores for all three) with AUCs in the legend.
3.2 Comparison to subparts of GRS
The mean sub-parts of the GRS: the linear score of the HLA SNPs (HLA), the linear score of the non-HLA SNPs (Non-HLA), and the interaction term (Interaction) for the 1000G super-populations are shown in Table 3. All sub-parts are equivalent between the simulated and real data with no significant discrepancies.
In Table 4, we show the number of SNPs which are correlated with at least one other SNP in the GRS (Correlated number) and the mean summed scores for both the SNPs which are correlated with at least one other SNP and SNPs which are not correlated with any other SNPs in the GRS for the 1000G super-populations. The simulated GRS produces similar mean scores to the real data for both the correlated and non-correlated SNPs.
3.3 Comparison to SNP arrays
In Fig. 6, we show a comparison of the real (left plot) and simulated (right plot) 1000G data when applying PCA to the whole dataset (2503 samples) and the 65 SNPs. The first two principal components are shown and the colours represent the different super-populations. The pattern of the real and simulated data is similar with comparable separation of the super-populations.
PCA applied to the real (left) and simulated (right) whole 1000G dataset (2503 samples). The first two principal components are shown and the colours represent the different super-populations.
In Fig. 7 we demonstrate, using the AMR super-population, the importance of which summary statistics are used to simulate the SNP arrays. The top left plot shows the first two principal components applied to the 1000G AMR real data while the top right is the simulated SNP array when the simulations were generated using summary statistics at the super-population level. The bottom left shows simulated PCA of the SNP array when the four different populations making up the super-population are generated separately (with summary statistics at the population level) and concatenated together before PCA is applied. The bottom right plot shows the same plot as the bottom left but with markers and colours representing each population. The data is more spread out when the summary statistics are used from the populations rather than the super-populations.
The first two principal components of the real and simulated 65-SNP T1D GRS for the AMR super-population. Top left plot is the real SNP PCA. Top right is the simulated data when generated from the summary statistics on the super-population as a whole (labelled super-pop.). The bottom left is simulated data when four separate SNP arrays are generated for the four populations making up the super-population and then the SNP arrays are concatenated together before PCA is applied (labelled pops.). The bottom right plot shows the same as the bottom left but with the populations denoted with different marker shapes and colours.
In Fig. 8, we show that the simulation model can produce structure in the first two principal components similar to the real data. The top left and top right plots show the first two principal components of the real and simulated data of the AFR super-population of the 1000G data with the seven populations represented by different marker colours and shapes. In the bottom row of Fig. 8, we show a partial cause of the structure in the AFR super-population first two principal components plots. The left and right plots are the first two principal components of the real data and the simulated data, respectively. The three marker colours/shapes represent those samples which have one SNP (rs17843689) either as 0, 1, or 2 (on the effect allele).
The top left and right plots are the first two principal components of the AFR super-population data from the 1000G for real and simulated data respectively. Similar structure is visible in both PCA plots. The different colours of the markers are different populations. The bottom left (real data) and right (simulated data) plots show the relationship between a sample from the AFR super-population of the 1000G having a SNP (rs17843689) variant and its location in PCA space. The colours represent whether the sample has a 0, 1, 2 value for that SNP, labelled as SNP 0, SNP 1, SNP 2 respectively.
In Fig. 9, we show the first two principal components for the European T1D and non-T1D data from the UKBB dataset. The red crosses are the T1D data and the blue dots are 1000 random samples from the non-T1D data. The left and right plots are the real and simulated data respectively. For both the real and simulated data the T1D samples occupy similar parts of the first two principal components space.
The first two principal components for the European T1D and non-T1D data from the UKBB dataset. The red crosses are the T1D data and the blue dots are 1000 random samples from the non-T1D data. (Left) Real data. (Right) Simulated data.
4 Discussion
The overall aim of this work was to calculate GRSs without access to the original genotype data. The quality of the simulated data was compared to real data both via the summary statistics relating to the GRS and the SNP arrays.
The mean, median, and standard deviation of the simulated GRSs are similar to the equivalents for the real data for both the 1000G data and UKBB. In addition, an AUC produced by considering T1D cases in UKBB and comparing to non-T1D controls are similar for simulated and real data. Therefore, summary statistics from the simulated data would produce the similar conclusions as using real genetic data. Given a common use of GRSs is to calculate discrimination between cases and controls via AUC, we have demonstrated that the simulated data would produce similar AUC as the real data.
From the summary statistics produced by the simulated data it could be concluded that the simulation approach produces a GRS similar to that produced on real data. However, it is important to understand if there is variation that is hidden when considering summary statistics, which might result in false final results in other circumstances.
To investigate the simulation quality further, we considered both sub-parts of the GRS and the generated SNP arrays themselves. For the sub-parts of the GRS, we looked at the generation of the HLA, non-HLA, and interaction parts. There was no difference in the mean scores for these three subparts of the GRS between simulated and real data for either the 1000G data or the UKBB data.
We also considered the SNPs that are correlated and the non-correlated SNPs. If the models were not correctly producing the correlations between the SNPs, then the differences might be expected to occur in those SNPs which were correlated with at least one other SNP as these needed to be optimised by the algorithm. However, we see no differences between simulated and real mean scores generated using either correlated or non-correlated SNPs.
The metrics summarising the GRS, even sub-parts of the GRS, might conceal variations in the distributions of the GRSs. These distributions can vary between different runs of the simulation model (see Fig. 4) making it difficult to be confident that the simulated models are producing the same distributions as the real data. None of the variation in distributions is likely to have an effect on any conclusions drawn though, and the real data is also drawn from a much larger real distribution, but we cannot be certain that our simulated model is correctly estimating the entire distribution.
When considering the relationship between real and simulated data at the SNP-level, the frequencies of the SNPs are defined to be correct, and the correlations are optimised by the algorithm. The pattern of SNP values overall might be different though. To investigate these patterns between the SNPs we took the first two principal components of the real and simulated SNP arrays.
When performing PCA for all the 1000G data and comparing the real and simulated first two principal components (see Fig. 6) the patterns produced are similar for the real and simulated SNP arrays. We see the same pattern of separation of super-populations, which means that the simulation is capturing similar distributions of SNPs at the super-population level.
We demonstrated that the level of the population of the SNP summary statistics is particularly important for some of the datasets. In particular for populations like the 1000G AMR super-population with known genetic variation, where the simulated SNP arrays are insufficiently separated if we take frequencies and correlations at the super-population level (see Fig. 7). By simulating the SNP arrays using the summary statistics at the population level we produce distributions across the first two principal components that are more similar to the real SNP arrays.
The simulation model also generates some structures in PCA space which are similar to real data. We showed this in Fig. 8, where the PCA plots of real and simulated data have some structural similarities. A partial cause of this structure was also shown in Fig. 8 where the value of one SNP was driving a lot of the variance in the first principal component.
We also simulated the T1D data at the SNP-array level where the T1D sits in the same part of the PCA space as the real T1D data. There is some variation in the structure in the first two PCs here.
Overall, the simulated GRS method produces data which appears to be similar to the real data. This is the case both if we consider summary statistics (mean, median and standard deviation) and discrimination power of the GRS (AUC between T1D cases and non-T1D controls). We also see no differences in sub-parts of the GRS. At the SNP-array level (considered as the first two principal components) we find similar distributions between real and simulated data.
While we have demonstrated that the model produces simulated data that is similar to the real data, we cannot be sure that there are not failure cases, for some reason where the simulation method would not work. These is some variation in the principal components of the SNP arrays which may be statistical variation or might be real differences in structure. There are also some differences in the summary statistics, which could be for statistical sampling reasons or may be due to genuine variation between real and simulated data.
Another issue is the level of population required for the summary statistics. It is known that the American population has substantial genetic variations, and here we demonstrated that the super-population of the Americans from 1000G is not the correct level to draw summary statistics from. There may be subclasses below that which would be more appropriate.
If there are too many correlated SNPs, or if there are correlated SNPs that have low frequencies, the algorithm can struggle to find an array that fully matches the correlated SNPs. There can also be spurious correlations that can be imposed by the algorithm. These effects are not significant in the datasets we tested the algorithm on.
The approach was designed for a T1D GRSs which has limited numbers of SNPs. The speed and success of this method depends on the number of SNPs involved, the number of correlations between SNPs and the size of the groups of multiple correlated SNPs, in addition to other factors such as the loss tolerance required in predicted correlations. With higher numbers of correlated SNPs the time taken by the algorithm increases. There is additional analysis of the performance of the algorithm with more SNPs on the GitHub repository.
There are several ways that this approach could be used for research. One important application is during collaborations between institutions where GRSs are required on genetic datasets that are only available at one institution. To transfer genetic data from one institution to another is difficult due to legal, ethical and technical constraints; providing SNP frequencies and LD between SNPs bypasses these problems. This approach provides a method to calculate the GRSs on these separated datasets not available to one party. A second application would be to request frequencies and linkage disequilibrium for sets of SNPs from unrelated groups or institutions. The required summary statistics can be calculated using programmes such as PLINK and, for selected sets of SNPs, the computational and time costs of doing so would usually be modest.
Perhaps the most beneficial approach would be for groups to provide the frequencies and linkage disequilibrium for all their genetic data. This would be a more onerous undertaking, especially in terms of computational costs but would provide many research benefits. More modest approaches could be to provide the frequencies for all SNPs, but the LD results for only selected pairs. One selection approach would be to select only those pairs of SNPs above some LD level. An alternative would be to select pairs of SNPs by the importance of a disease or trait; for T1D for example, most of the LD results of interest are in the HLA region, and just those could be reported.
There are other potential use cases such as: enabling students without access to servers or genetic data to be able to conduct research projects, to rapidly calculate GRSs in background populations from the 1000G (and other publicly available datasets) or for use by researchers outside of genetics with no requirement for substantial knowledge about either genetics or particular software such as PLINK.
5 Conclusions
Our approach allows the generation of simulated SNP arrays for GRS calculation as well as the final GRS. The scores calculated from summary statistics show either the same or highly similar structure to real scores generated from genotyped or imputed data. Our method allows comparisons to be made by the calculation of GRSs from summary statistics without the need for access to genotyped arrays.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Hagberg AA , Schult DA, Swart PJ. Exploring network structure, dynamics, and function using networkx. In Varoquaux G, Vaught T, Millman J (eds.), Proceedings of the 7th Python in Science Conference, Pasadena, CA: Texas Advanced Computing Center, 2008, 11–5.
- 2Auton A , Abecasis GR, Altshuler DM. et al A global reference for human genetic variation. Nature 2015;526:68–74.26432245 10.1038/nature 15393 PMC 4750478 · doi ↗ · pubmed ↗
- 3Clayton EW , Evans BJ, Hazel JW et al The law of genetic privacy: applications, implications, and limitations. J Law Biosci 2019;6:1–36.31666963 10.1093/jlb/lsz 007PMC 6813935 · doi ↗ · pubmed ↗
- 4Lewis CM , Vassos E. Polygenic risk scores: from research tools to clinical instruments. Genome Med 2020;12:44.32423490 10.1186/s 13073-020-00742-5PMC 7236300 · doi ↗ · pubmed ↗
- 5Machiela MJ , Chanock SJ. Ldlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics 2015;31:3555–7.26139635 10.1093/bioinformatics/btv 402PMC 4626747 · doi ↗ · pubmed ↗
- 6Montana G. Hapsim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics 2005;21:4309–11.16188927 10.1093/bioinformatics/bti 689 · doi ↗ · pubmed ↗
- 7Oram RA , Patel K, Hill A et al A type 1 diabetes genetic risk score can aid discrimination between type 1 and type 2 diabetes in young adults. Diabetes Care 2016;39:337–44.26577414 10.2337/dc 15-1111 PMC 5642867 · doi ↗ · pubmed ↗
- 8Qin Y , Yalamanchili HK, Qin J et al The current status and challenges in computational analysis of genomic big data. Big Data Res 2015;2:12–8.
