FPGA acceleration of GWAS permutation testing

Yaniv Swiel; Jean-Tristan Brandenburg; Mahtaab Hayat; Wenlong Carl Chen; Mitchell A Cox; Scott Hazelhurst

PMC · DOI:10.1093/bioadv/vbaf145·June 18, 2025

FPGA acceleration of GWAS permutation testing

Yaniv Swiel, Jean-Tristan Brandenburg, Mahtaab Hayat, Wenlong Carl Chen, Mitchell A Cox, Scott Hazelhurst

PDF

Open Access

TL;DR

This paper introduces an FPGA-based tool that dramatically speeds up GWAS permutation testing, making it feasible for large datasets without needing specialized hardware or expertise.

Contribution

The novel contribution is an FPGA-based cloud-deployable tool for accelerating GWAS permutation testing with significant performance gains.

Findings

01

The FPGA tool completed 1000 maxT permutations in 22 minutes versus 7 days for PLINK on 40 CPU cores.

02

For 100 million adaptive permutations, the FPGA tool required 325 minutes compared to 8.5 days for PLINK.

03

The tool handled 700 million adaptive permutations in 33 hours, a task requiring over a month on CPUs.

Abstract

Summary: Genome-wide association studies (GWASs) analyse genetic variation across many individuals to identify single-nucleotide polymorphisms (SNPs) associated with complex traits. They typically include millions of SNPs from thousands of individuals, creating a multiple testing problem where the probability of false associations increases with the number of SNPs tested. While permutation testing provides accurate control of false positive rates, it is computationally expensive and slow for large datasets. This research presents an FPGA-based tool designed for cloud deployment on AWS EC2 instances that significantly accelerates GWAS permutation testing for continuous phenotypes. The tool implements two algorithms: maxT and adaptive permutation testing. Performance comparisons using a breast cancer dataset (13.7 million SNPs from 3652 individuals) showed large speedups over PLINK…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals1

FPGA

Diseases1

breast cancer

Figures3

Click any figure to enlarge with its caption.

Architecture of an EC2 f1.2xlarge instance.

Plots comparing the run time of the FPGA-based maxT algorithm to PLINK showing a significant performance advantage: (A) run time of dataset 1, (B) run time of dataset 2, and (C) speed-up of FPGA solution.

Plots comparing the run time of the FPGA-based adaptive algorithm to PLINK. Run time of (A) dataset 1 and (B) dataset 2; speed-up of (C) dataset 1 and (D) dataset 2.

Tables1

Table 1.. VU9P resource utilization percentage of the FPGA design.a

	Total	SLR 0	SLR 1	SLR 2
CLBs	52	61	70	25
CLB LUTs as Logic	28	37	34	12
CLB LUTs as memory	8	9	10	5
CLB Registers/Flip-flops	22	26	30	10
BRAM	20	30	22	8
UltraRAM	31	20	53	20
DSPs	15	12	23	12

Equations5

Funding1

—National Institutes of Health Common Fund/National Human Genome Research Institute

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAlgorithms and Data Compression · VLSI and Analog Circuit Testing · Software Testing and Debugging Techniques

Full text

1 Introduction

Low-cost, high-density genotyping enables collection of large datasets that provide a representative sample of the genetic variation. A genome-wide association study (GWAS) attempts to associate single-nucleotide polymorphisms (SNPs) with a phenotype using genome-wide data sampled from many individuals. Once a GWAS identifies phenotype-associated SNPs, the SNPs can be studied to gain insight into the genetic mechanisms underpinning complex phenotypes. GWASs have been widely applied to the association of SNPs with complex human diseases and other traits (Visscher et al. 2017). Identifying disease-associated variants aids understanding the genetic basis of disease, allowing for improvements in treatment, detection, and susceptibility prediction.

To capture a large amount of genetic variation and increase the chance of detecting associated SNPs, modern GWASs include millions of SNPs sampled from many individuals. GWASs typically examine the independent association of each SNP with the phenotype, performing statistical hypothesis tests on each (by computing a test statistic for each hypothesis test) under the null hypothesis that no association exists. Rejection of the null hypothesis implies that some form of association exists between a SNP and the phenotype. Including many SNPs in a GWAS gives rise to the multiple testing problem, where the cumulative chance of false rejections of the null hypothesis (or false associations) increases with the number of hypothesis tests. GWASs compensate for multiple hypothesis testing by controlling the false positive rate (or the Type 1 error rate), but there is a critical trade-off between suppressing false associations while maximising the number of valid associations.

There are a number of methods for controlling false positives when multiple testing (Dudoit and van der Laan 2008), but many of these assume the theoretical null distribution of the test statistics, which can result in ineffective control of the Type 1 error rate. Permutation testing (PT) is a straightforward and accurate method of minimizing the Type 1 error rate, but is very computationally expensive as the test statistics must be computed thousands to millions of times depending on the version of PT used. Though a number of tools [e.g. PLINK (Chang et al. 2015), GEMMA, FaST-LMM] perform GWAS analyses on large datasets, there is a need for a GWAS PT accelerator.

FPGAs (Field-Programmable Gate Arrays) are reconfigurable circuits that provide a high level of parallelization that can be harnessed to accelerate GWAS PT; although their widespread uptake has been inhibited by their high cost, the availability of cloud-based FPGA platforms providing FPGAs at <US$1.65 per hour (2023) has made powerful FPGAs accessible to bioinformatics labs, and their difficulty to program is now the chief barrier.

FPGAs have been used to accelerate a number of computationally expensive bioinformatics algorithms including read mapping and phylogenetic tree construction. We are not aware of previous work with FPGAs for PT in GWASs, though there is work in genome-wide association interaction studies (GWAIS): Gundlach et al. (2016) present an FPGA-based PT accelerator for GWAIS running on a cluster of 128 low-cost FPGAs. Unfortunately, the purchase and setup of an FPGA cluster is not a viable solution for a typical bioinformatics lab.

Our paper presents a FPGA-based tool to accelerate GWAS permutation testing for continuous phenotypes. An AWS EC2 image is available, which can be used to analyse standard PLINK-format files.

Approach: We present an FPGA-based tool for the acceleration of GWAS PT. There is a hardware component (the FPGA design) and a C++ program running on a host computer, which manages the FPGA execution using OpenCL (an open-source heterogeneous computing framework) and performs the complex branching logic that is difficult to effectively implement in hardware. The tool accelerates two commonly used PT algorithms, which are implemented in the PLINK software toolset (Chang et al. 2015): maxT permutation testing (Westfall and Young 1993) and adaptive permutation testing (Che et al. 2014). The tool requires PLINK-format as input. To make the tool accessible to users without access to FPGAs, it is designed to run on an Amazon Web Services (AWS) Elastic Cloud Compute (EC2) FPGA instance. The tool can be run from the command-line without any knowledge of FPGAs. The only costs are Amazon’s charges for storage and hourly use of EC2 instances.

Two datasets were used for testing and validation of the FPGA-based accelerator, comparing to PLINK. Dataset 1 is a simulated dataset of 1 898 052 SNPs sampled from 2018 individuals. The construction of the synthetic dataset is discussed in Section S1 of the supplementary data at Bioinformatics Advances online. Dataset 2 is a real, imputed dataset of 13 742 528 SNPs sampled from 3652 individuals (2572 histologically confirmed female African breast cancer cases recruited by the Johannesburg Cancer Study and 1080 matched population-based controls recruited by the AWI-Gen study). Genotyping was done using the Illumina H3Africa Custom Array containing 2.2 million SNPs. Imputation was done through the Sanger Imputation Server using the African Resources Panel. As the accelerator is designed to handle continuous phenotypes, the case-control cancer phenotypes were linearized using a generalized linear model.

2 Methods

2.1 Association testing

Linear regression is often used to perform association testing for continuous phenotypes and the F-statistic is used to determine the statistical significance of each association test by quantifying the goodness of fit of the linear regression model. For a genotype–phenotype sample of n individuals, the F-statistic is defined as

[eqn]

where $[eqn]$ is Pearson’s correlation coefficient

[eqn]

where Y is an $[eqn]$ vector of continuous phenotypes and X is an $[eqn]$ vector of SNP alternate allele counts ( $[eqn]$ ).

2.2 Permutation testing

Including millions of SNPs in a GWAS increases the chance of detecting associated SNPs, but makes false associations (i.e. Type 1 errors) effectively inevitable when using the traditional P-value cut-off $[eqn]$ . Adjusting $[eqn]$ (the Type 1 error rate) to control for multiple hypothesis testing is therefore critical factor in obtaining valid results from a GWAS.

Knowledge of the test statistics’ null distribution in a multiple testing scenario would allow more effective control of the Type 1 error rate. In practice, however, the true null distribution is unknown. PT controls the Type 1 error rate by simulating the true null distribution of the test statistics using the observed data (Che et al. 2014). As PT makes no assumptions about the theoretical null distribution of the test statistics, it can be more effective at controlling the Type 1 error rate than the traditional multiple testing corrections (e.g. Bonferroni, Benjamini and Hochberg).

Randomly permuting phenotypes among the individuals in a GWAS destroys the relationship between genotype and phenotype and represents a sampling under the null hypothesis. A PT-adjusted P-value is obtained by ranking the observed test statistic among the permuted test statistics. Permuted P-values are defined as $[eqn]$ , where x is the rank of the observed test statistic and b is the number of permutations (Zhang et al. 2012).

Generating PT-adjusted P-values with enough resolution to compare against the standard GWAS significance threshold of $[eqn]$ needs at least $[eqn]$ permutations for each SNP. Consequently, a number of methods have been developed to reduce the computational expense of permutation testing, with two of the more popular methods described here.

maxT Permutation Testing: The maxT permutation procedure (Westfall and Young 1993) strongly controls the family-wise error rate (FWER)—the probability of at least one Type 1 error—by generating a null distribution using the maximum test statistic from each permutation iteration. A PT-adjusted P-value can then be calculated by ranking the observed test statistic against the collection of maximum test statistics. As the maxT method simulates a null distribution using the maximum test statistic of each iteration, much fewer permutations are required to generate a representative null distribution. In fact, at $[eqn]$ , 1000 permutations are typically sufficient, independent of the number of SNPs (Goeman and Solari 2014).

Adaptive Permutation Testing: Adaptive permutation (Che et al. 2014) identifies and stops permuting insignificant SNPs early in the permutation process so that more accurate P-values can be calculated for interesting SNPs. Adaptive permutation sets a limit R on the number of times the permuted test statistics can exceed the observed statistic. Permuted test statistics are computed until either R test statistics are greater than the one observed (requiring B permutations) or the maximum number of permutations b have been performed. A P-value is then estimated as

[eqn]

Although the maxT and adaptive permutation algorithms alleviate some of the computational burden of GWAS PT, hardware acceleration can change the scale of the problem that PT can address.

2.3 Architecture of an AWS EC2 FPGA instance

FPGAs are reconfigurable circuits that can implement arbitrary digital circuits. In contrast to CPUs and GPUs, which have fixed architectures and instruction sets, FPGAs do not have a defined architecture and must be configured to perform a specific function. The adaptability of FPGAs can be exploited to design highly optimized architectures, and FPGAs can be significantly faster and more power efficient than sequentially operating, general-purpose hardware such as CPUs, even though FPGAs are typically clocked an order of magnitude lower. FPGAs are often used as hardware accelerators in heterogeneous computing platforms consisting of a host processor and one or more hardware accelerators. A heterogeneous architecture allows the host processor to perform complex conditional branching logic (difficult to effectively implement in FPGA hardware), while computationally intensive tasks are offloaded to an optimized hardware accelerator.

AWS EC2 FPGA instances consist of a Xilinx Virtex UltraScale+ VU9P FPGA connected to a host CPU as illustrated by Fig. 1. The host CPU uses OpenCL to communicate with the FPGA via a PCIe interface. OpenCL hides calls to Xilinx runtime (XRT)—a set of communication drivers developed and maintained by Xilinx for Linux.

Xilinx provides a pre-configured target platform for AWS FPGA instances, which comprises the FPGA logic for the VU9P FPGA together with external DDR memory accessible by both the host processor and the FPGA (i.e. global memory). The FPGA logic consists of a static region (which contains I/O, status monitoring, and lifecycle management logic) as well as a configurable region for user-developed kernels. The maximum possible frequency of the FPGA design is 250 MHz.

2.4 FPGA-based PT on a heterogeneous platform

PT for one SNP requires the repeated computation of the test statistic F with each permutation of the phenotype vector Y. When Y is permuted, the denominator of $[eqn]$ remains constant while the numerator changes with each permutation. This suggests that the computation of the numerator of $[eqn]$ is the critical part of the permutation testing algorithm.

The numerator of $[eqn]$ —the dot product of an $[eqn]$ vector of phenotypes ( $[eqn]$ ) and an $[eqn]$ vector of genotypes ( $[eqn]$ )—is defined as:

[eqn]

The genotype vector can be expressed as $[eqn]$ , where, if the additive genotypic model is used, $[eqn]$ is a vector of SNP alternate allele counts for each individual ( $[eqn]$ ). If the GWAS includes c covariates for each individual, the phenotype vector can be expressed as $[eqn]$ , where $[eqn]$ is the real-valued phenotype vector, and C is an $[eqn]$ matrix of covariates with a prepended column of ones. If covariates are not included, $[eqn]$ .

PT for m SNPs sampled from n individuals requires the computation of the dot product between the permuted phenotype vector $[eqn]$ and each genotype vector $[eqn]$ . The core part of the PT algorithm is a matrix vector multiplication (MVM) between an $[eqn]$ matrix of genotypes and the $[eqn]$ phenotype vector, resulting in an $[eqn]$ vector of dot products. As MVM is the most computationally expensive part of PT, the most effective way of accelerating PT on a heterogeneous FPGA-CPU platform is to use the parallelism of the FPGA to accelerate the MVM, while using the CPU to perform the less computationally intensive (although still demanding) tasks such as data preprocessing, management of the FPGA execution, and implementation of the PT algorithms.

2.5 FPGA design

The primary focus of the FPGA design process was the MVM kernel that provides an acceptable level of accuracy while minimizing latency and maximizing throughput to effectively accelerate the MVM calculation. The fact that GWASs aim to identify potentially associated SNPs rather than confirming hypotheses about specific SNPs means than an approximately correct PT P-value is typically sufficient for estimating statistical significance. Furthermore, as PT P-values are not calculated directly from a test statistic but, rather, from the comparison between test statistics, the computation of extremely high-precision dot products was not required.

The input to the MVM kernel is the genotype data and the phenotype vector. The output is a vector of dot products to be used by the host application to compute the permuted test statistic for each SNP. The genotype data for any reasonably sized GWAS is too large to store in the VU9P FPGA’s on-chip RAM (36.1 Mb distributed RAM + 75.9 Mb block RAM + 270 Mb UltraRAM $[eqn]$ 48MB on-chip RAM), so it is copied to global memory and streamed to the FPGA for each permutation. The transfer of the genotype data is therefore a major performance bottleneck.

To maximize the throughput of the FPGA, the design minimizes the performance impact of transferring data via global memory by optimizing the data types used within the MVM kernel (Sections 2.5.1 and 2.5.2); optimizes global memory read/write performance (Section 2.5.3); takes advantage of the inherent parallelism provided by the FPGA hardware (Section 2.5.4); and exploits the architecture of the AWS FPGA instance (Section 2.5.5). The FPGA design was also optimized to ensure that it can operate at the maximum possible frequency of the VU9P FPGA on EC2 F1 instance (250 MHz) while simultaneously minimizing the hardware resource utilization of the VU9P FGPA. Table 1 indicates that the VU9P FPGA provides ample resources for the design.

2.5.1 Compression of the genotype data

The genotype data is much larger than the phenotype vector and the result vector, so the global memory bandwidth is primarily used to transmit the genotype data from host memory to global memory. A major advantage of FPGAs over CPUs and GPUs is their ability to natively operate on arbitrary precision data types. This allows FPGAs to take advantage of the compressed nature of SNP genotype data, which can be one of four values (0, 1, 2, or missing). Genotype data can, therefore, be represented with a 2-bit data type, and performing the $[eqn]$ calculation within the MVM kernel (rather than the host application) can significantly reduce the impact of memory bandwidth on the accelerator’s efficiency. As the genotype data does not change throughout permutation testing, $[eqn]$ is only calculated once for each SNP and, to ensure that the MVM kernel does not have to wait to receive an entire genotype vector before initiating a dot product computation, $[eqn]$ is calculated by the host application and transferred to the kernel along with the genotype and phenotype data.

2.5.2 Fixed-point data types

The MVM kernel uses fixed-point types rather than floating-point types to represent fractional data as fixed-point operations use fewer FPGA hardware resources, have a lower memory footprint, and have lower latency than equivalent floating-point operations. Fixed-point designs also consume less power than equivalent floating-point designs.

2.5.3 Data buffering

The phenotype vector is buffered within the kernel’s internal memory so it can be reused for each dot product computation. Not buffering would require the vector to be streamed into the kernel with each SNP vector. This would significantly impact performance as the global memory bandwidth would be split transmitting both the phenotype and genotype data.

A disadvantage is that buffering the phenotype data within the kernel fixes an upper bound on the size of the phenotype buffer when the FPGA design is compiled, limiting the sample size of the datasets that can be analysed. Kernels can be compiled with a large phenotype buffer to accommodate very large sample sizes—on our implementation using the VU9P, the buffer supports samples of $[eqn]$ individuals. To support larger samples, the host application could be adapted to split the phenotype vector and calculate partial dot products using the MVM kernel (not currently implemented). NB: there is no limit on the number of SNPs.

2.5.4 Loop optimization

The MVM kernel uses the full width of the global memory interface (512 bits) to reduce the number of global memory accesses during computation, simultaneously increasing kernel parallelism. The following loop optimizations result in a kernel initiation interval of one clock cycle (i.e. the MVM kernel can process a new 512-bit block of genotype data each clock cycle).

Loop unrolling supports multiple loop iterations in parallel by replicating the loop hardware several times.Loop pipelining is used to overlap loop iterations, so a new iteration can be started before the previous iteration completes, allowing the hardware to be maximally utilized.Array partitioning is used to allow the phenotype buffer to be accessed in parallel by splitting the array across different RAM resources.

The MVM kernel loops over the rows of the genotype matrix (which store the sampled data of a single SNP) and processes each row in blocks of 512-bits or 256 SNP values. The outer loop, which loops over the rows of the genotype matrix, is not pipelined so as to reduce data dependencies and limit the hardware utilization of the FPGA design. The middle loop, which loops over a row of the genotype matrix in 512-bit blocks, is pipelined to overlap processing of each 512-bit block. The innermost loop, which performs the multiply-accumulate operations, is fully unrolled to parallelize the processing of each 512-bit block of genotype data.

2.5.5 Topological optimizations

Multiple MVM kernels are instantiated on the FPGA to allow the accelerator to process different blocks of data in parallel. It was found that four MVM kernels maximized the global memory bandwidth, and additional kernels did not result in an increase in performance.

The global memory of the AWS FPGA instance is split into four 16 GB banks, so each memory bank is connected to one of the four MVM kernels so as to maximize the system memory bandwidth. All data transferred to a kernel via global memory is transferred using the same memory bank.

2.6 Host application design

The C++ host application is optimized to read PLINK data from input files, converting to the format expected by the MVM kernels; implement data blocking and buffer management; execute the FPGA kernels with OpenCL; implement the permutation testing algorithms; and record the results.

2.6.1 Data preprocessing

GWAS data is read from industry standard PLINK-formatted files and preprocessed before being sent to the FPGA kernels.

A Python 3 script performs covariate/mean adjustment on the phenotype data using the NumPy, pandas, and statsmodels packages. The script reads phenotypes from the PLINK fam file and writes adjusted phenotypes (either $[eqn]$ or $[eqn]$ , depending on whether a covariate data file is provided) to a file which is read by the host application. The host application converts the phenotypes to the fixed-point format expected by the MVM kernel.

An OpenMP-accelerated (v3.1) algorithm was developed to accelerate genotype preprocessing as it was found to influence the overall run time of the PT process, particularly when the number of SNPs is large and the number of permutations is low ( $[eqn]$ ). The algorithm uses a nested loop pair to process a PLINK bed file. The inner loop iterates over each byte of a SNP data block and converts the PLINK-encoded genotype data to the 0, 1, and 2 additive genotype format. The inner loop also sums the converted SNP values and the number of non-missing genotypes to calculate the mean of each SNP vector. The outer loop (parallelized with OpenMP) iterates over each SNP in the bed file, converts the double-valued mean of each SNP vector to the fixed-point format expected by the MVM kernel, and calculates the standard deviation of each SNP vector.

2.6.2 Data blocking

The host application takes advantage of the parallelism of the MVM accelerator by blocking the GWAS dataset so that different blocks of data can be processed concurrently by each of the four MVM kernels. Global memory bandwidth is dependent on the size of the data buffer being transferred, so the host application blocks the genotype data based on the optimal buffer size, which determines the number of kernel invocations as well as the number of dot products computed by each MVM kernel i.e.

[eqn]

2.6.3 Kernel execution with OpenCL

The host application uses OpenCL (v1.2) API calls to execute the MVM kernels instantiated on the FPGA and to manage data transfer via global memory. OpenCL API calls are used to initialize the kernels, transfer data to/from the kernels, execute kernel tasks, and synchronize kernel execution based on OpenCL events. All OpenCL API calls used to control kernel execution are asynchronous (i.e. they immediately return after the command has been queued on the OpenCL work queue), so events are used to resolve dependencies between kernel commands. Event-based synchronization allows the host application to perform other work while OpenCL manages the work queue.

2.6.4 Permutation testing algorithms

The two PT algorithms were designed with a view towards minimizing the overall run time by minimizing the idle time of the FPGA kernels. The host application uses multithreading to allow a new permutation to be started while the results of previous permutations are being processed.

The adaptive PT algorithm is more complex than the maxT algorithm as SNPs are periodically removed from the procedure when they do not appear to be significant, which requires the FPGA data buffers to be regenerated. Unlike the maxT algorithm, which has a run time that is affected only by the number of permutations, the adaptive algorithm has a number of parameters, along with the number of permutations, which affect the run time, namely the SNP dropping interval, the rate at which the dropping interval increases, and the minimum number of permutations for each SNP.

The minimum permutations per SNP (M-P-S) is a user-defined parameter. The SNP dropping interval is initially equal to the M-P-S and increases by 20% every 5 drops. At each SNP dropping interval, the host application regenerates the data buffers to remove the SNPs, which have been flagged as insignificant and re-runs the data blocking algorithm on the regenerated data.

Once enough SNPs have been dropped so that the SNP data can fit into one buffer, the host application transitions to performing one permutation per MVM kernel. During this phase of the algorithm, the host application adaptively sets the size of the genotype data buffers to ensure that the accelerator throughput is maximized. The host application continues to drop SNPs and regenerate the data buffers, but the dropping interval increases more rapidly as only significant SNPs remain and fewer SNPs are removed at each interval.

3 Results

The speed and accuracy of the FPGA-accelerated permutation algorithms were compared to PLINK (Chang et al. 2015)—a popular, highly optimized C/C++ command line tool that implements both maxT and adaptive permutation testing—using two test datasets. PLINK was used as a performance reference as its extensive algorithmic optimizations make it a state-of-the-art tool for running GWAS permutation testing. The FPGA-accelerated permutation algorithms were tested on an AWS EC2 f1.2xlarge instance and PLINK testing was done on a computer cluster with two Intel Xeon Silver 4114 CPUs (each consisting of 20 2.2 GHz hyper-threaded cores) per node.

The accuracy of the FPGA accelerator was compared to PLINK by comparing the permuted P-values to determine whether the same significant SNPs are identified with the same accuracy. As PT is an inherently random process, some variation in the results is expected, but the P-values of the most significant SNPs should be very similar. To simulate a realistic GWAS PT scenario, the P-value comparison was conducted with maxT P-values generated with 1000 permutations and adaptive P-values generated with at least 20 million permutations.

These tests were performed without covariates (i.e. the phenotypes were pre-adjusted with covariates) as if a covariate file is included PLINK recalculates the phenotypes for each permutation, which significantly reduces the speed of the PT procedure. The FPGA-based algorithms perform the covariate adjustment once at the start of the algorithm, so the inclusion of covariates can be expected to have a minimal effect on the run time of the FPGA-based algorithms.

3.1 maxT permutation testing

3.1.1 Speed

The run time of the FPGA-based maxT algorithm was compared to that of PLINK running on the cluster. Discussion of experimental choices can be found in Section S2 of the supplementary data at Bioinformatics Advances online. The effect of large-scale parallelization of PLINK’s maxT implementation was determined by splitting the PLINK maxT algorithm over multiple nodes of the CPU cluster by triggering an instance of PLINK on each of the nodes (where each PLINK instance is required to perform $[eqn]$ maxT permutations) and combining the results (resulting in parallelization over 40, 200, or 400 CPU cores). The multi-node PLINK tests were only run with the smaller, simulated dataset (Dataset 1); however, it was not possible to reserve enough time on the cluster to process the real, fully imputed dataset (Dataset 2).

Figure 2A compares the run time of the FPGA-accelerated maxT algorithm to PLINK for varying numbers of maxT permutations. The FPGA-based maxT algorithm is at least two orders of magnitude faster than PLINK running on 40 CPU cores for any number of permutations. Figure 2B shows that the run time of the FPGA-based maxT algorithm is more than an order of magnitude faster than highly parallelized PLINK.

Architecture of an EC2 f1.2xlarge instance.

For 1000 maxT permutations, which provides effective control of the Type 1 error rate for any number of SNPs, the FPGA-based maxT algorithm is 346 times faster than 40-core PLINK for Dataset 1 and 450 times faster than 40-core PLINK for Dataset 2 (see Fig. 2C), suggesting that the FPGA-based maxT accelerator scales well for larger datasets. Based on the December 2023 hourly cost of an AWS f1.2xlarge instance (USD1.65 for an on-demand instance and USD0.6 for a spot instance), we estimate that the FPGA-based solution costs between USD0.4–1.1 while an AWS CPU-based solution would cost between USD80 and 145 depending on the instance type used and spot pricing available.

3.1.2 Accuracy

The accuracy of the FPGA-based maxT algorithm was determined using the significant SNPs identified by PLINK as a reference. Using an arbitrary P-value cutoff of $[eqn]$ , PLINK and our tool identified exactly the same 8 significant SNPs from Dataset 1 and 30 significant SNPs from Dataset 2, with no false positives by the FPGA-based maxT algorithm.

3.2 Adaptive permutation testing

3.2.1 Speed

Data dependencies of the adaptive permutation algorithm prevent parallelization of PLINK on multiple computers, so the run time of the FPGA-based adaptive permutation algorithm was compared to the run time of PLINK running on one 40-core computer. To determine the impact of the M-P-S parameter on run time, we tested Dataset 1 with two M-P-S values (121 and 36) following Che et al. (2014) and Dataset 2 with an M-P-S of 121 due to time constraints.

Figure 3A and B compare the run time of the FPGA-accelerated adaptive permutation algorithm to PLINK by plotting the respective run times against the maximum permutations per SNP. The FPGA-accelerated algorithm is at least one order of magnitude faster than PLINK for any number of adaptive permutations.

Plots comparing the run time of the FPGA-based maxT algorithm to PLINK showing a significant performance advantage: (A) run time of dataset 1, (B) run time of dataset 2, and (C) speed-up of FPGA solution.

Plots comparing the run time of the FPGA-based adaptive algorithm to PLINK. Run time of (A) dataset 1 and (B) dataset 2; speed-up of (C) dataset 1 and (D) dataset 2.

Figure 3C and D plot the FPGA speed-up against the number of permutations for the two test datasets. The speed-up achieved for the adaptive permutation algorithm is lower than the speedup achieved for the maxT algorithm. This is caused by two aspects of the adaptive permutation algorithm:

The algorithm is much more CPU-dependent than the maxT method—the host application must periodically identify and remove insignificant SNPs.In the later stages of the algorithm, the FPGA accelerator is not operating at peak efficiency due to the use of a suboptimal buffer size.

In spite of this, the FPGA-accelerated adaptive PT algorithm is significantly faster than PLINK for any number of permutations.

Figure 3C and D also show that the FPGA speedup decreases as the number of adaptive permutations increases—the sub-optimal buffer size when most of the SNPs have been dropped reduces the FPGA accelerator’s efficiency. It still retains a speed advantage over PLINK—particularly in the case of Dataset 2. For 700 million adaptive permutations of Dataset 2 (an almost unfeasible workload for PLINK running on a 40-core CPU), the run time of the FPGA-accelerated algorithm was 33 h—much less than PLINK’s run time of $[eqn]$ 8.5 days for 100 million permutations of the same dataset. For 100 million adaptive permutations of Dataset 2, we estimate the AWS cost to be between USD4 and 10 for the FPGA-based solution while the CPU-based solution would cost between USD100 and 175.

Figure 3C further shows that although the M-P-S has an effect on FPGA speedup when few permutations are performed, this effect becomes negligible as the number of permutations increases. M-P-S only has a significant effect on the run time when many SNPs are undergoing permutation.

3.2.2 Accuracy

The accuracy of the FPGA-based adaptive algorithm was determined using the significant SNPs identified by PLINK as a reference. Using an arbitrary P-value cutoff of $[eqn]$ , PLINK identified 8 significant SNPs after 20 million permutations of Dataset 1 and 50 significant SNPs after 100 million permutations of Dataset 2. The FPGA-based algorithm identified the same significant SNPs as PLINK, although one false positive occurred for Dataset 1. This can be explained by the random nature of the adaptive PT algorithm (an RNG is used to permute the phenotypes) and the fact that more permutations were performed for Dataset 2, so the P-values calculated for Dataset 2 are more accurate than those generated for Dataset 1.

3.2.3 Comparison to GPUs

GPUs are an obvious alternate method of improving the performance of parallel workloads. PBOOST (Yang et al. 2015) is an example of a GPU-based PT accelerator for GWAIS, but it only supports categorical phenotypes and cannot directly handle confounding variables.

Recently, John et al. (2022) published the method permGWAS with a GPU implementation of linear mixed model permutation testing. Unfortunately, a direct comparison is not easy because of different features of the tools. Nevertheless, a rudimentary comparison is informative. permGWAS supports SNP data stored in the PLINK binary format and requires imputed data as it tolerates no missing data. Preliminary experimentation using an Amazon p3.2xlarge instance for permGWAS and an f1.2xlarge instance for our tool on a randomly generated dataset with no missing data of 2000 individuals and 500 000 SNPs showed that our tool was substantially faster and also scaled better (7 s versus 412 s for 10 permutations; 9 s versus 1572 s for 100 permutations; permGWAS could not run with 200 permutations). This preliminary experimentation may indicate that much greater parallelism can be exposed on an FPGA architecture.

3.2.4 Additional experimentation

Additional experimentation can be found in Section S3 of the supplementary data at Bioinformatics Advances online.

4 Discussion and conclusion

This work presents an FPGA-based tool that accelerates PT compared to running parallelized CPU implementations by at least an order of magnitude, implementing both adaptive PT and maxT PT. The tool supports standard PLINK files—documentation on how to use the tool is provided at github.com/witseie/fpgaperm. Alternatively, users with a compatible FPGA may download the code and compile themselves.

As the accelerator is designed to run on an AWS EC2 f1.2xlarge instance consisting of a Xilinx Virtex UltraScale+ VU9P FPGA together with an 8-core Intel Xeon host CPU, the PT algorithm was split into a hardware component and a software component. Due to the fact that MVM is the most computationally expensive part of the PT algorithm, the FPGA accelerates the MVM computation, while the host CPU handles preprocessing, managing the FPGA execution and data transfers to/from the FPGA using the OpenCL framework, and implementing the PT algorithms.

PLINK, a popular CPU-based tool that implements both maxT and adaptive permutation testing, was used as a reference with which to compare the speed and accuracy of the FPGA-based algorithms. For a real GWAS dataset of 13.7 million SNPs sampled from 3652 individuals, the run time of PLINK running on 40 Intel Xeon 4114 CPU cores was measured at 164 h (or approximately 7 days) for 1000 maxT permutations and 204 h (or around 8.5 days) for 100 million adaptive permutations. In contrast, the run time of the FPGA-based accelerator is 22 min for 1000 maxT permutations and 325 min for 100 million adaptive permutations resulting in speedups of 447 and 38, respectively.

FPGA acceleration enables the handling of workloads that are almost computationally infeasible for state-of-the-art CPU-based methods. This proves that FPGAs can be effectively employed as bioinformatics accelerators in order to handle the large sample sizes of modern biological datasets, thereby opening up new possibilities for bioinformatics research.

While this work has demonstrated that FPGAs can effectively accelerate GWAS permutation testing for continuous phenotypes using simple linear regression, work can be done both to improve the functionality of the accelerator and to improve the performance.

The accelerator cannot currently handle sample sizes of more than 262 144 individuals due to the fixed phenotype buffer size. A useful optimization would be to determine the largest phenotype buffer that does not affect the operating frequency. The accelerator could also be updated to support the analysis of case/control phenotypes, which would require the development of a novel FPGA architecture as case/control phenotypes are analysed using logistic regression (or Fisher’s exact test) rather than linear regression.

Further work could be done to improve the efficacy of the FPGA-based adaptive permutation testing algorithm, but this will probably require a redesign of the FPGA part of the accelerator to process multiple permutations in a single invocation of a kernel.

Further comparison to GPU-based solutions is warranted. While largely out of scope of this work, a rudimentary comparison between our tool and an existing GPU-based tool (permGWAS) indicates that the GPU-based implementation is significantly slower. Nevertheless, a robust comparison between GPU and FPGA-based acceleration of GWAS permutation testing should be undertaken to determine whether comparable speedup can be achieved with GPU acceleration. Additional work could include the analysis of the combined effect of FPGA and GPU acceleration.

FPGA acceleration of linear mixed model GWAS permutation testing should be investigated. LMM regression is used to account for genetically related individuals and the fact that LMM regression is significantly slower than simple linear regression means that there is a need for an LMM permutation testing accelerator.

Supplementary Material

vbaf145_Supplementary_Data

Bibliography10

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Chang CC , Chow CC, Tellier LC et al Second-generation PLINK: rising to the challenge of larger and richer datasets. Giga Science 2015;4:7.25722852 10.1186/s 13742-015-0047-8PMC 4342193 · doi ↗ · pubmed ↗
2Che R , Jack JR, Motsinger-Reif AA et al An adaptive permutation approach for genome-wide association study: evaluation and recommendations for use. Bio Data Mining 2014;7:9–13.24976866 10.1186/1756-0381-7-9PMC 4070098 · doi ↗ · pubmed ↗
3Dudoit S , van der Laan MJ. Multiple Testing Procedures with Applications to Genomics. Springer Series in Statistics. New York: Springer, 2008.
4Goeman JJ , Solari A. Multiple hypothesis testing in genomics. Statistics in Medicine 2014;33:1946–78.24399688 10.1002/sim.6082 · doi ↗ · pubmed ↗
5Gundlach S , Kässens JC, Wienbrandt L. Genome-wide association interaction studies with MB-MDR AND maxt multiple testing correction on FPG As. Procedia Comput Sci 2016;80:639–49. 10.1016/j.procs.2016.05.354 · doi ↗
6John M , Ankenbrand MJ, Artmann C, et al Efficient permutation-based genome-wide association studies for normal and skewed phenotypic distributions. Bioinformatics 2022; 38: ii 5–ii 12.36124808 10.1093/bioinformatics/btac 455PMC 9486594 · doi ↗ · pubmed ↗
7Visscher PM , Wray NR, Zhang Q et al 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet 2017;101:5–22.28686856 10.1016/j.ajhg.2017.06.005PMC 5501872 · doi ↗ · pubmed ↗
8Westfall PH , Young SS. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York: John Wiley and Sons, Ltd, 1993.