DrivR-Base: a feature extraction toolkit for variant effect prediction model construction

Amy Francis; Colin Campbell; Tom R Gaunt

PMC · DOI: 10.1093/bioinformatics/btae197 · April 11, 2024

Open Access

TL;DR

DrivR-Base is a tool that simplifies the extraction of genomic variant features for predicting disease-related genetic effects.

Contribution

DrivR-Base introduces a reproducible and integrative toolkit for variant feature extraction from diverse genomic data sources.

Findings

01

DrivR-Base extracts features from databases like AlphaFold, ENCODE, and Variant Effect Predictor.

02

The tool is deployable via Docker for consistent use across computational environments.

03

Features generated can be used for pathogenic impact prediction, haploinsufficiency analysis, and drug repurposing.

Abstract

Recent advancements in sequencing technologies have led to the discovery of numerous variants in the human genome. However, understanding their precise roles in diseases remains challenging due to their complex functional mechanisms. Various methodologies have emerged to predict the pathogenic significance of these genetic variants. Typically, these methods employ an integrative approach, leveraging diverse data sources that provide important insights into genomic function. Despite the abundance of publicly available data sources and databases, the process of navigating, extracting, and pre-processing features for machine learning models can be highly challenging and time-consuming. Furthermore, researchers often invest substantial effort in feature extraction, only to later discover that these features lack informativeness. In this article, we introduce DrivR-Base, an innovative…

Tables

Table 1. Amino acid substitution matrices and their sources.

Matrix type	Source
PAM40	Pelé et al. 2012
PAM160	Pelé et al. 2012
PAM250	Pelé et al. 2012
BLOSUM30	Henikoff and Henikoff 1992
BLOSUM45	Henikoff and Henikoff 1992
BLOSUM62	Henikoff and Henikoff 1992
GONNET	Gonnet et al. 1992
JTT	Jones et al. 1992
JTT_TM	Jones et al. 1994
PHAT	Ng et al. 2000

Funding

Cancer Research UK (funder DOI: 10.13039/501100000289)

Taxonomy

Topics: Genomics and Rare Diseases · Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms

Full text

Introduction

The rapid advancement in Next Generation Sequencing technologies has facilitated the extensive identification of variants within the human genome. A significant number of these variants have an unknown functional impact. Among these, many could potentially contribute to disease phenotypes as driver variants, while others are likely to be passively involved and causatively neutral in nature.

In response, a diverse range of machine learning methodologies has been proposed, with the primary objective of integrating genome-level information (features) to identify driver variants. Notable tools in this context include DeepMind’s recent AlphaMissense (Cheng et al. 2023), our FATHMM-MKL (Shihab et al. 2015) and CScape (Rogers et al. 2017) predictors, as well as CADD (Rentzsch et al. 2019), DANN (Quang et al. 2015), PolyPhen-2 (Adzhubei et al. 2013), and EVE (Frazer et al. 2021). While these tools employ diverse methodologies to tackle genomic prediction problems, the datasets, or features, integrated into the models are equally crucial, and the utility of these classifiers relies heavily on the availability of feature data.

To our knowledge, DrivR-Base represents the first tool available to the research community that offers such a comprehensive and extensive compilation of annotations across the entire genome (Wang et al. 2010, McLaren et al. 2016, Liu et al. 2020). With its unique capability to integrate a wide array of detailed features and annotations from numerous databases, DrivR-Base stands out for the breadth and depth of genomic and protein-level information it makes accessible for extraction. Moreover, most modern tools focus on aggregating scores from machine learning models associated with a variant, rather than providing access to the raw annotations themselves (Liu et al. 2020). DrivR-Base, therefore, provides an unprecedented resource that can be applied directly in machine learning models to accelerate the development of variant prediction tools.

To date, numerous features have demonstrated their effectiveness in assessing the likelihood of a variant driving disease. Conservation-based features, such as PhyloP and PhastCons scores (Siepel et al. 2005, Pollard et al. 2009), quantify sequence conservation across species. Studies have suggested that regions with lower conservation tend to be less functionally significant (Woodruff 2001). These features have proven informative in several predictors (Shihab et al. 2015, Rentzsch et al. 2019, Sun and Yu 2019, Cabrera-Alarcon et al. 2022).

Additionally, various other features have played vital roles in driver-variant prediction. For instance, the Variant Effect Predictor (VEP) (McLaren et al. 2016) has been instrumental in developing widely-used prediction tools (Shihab et al. 2015, Rentzsch et al. 2019). VEP provides valuable insights into variant effects on transcripts within protein-coding regions, introns, and regulatory elements. Sequence-based similarity measures, which enable mathematical comparisons of wild-type and mutant string patterns (e.g. spectrum kernels), have also been used in this context, as have regulatory features from ENCODE (Dunham et al. 2012, Quang et al. 2015, Shihab et al. 2015, Rogers et al. 2017, Rentzsch et al. 2019). Information on GC content and CpG islands has likewise proven valuable in these prediction tasks (Shihab et al. 2015, Rogers et al. 2017). Elevated GC content has been associated with increased bendability and the ability to undergo B-Z transitions, which are spatial features linked to open chromatin and active transcription (Vinogradov 2003).

While various feature groups are currently in use, additional molecular datasets could likely offer valuable insights in predicting driver variants. One example is the influence of single nucleotide variants (SNVs) on DNA shape properties. Multiple DNA shape properties have been implicated in DNA–protein interactions (Jones et al. 2003, Rohs et al. 2009, Chiu et al. 2017). Specifically, high electrostatic potentials have been linked to DNA binding sites (Jones et al. 2003, Chiu et al. 2017), and the narrowing of minor grooves has been associated with A-tracts, resulting in bending toward the minor groove (Rohs et al. 2009). As a result, SNVs occurring at these sites may disrupt these interactions and could lead to functional consequences.

Furthermore, other features that have not been extensively explored in this context include structural information sourced from the AlphaFold (Jumper et al. 2021) and PDB (Berman et al. 2000) databases. These databases contain a wealth of information that could prove valuable when assessing whether a genomic variant is likely to lead to disease. Other examples of feature groups that have not been widely employed thus far and are presented in this work include dinucleotide and amino acid properties.

In this paper, we introduce DrivR-Base, a novel repository designed to streamline the data acquisition process for constructing robust predictors of variant driver status. These datasets have broader applications, including the development of haploinsufficiency prediction models (Shihab et al. 2017) and potential adaptation for advancing drug repurposing tools (Irham et al. 2022). We focus on the human genome, providing users with a comprehensive toolkit of scripts, documentation, and links to original sources to build the required feature set. The deployment of bioinformatics tools across varied computational environments often presents a significant challenge due to dependency management and configuration issues. To address this, we have containerized DrivR-Base using Docker, so that researchers can deploy our toolkit without managing individual software dependencies. Further details can be found in the Supplementary section.

Description and implementation

DrivR-Base is a feature extraction toolkit that enables efficient integration of genomic and protein-level annotations for all possible combinations of single nucleotide variants in the GRCh38 build of the human genome (including all four possible nucleotides at a given position). The resulting features have a wide range of applications, including direct integration into machine learning models for variant effect prediction. The output of DrivR-Base is a single file where the variants are represented as rows, with a column dedicated to feature values for each of the attributes described below. The tool is fully containerised for Docker, facilitating straightforward installation and execution. The tool extracts information for ten different feature groups (FG) from human single nucleotide variants, which are mainly extracted from public databases:

Conservation-based features: Conservation-based features encompass several crucial metrics. These include PhyloP and PhastCons (Siepel et al. 2005, Pollard et al. 2009) scores, which assess whether nucleotide substitution rates deviate from the expectations under neutral drift. Each of these scores is obtained using seven different alignment methods. Additionally, our analysis incorporates Umap and Bismap mappability data (Karimzadeh et al. 2018), measured using four different types of species alignment methods. These metrics assess the extent to which a genomic region can be accurately mapped during sequencing, providing insights into the reliability of genomic or epigenomic characteristics. Regions exhibiting lower mappability readings may be more prone to error. To obtain these datasets for the entire genome, we retrieve data from the UCSC genome browser (Kent et al. 2002) and tailor our queries to specific input variants.

Variant Effect Predictor: The features derived from VEP (McLaren et al. 2016) are organized into three main groups. Firstly, we extract all predicted transcript consequences for each variant and encode them using one-hot encoding. The outcome is a file that displays a “1” in the corresponding column of a variant’s row if that transcript consequence is predicted. Next, we retrieve the predicted wild-type and mutant amino acids, presenting the results in two files. The first file follows a BED+2 format, with the final two columns representing the wild-type and mutant amino acids, respectively. For synonymous variants, the amino acids will be the same. Additionally, we generate another file that is one-hot encoded, making it suitable for direct integration into the user’s models. Finally, we extract distances to transcripts. When variants are predicted to affect multiple transcripts, we calculate their mean, maximum, and minimum distances.
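The VEP-derived encoding described above can be illustrated with a minimal sketch. This is not the DrivR-Base implementation: the function names, the variant identifiers, and the input layout (a mapping from variant to its set of predicted consequence terms) are assumptions made purely for illustration.

```python
from statistics import mean


def one_hot_consequences(variant_consequences):
    """One-hot encode consequence terms.

    variant_consequences: dict mapping a variant ID to a set of
    consequence terms (hypothetical input layout).
    Returns (column_names, {variant_id: [0/1, ...]}).
    """
    # Build a stable column order from every term seen in the input.
    columns = sorted({term for terms in variant_consequences.values() for term in terms})
    rows = {
        variant: [1 if term in terms else 0 for term in columns]
        for variant, terms in variant_consequences.items()
    }
    return columns, rows


def summarise_distances(distances):
    """Mean, maximum and minimum distance across affected transcripts."""
    return {"mean": mean(distances), "max": max(distances), "min": min(distances)}


# Hypothetical example input, not real VEP output.
example = {
    "chr1:100:A>G": {"missense_variant"},
    "chr1:200:C>T": {"intron_variant", "upstream_gene_variant"},
}
columns, rows = one_hot_consequences(example)
distance_summary = summarise_distances([120, 4500, 30])
```

Each variant becomes one row of 0/1 values, one per consequence term, and the distance summary collapses a variable number of transcript distances into a fixed-length set of features.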
Dinucleotide properties: This feature dataset is sourced from DiProDB, an extensive database encompassing 125 conformational and thermodynamic dinucleotide properties (Friedel et al. 2009), which provides values for four dinucleotide configurations: (a) the wild-type allele paired with the adjacent allele on the left, (b) the wild-type allele paired with the adjacent allele on the right, (c) the mutant allele paired with the adjacent allele on the left, and (d) the mutant allele paired with the adjacent allele on the right. The resulting table contains columns, each representing one of the 125 different properties. Column names include a prefix specifying which of the four configurations it pertains to. For example, “1_Propeller_Twist” denotes the value for the propeller twist property in the first configuration.

DNA shape properties: Here, we incorporate five DNA shape properties from DNAShapeR (Chiu et al. 2016). DNAShapeR employs a sliding-window approach to calculate minor groove width (MGW), helix twist (HelT), propeller twist (ProT), roll (Roll), and electrostatic potential (EP). In our scripts, we extract DNA shape features within a window extending 10 positions on either side of the variant, but this can be easily adjusted by the user. The output is presented in a table, displaying the value for each DNA shape feature for every calculated base pair, where position 11 corresponds to the variant of interest.

GC content and CpG sites: DrivR-Base also calculates GC content, CpG counts and observed CpG versus expected CpG ratios for nine different window sizes.

Kernel-based sequence similarity: Our approach also employs sequence-based p-spectrum kernels to capture potential disruptions in sequences flanking a single nucleotide variant (Campbell and Ying 2011). Spectrum kernels allow us to assess the composition of k-mers within the genomic regions surrounding a mutation. We explore various window sizes ranging from 2 to 20 and k-mer sizes ranging from 1 to 20.
For each chosen window size (w), we systematically generate all possible combinations of specified k-mer sizes for both wild-type and mutant sequences. We then determine the frequency of occurrence for each k-mer in the respective sequences using the following mapping function:

[eqn]

Here, u represents the sub-string k-mer of length p, $[eqn]$ denotes the wild-type sequence, $[eqn]$ refers to the mutant sequence, and s represents the sequence of interest. We subsequently derive a p-spectrum kernel by summing the products of corresponding row entries for the two sequences:

[eqn]

In this equation, s corresponds to the wild-type sequence, and t corresponds to the mutant sequence. We calculate the diagonals of the p-spectra by summing the squares of corresponding row entries within the mapping function matrix. For a more comprehensive explanation and detailed Python implementation, please refer to our Supplementary material and GitHub Repository.

Amino acid substitution matrices: In this study, we extract amino acid substitution rates from a variety of matrices for non-synonymous variants sourced from the Bio2mds package in R (Pelé et al. 2012). The matrices used and their sources are shown in Table 1.

Amino acid properties: DrivR-Base retrieves 532 amino acid properties for both wild-type and mutant amino acid sequences. These properties were sourced from the AAindex data within the AAsea package in R (Reddy 2019). They encompass information related to factors such as polarity, hydrophobicity, local flexibility, and helix-bend preferences.

ENCODE database features: ENCODE offers a wealth of functional information about the human genome (Dunham et al. 2012). In this work, we extract eight features potentially informative for variant pathogenicity:

Transcription Factor ChIP-seq
Histone ChIP-seq
DNase-seq
Mint-ChIP-seq
ATAC-seq
eCLIP
ChIA-PET
GM DNase-seq

To achieve this, we retrieve all available files for each feature group from ENCODE via the ENCODE API. Subsequently, we download, convert, and consolidate ENCODE peak files into comprehensive data frames for each feature group. These data frames include metadata like accession, target (e.g. transcription factor), biosample (e.g. cell/tissue type), and output type (e.g. narrow peak). Note that this script downloads all ENCODE data locally, requiring approximately 160 GB of space.

Next, we cross-reference feature-specific databases with target SNVs, extracting relevant information overlapping with SNV locations. We then extract crucial data such as signal values, P-values, q-values, and peaks for each variant. For cases with multiple peaks, such as when replicate assays are involved, we also record minimum, maximum, mean, and range values.
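The overlap-and-summarise step for ENCODE peaks can be sketched as follows. This is an illustrative outline only, not the DrivR-Base code: the function names and the peak layout (tuples of start, end, and signal value on a single chromosome, using 0-based half-open BED-style intervals) are assumptions.

```python
from statistics import mean


def overlapping_signals(position, peaks):
    """Signal values of all peaks that contain a variant position.

    position: 0-based coordinate of the variant (assumed convention).
    peaks: iterable of (start, end, signal) tuples with half-open
    [start, end) intervals, all on the same chromosome.
    """
    return [signal for start, end, signal in peaks if start <= position < end]


def summarise_signals(signals):
    """Minimum, maximum, mean and range, or None when nothing overlaps."""
    if not signals:
        return None
    low, high = min(signals), max(signals)
    return {"min": low, "max": high, "mean": mean(signals), "range": high - low}


# Hypothetical peaks, e.g. from replicate assays of one experiment.
PEAKS = [(100, 200, 5.0), (150, 250, 9.0), (300, 400, 2.0)]
signals = overlapping_signals(160, PEAKS)
summary = summarise_signals(signals)
```

The same summary could be applied to P-values or q-values. For genome-scale inputs an interval index would normally replace the linear scan used here.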

AlphaFold structural features: DrivR-Base incorporates structural data from the AlphaFold database (Jumper et al. 2021) and PDB (Berman et al. 2000). Using the VEP query output, we identify genes and protein positions affected by coding variants. Gene names are converted to UniProtKB IDs, and an API retrieves corresponding crystallographic information files (CIF; .cif) from AlphaFold based on the UniProtKB ID. We extract structural information, including X, Y, and Z atom coordinates, isotropic atomic displacement parameters (IADP), and structural conformation types. The output includes two data frames: one containing the first four features (X, Y, Z coordinates, and IADP) for each variant, and another data frame with one-hot-encoded structural conformation types indicating potential effects on amino acids, such as bends or helical structures.
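As a rough illustration of how coordinates and isotropic displacement values might be read from a CIF file, the sketch below parses the `_atom_site` loop of a small inline example. It is a simplified assumption-laden outline rather than the DrivR-Base method: the function names are hypothetical, the example text is invented, and the whitespace-based parsing ignores quoted values, so a dedicated mmCIF parser would be preferable in practice.

```python
def parse_atom_site(cif_text):
    """Return the rows of the _atom_site loop as a list of dicts."""
    columns, atoms = [], []
    in_atom_site = False
    for raw_line in cif_text.splitlines():
        line = raw_line.strip()
        if line.startswith("_atom_site."):
            # Column declaration, e.g. "_atom_site.Cartn_x" -> "Cartn_x".
            columns.append(line.split(".", 1)[1])
            in_atom_site = True
        elif in_atom_site:
            # The loop ends at a blank line, "#", or the next category.
            if not line or line.startswith(("#", "loop_", "_", "data_")):
                break
            atoms.append(dict(zip(columns, line.split())))
    return atoms


def residue_features(atoms, residue_number, atom_name="CA"):
    """X, Y, Z and isotropic B value for one atom of one residue."""
    for atom in atoms:
        if atom["label_seq_id"] == str(residue_number) and atom["label_atom_id"] == atom_name:
            return {
                "x": float(atom["Cartn_x"]),
                "y": float(atom["Cartn_y"]),
                "z": float(atom["Cartn_z"]),
                "b_iso": float(atom["B_iso_or_equiv"]),
            }
    return None


# Invented three-atom example, not taken from a real structure.
EXAMPLE_CIF = """\
data_example
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.B_iso_or_equiv
ATOM 1 N  MET 1 0.0 1.0 2.0 40.5
ATOM 2 CA MET 1 1.5 2.5 3.5 42.0
ATOM 3 CA ALA 2 4.0 5.0 6.0 90.1
#
"""
atoms = parse_atom_site(EXAMPLE_CIF)
features = residue_features(atoms, 2)
```

Selecting the alpha carbon (`CA`) as the representative atom for a residue is an assumption of this sketch; the source does not state which atom or aggregation is used.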

A detailed list of feature groups, their sources, and their implementation can be found in our Supplementary material.
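For readers unfamiliar with the p-spectrum kernel described under kernel-based sequence similarity, the following minimal sketch shows the standard textbook formulation: count every substring of length p in each sequence, then sum the products of matching counts. It is not the authors' implementation (see their GitHub repository for that); the function names and example sequences are hypothetical.

```python
from collections import Counter


def spectrum(sequence, p):
    """Count every contiguous substring (k-mer) of length p."""
    return Counter(sequence[i:i + p] for i in range(len(sequence) - p + 1))


def p_spectrum_kernel(s, t, p):
    """Sum over shared k-mers of count-in-s times count-in-t."""
    counts_s, counts_t = spectrum(s, p), spectrum(t, p)
    # Counter returns 0 for k-mers absent from t, so they add nothing.
    return sum(count * counts_t[kmer] for kmer, count in counts_s.items())


# Hypothetical 5-base window with the variant at the centre (G>C).
wild_type = "ACGTA"
mutant = "ACCTA"

cross_term = p_spectrum_kernel(wild_type, mutant, 2)
# Diagonal entries: each sequence compared with itself (sum of squared counts).
diag_wild = p_spectrum_kernel(wild_type, wild_type, 2)
diag_mutant = p_spectrum_kernel(mutant, mutant, 2)
```

In this example the wild-type and mutant windows share two of their four 2-mers, so the cross term is smaller than either diagonal entry, reflecting the disruption introduced by the substitution.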

Conclusions and future efforts

In summary, DrivR-Base is a versatile cross-database toolkit that consolidates diverse features for human SNVs. These features have various applications, including constructing high-dimensional machine-learning models for predicting variant driver status. As noted above, DrivR-Base can also be applied to predict haploinsufficient genes and to identify functional similarities to known drug targets, potentially aiding drug repurposing efforts. This tool streamlines feature extraction, saving researchers time. Our future goals include expanding the tool’s capabilities to encompass a broader range of mutations, such as indels, deletions, and structural rearrangements, and diversifying the available feature groups for extraction. DrivR-Base is fully containerised for easy deployment using Docker, ensuring a reproducible and streamlined setup process. Detailed instructions for Docker deployment, including pulling the image, running the container, and executing the toolkit, are available in our GitHub documentation at https://github.com/amyfrancis97/DrivR-Base. Researchers are encouraged to contact the authors to discuss the inclusion of additional feature groups in DrivR-Base or the enhancement of existing feature groups.

Supplementary Material

btae197_Supplementary_Data

Bibliography

1. Adzhubei I, Jordan DM, Sunyaev SR. Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet 2013;Chapter 7:Unit 7.20. DOI: 10.1002/0471142905.hg0720s76; PMID: 23315928; PMCID: PMC4480630
2. Berman HM, Westbrook J, Feng Z et al. The Protein Data Bank. Nucleic Acids Res 2000;28:235–42. DOI: 10.1093/nar/28.1.235; PMID: 10592235; PMCID: PMC102472
3. Cabrera-Alarcon JL, Martinez JG, Enríquez JA et al. Variant pathogenic prediction by locus variability: the importance of the current picture of evolution. Eur J Hum Genet 2022;30:555–9. DOI: 10.1038/s41431-021-01034-1; PMID: 35079159; PMCID: PMC9091277
4. Campbell C, Ying Y. Learning with Support Vector Machines. Kentfield, CA 94914, US: Morgan & Claypool Publishers, 2011.
5. Cheng J, Novati G, Pan J et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 2023;381:eadg7492. DOI: 10.1126/science.adg7492; PMID: 37733863
6. Chiu TP, Comoglio F, Zhou T et al. DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding. Bioinformatics 2016;32:1211–3. DOI: 10.1093/bioinformatics/btv735; PMID: 26668005; PMCID: PMC4824130
7. Chiu TP, Rao S, Mann RS et al. Genome-wide prediction of minor-groove electrostatic potential enables biophysical modeling of protein–DNA binding. Nucleic Acids Res 2017;45:12565–76. DOI: 10.1093/nar/gkx915; PMID: 29040720; PMCID: PMC5716191
8. Dunham I, Kundaje A, Aldred SF et al. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74. DOI: 10.1038/nature11247; PMID: 22955616; PMCID: PMC3439153