Predicting Drug Responses by Propagating Interactions through Text-Enhanced Drug-Gene Networks

Shiyin Wang

arXiv:1906.08089·cs.SI·June 20, 2019


TL;DR

This paper introduces a method that combines biological literature and experimental data to construct a drug-gene interaction network, enabling explainable predictions of drug responses with high accuracy.

Contribution

It presents a novel approach integrating text-mined interactions and experimental data to predict drug responses in a transparent manner.

Findings

01

Achieved 94.74% accuracy in binary drug sensitivity prediction.

02

Developed a white-box model for explainable drug response prediction.

03

Constructed a drug-gene network from literature and experimental data.

Abstract

Personalized drug response has drawn public attention in recent years. Combining gene test results with drug sensitivity records is regarded as essential for real-world implementation. Research articles are good sources for training machines to predict, infer, and reason. In this project, we combine patterns mined from biological research articles with categorical data to construct a drug-gene interaction network. We then use cell line experimental records on gene expression and drug sensitivity to estimate the edge embeddings in the network. Our model provides white-box, explainable predictions of drug response based on gene records and achieves 94.74% accuracy on a binary drug sensitivity prediction task.

Tables

Table 1. Datasets summary

Dataset	# Gene	# Drug	# Relations	Format
DGIdb	36815	9370	42727	Categorical
PubTator	36815	9370	42727	Descriptive
PubMed*	2575	6199	10530	MetaPattern
RNA-seq (Barretina et al., 2012)	58035	-	-	CL Records
GDSC (Yang et al., 2013)	-	16	-	CL Records
Skeleton	335	587	3321	-

* Meta-patterns extracted from a subset of the abstracts on PubMed.

Table 2. Datasets

Dataset	URL
DGIdb	http://www.dgidb.org
PubTator	https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
PubMed	https://www.ncbi.nlm.nih.gov/pubmed
RNA-seq (Barretina et al., 2012)	https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2770/Results
GDSC (Yang et al., 2013)	https://www.cancerrxgene.org

Table 3. Model Performance

Method	Logistic Reg.	SVM	Ours
Accuracy	93.75%	90.14%	94.74%
Explainable	✗	✗	✓

Equations

\hat{v}_{i} = \frac{\lambda_{i}}{|N(e_{i})|} \sum_{e_{j} \in N(e_{i})} R_{e_{i},e_{j}} v_{j} + \lambda_{i} b_{i}

u_{i} = S(v_{i})

L_{gene} = \frac{1}{N_{gene}} \sum_{i \in Gene} (\hat{u}_{i} - u_{i})^{2}

L_{chem} = \frac{1}{N_{gene}} \sum_{i \in Chemical} (\hat{u}_{i} - u_{i})^{2}

L = L_{gene} + \beta L_{chem}


Taxonomy

Topics: Computational Drug Discovery Methods · Biomedical Text Mining and Ontologies · Machine Learning in Bioinformatics

Full text

Predicting Drug Responses by Propagating Interactions through Text-Enhanced Drug-Gene Networks

Shiyin Wang

Massachusetts Institute of Technology

[email protected]

Abstract.

Personalized drug response has drawn public attention in recent years. Combining gene test results with drug sensitivity records is regarded as essential for real-world implementation. Research articles are good sources for training machines to predict, infer, and reason. In this project, we combine patterns mined from biological research articles with categorical data to construct a drug-gene interaction network. We then use cell line experimental records on gene expression and drug sensitivity to estimate the edge embeddings in the network. Our model provides white-box, explainable predictions of drug response based on gene records and achieves 94.74% accuracy on a binary drug sensitivity prediction task.

Keywords: network science, data mining, drug response

CCS concepts: Information systems; Information systems → Retrieval tasks and goals; Information systems → Information extraction

1. Introduction

Precision medicine depends on discovering the correlations and causalities between observed data (genes, past clinical records, etc.) and unobserved future performance (side effects, drug effects, immune response, etc.). Categorical relations have been collected from experiments, while more complex meta-pattern information is mined from scientific publications. To bridge the gap between these two types of data sources, we propose a relation embedding method that expands the representation space of the relationships between genes and drug responses. To the best of our knowledge, this is the first approach to incorporate text patterns into an existing network. Nodes in the network represent drug responses (sensitivities, side effects, etc.) and genes.

The model pipeline consists of two parts. First, we construct a relation network of genes and drug responses from existing formatted data sources. At the same time, we automatically mine scientific publications, retrieving meta-patterns that indicate the relations between genes and drug responses. Second, a novel method based on graph convolutional neural networks is applied to balance the information collected from the two types of data sources.

2. Related Works

Drug discovery plays an essential role in the improvement of human life (Drews, 2000), as pharmacology has become a well-defined and respected scientific discipline. As early as the 20th century, researchers across distinct disciplines, including analytical chemistry and biochemistry, demonstrated the value of drug response analysis. Even then, researchers were already interested in the variation of personal drug response. However, because of limitations in data collection and analysis models, precision medicine did not attract widespread attention until recent decades (Mirnezami et al., 2012). In recent years, integrating big data, including clinical data, genetic data, genomic data, and intervention history, has been expected to facilitate the characterization of side effects and the prediction of drug resistance.

Results from network science are applied to gene expression profiles by integrating personalized data with predictive records through similarity analysis, which reveals hidden correlations of unobserved patterns (Menche et al., 2017; Yang et al., 2018). Intuitively speaking, people who share similar genetic patterns and clinical treatments tend to show similar drug responses. Network-based approaches reveal potential drug-biomarker correlations effectively for some diseases. However, current methods of this kind are limited in how they model correlations in the network, and the data they access is restricted to structured data.

Data mining communities have been trying to discover, identify, structure, and summarize relationships between biological entities by applying various text mining techniques to open-access biological research publications (Sigdel et al., 2019; Wang et al., 2018c; Shang et al., 2018; Wang et al., 2018a). One drawback of this approach is that the accuracy of the results is not high enough to play a decisive role in the decision process. Instead, the results are provided to human experts as an additional reference to speed up knowledge discovery. On the other hand, recurrent neural networks have shown their power in dealing with text data (Lipton et al., 2015), demonstrating considerable success in numerous tasks such as image captioning (Devlin et al., 2015; You et al., 2016; Rennie et al., 2017) and machine translation (Cho et al., 2014; Bahdanau et al., 2014; Luong et al., 2015). This technique can be applied to convert text meta-patterns into trainable embeddings that represent the relations among entities.

3. Model

In this section, we give definitions and formalize the problem. The whole project consists of two parts: constructing the network skeleton and inferring the edge embeddings.

In the first part, we first collect all the entities (Definition 1) from multiple data sources. Then we mine the categorical and descriptive relationships (Definition 2). Patterns (Definition 3) are extracted based on the discovered entities and relationships. After that, meta-patterns are extracted from the descriptive patterns as a high-level representation of the relations between entities. Regarding entities as nodes in the network, we add an edge between two nodes if they have either a structured or a descriptive relationship.

Definition 1 (Entity).

Entities, denoted by $e$ , refer to the names of chemicals, genes, or diseases which can be indexed by a MESH id or Entrez id. For example, SAH is a disease entity with MESH id D013345.

Definition 2 (Relation).

Relationships, denoted by $r$ , represent the latent interactions between two entities $e_{i}$ , $e_{j}$ . Relations can be either categorical (such as “blocker”, “binder”) or descriptive (such as “With the increase of $e_{i}$ , there is a significant drop of $e_{j}$ expression level”).

Definition 3 (Pattern).

A pattern, denoted by $(r,e_{i},e_{j})$ , is a tuple of one relationship and two entities. Patterns are categorical or descriptive according to the type of the corresponding relationship. Generally, patterns are obtained by directly parsing datasets. Meta-patterns are simplified versions of descriptive patterns.
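As a rough illustration of how patterns could be turned into a network skeleton, the sketch below treats every pattern $(r,e_{i},e_{j})$ as evidence for an edge between the two entities. The function name `build_skeleton`, the toy entity names, and the choice of an undirected graph are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def build_skeleton(patterns):
    """Build a network skeleton from (relation, e_i, e_j) pattern tuples.

    Assumption: edges are undirected, and an edge exists as soon as one
    categorical or descriptive pattern links the two entities.
    """
    neighbors = defaultdict(set)   # entity -> adjacent entities
    relations = defaultdict(set)   # unordered entity pair -> relation labels
    for relation, e_i, e_j in patterns:
        if e_i == e_j:
            continue  # ignore self-loops in this sketch
        neighbors[e_i].add(e_j)
        neighbors[e_j].add(e_i)
        relations[frozenset((e_i, e_j))].add(relation)
    return dict(neighbors), dict(relations)


# Hypothetical toy patterns: two categorical, one descriptive meta-pattern.
patterns = [
    ("inhibitor", "drug_A", "gene_X"),
    ("agonist", "drug_B", "gene_X"),
    ("$CHEMICAL decreases $GENE expression", "drug_A", "gene_X"),
]
neighbors, relations = build_skeleton(patterns)
```

Keeping the relation labels per edge is what later allows a categorical and a descriptive relation on the same pair of entities to share one edge representation.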

In the second part, we estimate the representations of nodes and edges (Definition 4). The estimated representation of $v_{i}$ is computed by applying the corresponding edge representations to the neighboring entities and averaging the results, as in Equation 1.

Definition 4 (Edge Representation and Node Representation).

Edge representation, denoted as $R_{e_{i},e_{j}}\in\mathcal{M}$ , is the quantified version of a relationship, defined over a function family $\mathcal{M}$ . In this project, because of computational limitations, we set $\mathcal{M}=\mathcal{R}^{k\times k}$ , the set of matrices of size $k\times k$ . Each node is then represented by a vector $v_{i}\in\mathcal{R}^{k}$ , a bias $b_{i}$ , and a scale $\lambda_{i}$ . The estimated value of $v_{i}$ is given by:

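$$\hat{v}_{i}=\frac{\lambda_{i}}{|N(e_{i})|}\sum_{e_{j}\in N(e_{i})}R_{e_{i},e_{j}}v_{j}+\lambda_{i}b_{i}\qquad(1)$$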
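The neighbor-averaging estimate in Definition 4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the bias $b_{i}$ is taken to be a vector in $\mathcal{R}^{k}$ and the scale $\lambda_{i}$ a scalar, neither of which the text specifies, and the helper names and toy values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a k x k matrix (list of rows) by a length-k vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def estimate_node(i, neighbors, R, v, b, lam):
    """Estimate v_i as lam_i / |N(e_i)| * sum_j R[(i, j)] v_j + lam_i * b_i.

    Assumes node i has at least one neighbor, that b[i] is a length-k
    vector, and that lam[i] is a scalar.
    """
    nbrs = neighbors[i]
    total = [0.0] * len(v[i])
    for j in nbrs:
        message = matvec(R[(i, j)], v[j])           # R_{e_i, e_j} v_j
        total = [t + m for t, m in zip(total, message)]
    scale = lam[i] / len(nbrs)
    return [scale * t + lam[i] * b_k for t, b_k in zip(total, b[i])]


# Hypothetical toy network with k = 2: gene "g" linked to drugs "d1" and "d2".
neighbors = {"g": ["d1", "d2"]}
R = {("g", "d1"): [[1.0, 0.0], [0.0, 1.0]],
     ("g", "d2"): [[0.0, 1.0], [1.0, 0.0]]}
v = {"g": [0.0, 0.0], "d1": [1.0, 2.0], "d2": [3.0, 4.0]}
b = {"g": [1.0, -1.0]}
lam = {"g": 0.5}
v_hat_g = estimate_node("g", neighbors, R, v, b, lam)
```

Because each prediction is a scaled average of per-edge linear maps, the contribution of every neighbor to a node's estimate can be read off directly, which is in the spirit of the white-box claim made in the paper.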

Having built the network skeleton, we then need to learn the representation $R_{e_{i},e_{j}}$ , $\lambda_{i}$ , $b_{i}$ , $v_{i}$ of edges and nodes.

In the datasets, researchers record TPM (transcripts per million) to quantify genes and use intensity to measure drug sensitivity. To convert representations into a numeric format, we deploy a fully connected neural network with one hidden layer of size 2 and sigmoid activation.

$$u_{i}=S(v_{i})$$

We define the loss function as:

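$$L_{gene}=\frac{1}{N_{gene}}\sum_{i\in Gene}(\hat{u}_{i}-u_{i})^{2}$$

$$L_{chem}=\frac{1}{N_{gene}}\sum_{i\in Chemical}(\hat{u}_{i}-u_{i})^{2}$$

$$L=L_{gene}+\beta L_{chem}$$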
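A minimal sketch of the readout network $S$ and the combined loss is given below. It assumes that $S$ maps a length-$k$ vector to a single scalar through a sigmoid hidden layer of size 2 followed by a linear output unit; the output layer is not described in the text, so that part is a guess. The loss follows the equations as printed on this page, where both terms are divided by $N_{gene}$. All parameter names and toy numbers are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def readout(v, W1, c1, w2, c2):
    """S(v): one hidden layer (size = len(W1)) with sigmoid, linear output.

    The linear scalar output is an assumption of this sketch.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, v)) + c)
              for row, c in zip(W1, c1)]
    return sum(w * h for w, h in zip(w2, hidden)) + c2


def total_loss(gene_pred, gene_obs, chem_pred, chem_obs, beta):
    """L = L_gene + beta * L_chem, both normalized by N_gene as printed."""
    n_gene = len(gene_obs)
    l_gene = sum((p - o) ** 2 for p, o in zip(gene_pred, gene_obs)) / n_gene
    l_chem = sum((p - o) ** 2 for p, o in zip(chem_pred, chem_obs)) / n_gene
    return l_gene + beta * l_chem


# Hypothetical toy values: k = 4, hidden size 2, all-zero first layer.
u = readout([0.0] * 4, W1=[[0.0] * 4, [0.0] * 4], c1=[0.0, 0.0],
            w2=[1.0, 1.0], c2=0.0)
loss = total_loss([1.0, 0.0], [0.0, 0.0], [2.0], [0.0], beta=0.5)
```

The weight $\beta$ trades off how strongly the drug sensitivity records pull on the shared representations relative to the gene expression records.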

4. Implementation

4.1. Collecting Structured Data

Drug-gene interactions have been well studied in prior biological research and collected in resources such as CancerCommons, CF:Biomarkers, CGI, NCI, DrugBank, PharmGKB, and DoCM. SIDER (sideeffects.embl.de) provides adverse drug reactions (ADRs) representing drug responses; it contains 1430 drugs, 5880 ADRs, and 140,064 drug-ADR pairs. The Cancer Therapeutics Response Portal (CTRP; portals.broadinstitute.org/ctrp/) links genetic, lineage, and other cellular features of cancer cell lines to small-molecule sensitivity with the goal of accelerating the discovery of patient-matched cancer therapeutics. In this paper, we use the drug-gene interaction database (DGIdb)(Rees et al., 2016; Basu et al., 2013; Seashore-Ludlow et al., 2015), which contains drug-gene interactions and gene druggability information from 30 different sources. The count-plot figure shows the frequency of drug-gene interaction types in DGIdb, which follows a long-tailed distribution. The inhibitor type occurs 9982 times in DGIdb, which is 40.79% of all interaction records. The second most frequent interaction type is agonist, which occurs 5333 times and accounts for 21.79%. There are 30 types of relationships. We assume that the structured relations found are correct and precise.

4.2. Mining Descriptive Relationship

PubMed abstracts are a good source of descriptive patterns. We investigated a subset of PubMed data annotated by PubTator (Wei et al., 2013) and selected all the sentences that contain two annotated chemicals or genes. The resulting file is approximately 2 gigabytes, and the original long sentences are too complex to feed into the model directly, for two reasons. First, if we use sentences to model the relations directly, many patterns occur only a few times. Second, we do not have enough memory to process such large graphs. Therefore, we further reduce the descriptive patterns to short representations, which we call meta-patterns.

4.3. Extract Meta-Patterns

For decades, biological researchers have described their findings on interactions between chemicals, drugs, genes, etc. in natural language. This abundant biological research literature is a good source from which to extract information using information retrieval methods (Shang et al., 2017; Jiang et al., 2017; Shang et al., 2016). We collected 3614796 abstracts from PubMed and used the PubTator dataset to discover all the named entities among chemicals, diseases, and genes. A meta-pattern (Wang et al., 2018b; Jiang et al., 2017) is defined as a frequent, informative, and precise subsequence pattern in a certain context. The first step of meta-pattern discovery is context-aware segmentation, which estimates the contextual boundaries of meta-patterns. We use the method developed by Wang et al. (2018b), which extracts relationships from a subset of PubMed abstracts (Davis et al., 2017) with semi-supervision from the CTD database (http://ctdbase.org). For example, the meta-patterns in a sentence from a PubMed abstract are shown in Figure 1.

4.4. Linking Categorical Relations and Descriptive Relations

One of the challenges in this project is entity linking across multiple datasets, which refers to the alignment of entity mentions with their entity ids. This process is tedious because the index formats are inconsistent. For example, chemicals (drugs) have CID, PMID, ChEMBL, CAS Number, etc. We use the entity name as a validation criterion. Although the same entity may have different names in different data sources, those names are similar with high probability. Therefore, we can measure the correctness of our entity linking algorithm by comparing the merged names. All the data sources used in this project are summarized in Tables 1 and 2.
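One way to picture the name-based validation is to score the string similarity of the names that were merged under one id and flag the pairs that disagree. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the similarity measure, the 0.8 threshold, and the example ids and names are assumptions for illustration, since the paper does not say how names are compared.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two entity names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def suspicious_links(linked_names, threshold=0.8):
    """Return the ids whose merged names look too different.

    linked_names maps an entity id to the pair of names found for it in
    two data sources. The threshold is an arbitrary illustrative value.
    """
    return [entity_id
            for entity_id, (name_a, name_b) in linked_names.items()
            if name_similarity(name_a, name_b) < threshold]


# Hypothetical linking output: id1 is consistent, id2 is probably a bad merge.
linked = {
    "id1": ("Imatinib", "imatinib "),
    "id2": ("Imatinib", "Erlotinib"),
}
flagged = suspicious_links(linked)
```

A plain character-level ratio would also flag legitimate synonyms such as a drug and its salt form, so in practice the flagged ids would need manual review rather than automatic rejection.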

4.5. Inference Individual Drug Responses

To determine the parameters of the model, we acquire cell line experiment records of the expression of 58035 genes on 934 human cancer cell lines (Barretina et al., 2012) and 3723759 drug sensitivity test records on 1065 cell lines (Yang et al., 2013). We assume that the same cell line behaves similarly in the two data sources, so we can align them by cell line name. The data summary is shown in Table 1.
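Aligning the two sources by cell line name amounts to an inner join on a normalized name. The following sketch shows one possible form of that join; the normalization rule (uppercase, alphanumeric characters only) and the toy records are assumptions, not something the paper describes.

```python
def normalize(name):
    """Canonical key for a cell line name (assumed rule: uppercase alphanumerics)."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


def align_by_cell_line(expression, sensitivity):
    """Inner-join two {cell line name: record} mappings on the normalized name."""
    expr_by_key = {normalize(name): rec for name, rec in expression.items()}
    aligned = {}
    for name, rec in sensitivity.items():
        key = normalize(name)
        if key in expr_by_key:
            aligned[key] = (expr_by_key[key], rec)
    return aligned


# Hypothetical records: expression in TPM, sensitivity as a binary label.
expression = {"MCF-7": {"TP53": 12.5}, "A549": {"TP53": 3.1}}
sensitivity = {"mcf7": {"drug_A": 1}, "HeLa": {"drug_A": 0}}
aligned = align_by_cell_line(expression, sensitivity)
```

Only cell lines present in both sources survive the join, which is one reason the number of usable records ends up much smaller than either source on its own.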

Because we have to use experimental data to train the representations, we need to shrink the graph by deleting useless nodes. We keep the observable entities (red) and the nodes within distance 1 or 2 of them. The results are shown in Figures 3(a), 3(b), 3(c), and 3(d).

We set the embedding dimension to 4. The learning rate of the gene optimizer is 0.1 so that it can converge quickly in 1. The learning rate of the chemical optimizer is 0.001.

We ran experiments on the network of Figure 3(b). The binary classification accuracy is 91.15%. Because network data is not structured, it is hard to draw an ROC curve or other training visualizations.
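The graph-shrinking step can be read as a breadth-first search that keeps every node within a fixed number of hops of an observable entity. The sketch below is one such reading; the function name, the adjacency-dictionary input, and the toy chain graph are illustrative assumptions.

```python
from collections import deque


def shrink_graph(neighbors, observable, max_distance=2):
    """Keep only nodes within max_distance hops of any observable entity.

    neighbors maps each node to the set of its adjacent nodes.
    """
    distance = {node: 0 for node in observable if node in neighbors}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue  # do not expand beyond the distance limit
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    kept = set(distance)
    return {node: {m for m in nbrs if m in kept}
            for node, nbrs in neighbors.items() if node in kept}


# Hypothetical chain a - b - c - d - e where only "a" is observed.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
         "d": {"c", "e"}, "e": {"d"}}
small = shrink_graph(chain, observable={"a"}, max_distance=2)
```

Setting `max_distance` to 1 or 2 yields differently sized skeletons, which may correspond to the different panels of Figure 3.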

5. Evaluation

We test our model on the 539 extracted drug-gene records linked by cell line. The data contain test records for 7 drugs and 21 genes, and about 42.12% of the cells are missing. This scarcity of data limits the performance of complex models. We compare against Logistic Regression and a Support Vector Machine. For Logistic Regression we use the sklearn package with the L1 penalty and a maximum of 10000 iterations. For the SVM we use a radial basis function kernel with auto gamma. Our method outperforms both baselines (Table 3).

Our model is explainable because it is based on representations of categorical and descriptive relations. Human experts can guide the model by changing the network skeleton. It can therefore potentially avoid the risks of black-box decisions and provide an interpretation of its predictions.

6. Future Work

Collect More Records

The experiment records are not adequate to test our model at scale. Patient profiles would be a good data source for our model, but due to privacy restrictions we do not currently have access to them. Applying our model to clinical data could also help provide precise and interpretable patient profiles.

Attention Mechanism

Because gene-drug records are limited, we do not apply a sophisticated design in this model. Intuitively, drug responses would be modeled better if the weights of different relations were allowed to change according to their current values.

7. Conclusion

Research articles are excellent sources for machines to predict, infer, and reason. By aggregating multiple data sources, we study and analyze the association between drugs and genes by constructing networks with categorical and descriptive relations. The resulting network skeleton presents the relationships between entities. We model the relations as kernel functions between entities. We then use cell line experiment records to estimate the parameters of our model, which achieves 94.74% accuracy in predicting drug sensitivity for cell lines.

Acknowledgements.

We thank Dr. Qi Li for support with the meta-pattern results on PubTator.

Bibliography

Jiang et al. (2017) Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M Kaplan, Timothy P Hanratty, and Jiawei Han. 2017. MetaPAD - Meta Pattern Discovery from Massive Text Corpora. CoRR cs.CL (2017).
Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
Barretina et al. (2012) Jordi Barretina, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, Adam A Margolin, Sungjoon Kim, Christopher J Wilson, Joseph Lehár, Gregory V Kryukov, Dmitriy Sonkin, Anupama Reddy, Manway Liu, Lauren Murray, Michael F Berger, John E Monahan, Paula Morais, Jodi Meltzer, Adam Korejwa, Judit Jané-Valbuena, Felipa A Mapa, Joseph Thibault, Eva Bric-Furlong, Pichai Raman, Aaron Shipway, Ingo H Engels, Jill Cheng, Guoying K Yu, Jianjun Yu, Peter Aspesi, Melanie d
Basu et al. (2013) A. Basu, N. E. Bodycombe, J. H. Cheah, E. V. Price, K. Liu, G. I. Schaefer, R. Y. Ebright, M. L. Stewart, D. Ito, S. Wang, A. L. Bracha, T. Liefeld, M. Wawer, J. C. Gilbert, A. J. Wilson, N. Stransky, G. V. Kryukov, V. Dancik, J. Barretina, L. A. Garraway, C. S. Hon, B. Munoz, J. A. Bittker, B. R. Stockwell, D. Khabele, A. M. Stern, P. A. Clemons, A. F. Shamji, and S. L. Schreiber. 2013. An interactive resource to identify cancer genetic and lineage dependencies targe
Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
Davis et al. (2017) Allan Peter Davis, Cynthia J Grondin, Robin J Johnson, Daniela Sciaky, Benjamin L King, Roy McMorran, Jolene Wiegers, Thomas C Wiegers, and Carolyn J Mattingly. 2017. The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Research 45, D1 (Jan. 2017), D972–D978.
Devlin et al. (2015) Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809 (2015).