A large dataset curation and benchmark for drug target interaction
Alex Golts, Vadim Ratner, Yoel Shoshan, Moshe Raboh, Sagi Polaczek, Michal Ozery-Flato, Daniel Shats, Liam Hazan, Sivan Ravid, Efrat Hexter

TL;DR
This paper introduces a standardized large dataset for drug target interaction prediction, along with a benchmark protocol, to improve comparability and validity of computational drug discovery research.
Contribution
It presents a comprehensive data curation, standardization, and splitting strategy, along with an evaluation protocol for DTI prediction models.
Findings
The dataset enables consistent benchmarking across studies.
The benchmark protocol improves comparability of results.
Experimental validation confirms the dataset's usefulness.
Abstract
Bioactivity data plays a key role in drug discovery and repurposing. The resource-demanding nature of in vitro and in vivo experiments, as well as the recent advances in data-driven computational biochemistry research, highlight the importance of in silico drug target interaction (DTI) prediction approaches. While numerous large public bioactivity data sources exist, research in the field could benefit from better standardization of existing data resources. At present, different research works that share similar goals are often difficult to compare properly because of different choices of data sources and train/validation/test split strategies. Additionally, many works are based on small data subsets, leading to results and insights of possibly limited validity. In this paper we propose a way to standardize and represent efficiently a very large dataset…
Taxonomy
Topics: Computational Drug Discovery Methods · Cell Image Analysis Techniques · Biomedical Text Mining and Ontologies
