Investigation of a Data Split Strategy Involving the Time Axis in Adverse Event Prediction Using Machine Learning
Katsuhisa Morita, Tadahaya Mizuno, and Hiroyuki Kusuhara

TL;DR
This study compares time-based and random data splits for machine learning models that predict adverse events. It reports differences in model performance between the two splits, flags a potential confounding issue, and stresses the importance of evaluation methods that match real-world use.
Contribution
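The paper compares time and random data splits for adverse event prediction across nine types of compound information, eight adverse event targets, and six machine learning algorithms, showing how the choice of split affects model performance and where confounding can arise.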
Findings
The random split yields a higher AUC than the time split for most targets.
Chemical space similarity analysis suggests that the applicability domain alone does not explain the performance differences.
Knowledge-based information may introduce confounding in time split evaluations.
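The first finding is stated in terms of AUC, so a small worked example of that metric may help. The sketch below computes ROC AUC from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values (1 = adverse event observed).
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not results from the paper.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so comparing the same model's AUC under the two splits shows how much the evaluation protocol alone changes the apparent performance.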
Abstract
Adverse events are a serious issue in drug development, and many prediction methods using machine learning have been developed. Random split cross-validation is the de facto standard for model building and evaluation in machine learning, but care should be taken in adverse event prediction because this approach does not match the real-world situation. The time split, which uses the time axis, is considered suitable for real-world prediction. However, the differences in model performance obtained using the time and random splits are not clear due to the lack of comparable studies. To understand the differences, we compared the model performance between the time and random splits using nine types of compound information as input, eight adverse events as targets, and six machine learning algorithms. The random split showed higher area under the curve values than did the time…
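The two splitting strategies named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the record fields (`id`, `year`, `label`), the helper names `random_split` and `time_split`, and the cutoff year are all assumptions made for the example.

```python
import random


def random_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out a fraction, ignoring dates entirely."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def time_split(records, cutoff_year):
    """Train on records dated before the cutoff and test on the rest."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


# Hypothetical toy records: compound id, year first reported, adverse event label.
records = [
    {"id": f"cmpd_{i}", "year": 2000 + i, "label": i % 2}
    for i in range(10)
]

rand_train, rand_test = random_split(records, test_fraction=0.2, seed=0)
time_train, time_test = time_split(records, cutoff_year=2008)
```

Under the random split, compounds from later years can land in the training set while earlier ones are held out, which is the mismatch with real-world use that the abstract warns about. The time split rules this out by construction, since every training record predates every test record.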
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Protein Structure and Dynamics
