Neural scaling laws for phenotypic drug discovery
Drew Linsley, John Griffin, Jason Parker Brown, Adam N Roose, Michael Frank, Peter Linsley, Steven Finkbeiner, Jeremy Linsley

TL;DR
This paper explores how scaling up neural networks and data affects performance in phenotypic drug discovery, introducing a novel pretraining task that improves scalability and accuracy on drug development benchmarks.
Contribution
The study introduces the Inverse Biological Process pretraining task, which enables neural networks to better scale and perform on drug discovery tasks compared to traditional supervised methods.
Findings
DNNs do not improve with scale when trained directly on drug discovery tasks.
Pretraining with IBP significantly enhances DNN performance.
Performance of IBP-trained DNNs improves monotonically with data and model size.
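The third finding can be made concrete with a small sketch of how a scaling trend is typically quantified: check that error falls monotonically with dataset size, then fit a power law in log-log space. This is an illustration only. The power-law form, the helper names `fit_power_law` and `is_monotone_improving`, and all numbers are assumptions made for the example; they are not the authors' method or results, and the summary above does not say which functional form the paper fits.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit errors ~= a * sizes**(-b) by least squares in log-log space.

    Hypothetical helper for illustration; returns (a, b).
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # A power law is a straight line in log-log coordinates:
    # log(e) = log(a) - b * log(n)
    slope, intercept = np.polyfit(log_n, log_e, deg=1)
    return float(np.exp(intercept)), float(-slope)


def is_monotone_improving(errors):
    """True if each successive error is strictly lower than the last."""
    return bool(np.all(np.diff(np.asarray(errors, dtype=float)) < 0))


# Synthetic, made-up data: NOT taken from the paper.
sizes = np.array([1e3, 1e4, 1e5, 1e6])   # assumed training-set sizes
a_true, b_true = 2.0, 0.25               # assumed ground-truth parameters
errors = a_true * sizes ** (-b_true)     # noiseless power-law errors

a_hat, b_hat = fit_power_law(sizes, errors)
monotone = is_monotone_improving(errors)

print(f"fitted a={a_hat:.3f}, b={b_hat:.3f}, monotone={monotone}")
```

The synthetic sizes span three orders of magnitude. A fit over a narrower range can still return a clean exponent, but extrapolating from it is correspondingly less trustworthy.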
Abstract
Recent breakthroughs by deep neural networks (DNNs) in natural language processing (NLP) and computer vision have been driven by a scale-up of models and data rather than the discovery of novel computing paradigms. Here, we investigate whether scale can have a similar impact for models designed to aid small molecule drug discovery. We address this question through a large-scale and systematic analysis of how DNN size, data diet, and learning routines interact to impact accuracy on our Phenotypic Chemistry Arena (Pheno-CA) benchmark: a diverse set of drug development tasks posed on image-based high content screening data. Surprisingly, we find that DNNs explicitly supervised to solve tasks in the Pheno-CA do not continuously improve as their data and model size are scaled up. To address this issue, we introduce a novel precursor task, the Inverse Biological Process (IBP), which is designed to…
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The findings of this paper have an important implication for the field of drug discovery. To achieve better accuracy on each drug discovery task, estimating the required data quantity based on the scaling laws is critical. The neural scaling laws that have been extensively explored in the NLP domain may not be directly transferable to the biological domain. This paper provides a valuable starting point for the exploration of scaling laws in drug discovery and other biological domains.
- The experimental rationales are not clearly specified. This is especially problematic for readers with light domain knowledge, who may be confused about why the authors used certain experimental settings. For example, it is not clear why out-of-distribution samples are only used for the IBP-trained DNN model.
- The authors do not provide insights into why their IBP-trained DNN model exhibits linear scaling laws, while vanilla DNNs do not. This is a significant finding, and it would be helpful
This is an early view of an important dataset. The pretraining task seems reasonable. The originality is limited as this is mostly an application of reasonable ideas on an existing dataset, and the implementation is relatively poor. Unfortunately the authors try to oversell their conclusions and extrapolate based on data that spans less than an order of magnitude.
I was a physicist in a past life, so perhaps I am more sensitive than other audiences, because in my opinion scaling laws that extrapolate asymptotically require several orders of magnitude of variation of the inputs (think 3--5 log units, possibly more) to be convincing. This paper hardly scratches the surface in this respect and is clearly far from the lofty goals discussed in the intro. Overall, the presentation is convoluted at times and makes unreasonable baseline assumptions: was the
The work is very well-motivated and timely, the introduction of the IBP task is interesting, and thinking along the lines of neural scaling for iHCS data is a promising research direction.
The work lacks basic baselines, including evidence that would support the major claim of the unique effectiveness of the IBP pre-training task. The "neural scaling" results are not clearly presented.
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Genetics, Bioinformatics, and Biomedical Research
