Decoding of Inconsistent Biological Data: A Critical Step toward Enhanced AI Predictivity in Drug Discovery
Mira A. M. Behnam, Andrea Cavalli, Diana Lousa, Cláudio M. Soares, Christian D. Klein

TL;DR
This Viewpoint argues that pooling bioactivity data for the same target from different sources adds considerable noise to ML training sets, largely because of overlooked differences in assay protocols, and discusses ways to address the problem.
Contribution
The paper examines how changes to an assay protocol, specifically buffer composition and experimental setup, affect bioactivity data, and discusses strategies for handling such inconsistencies in the case of enzyme inhibitors/binders.
Findings
Changes in buffer composition and experimental setup can introduce noise in ML training data.
Enzymes and viral surface proteins undergo conformational changes driven by extrinsic factors, so results measured under different assay conditions may not be directly comparable as training data.
LLMs and agentic AI may offer new ways to enhance drug discovery efforts.
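One way to picture the first finding is a simple consistency check run before pooling data. The sketch below is an illustration only and not a method from the paper: the record fields (compound, target, ic50_nm, buffer), the function names, the 0.5 log-unit threshold, and the sample values are all hypothetical. It flags compound/target pairs whose reported potencies disagree across entries measured in different buffers.

```python
import math
from collections import defaultdict


def pic50(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 (-log10 of the molar IC50)."""
    return 9.0 - math.log10(ic50_nm)


def flag_inconsistent(records, max_spread=0.5):
    """Flag (compound, target) pairs measured under more than one buffer
    whose pIC50 values span more than max_spread log units.

    max_spread is an assumed, illustrative threshold.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["compound"], rec["target"])].append(rec)

    flagged = {}
    for key, rows in groups.items():
        values = [pic50(r["ic50_nm"]) for r in rows]
        buffers = {r["buffer"] for r in rows}
        spread = max(values) - min(values)
        if len(buffers) > 1 and spread > max_spread:
            flagged[key] = round(spread, 2)
    return flagged


# Invented example values, for illustration only (not data from the paper).
records = [
    {"compound": "cmpd-1", "target": "enzyme-X", "ic50_nm": 100.0, "buffer": "buffer A"},
    {"compound": "cmpd-1", "target": "enzyme-X", "ic50_nm": 10000.0, "buffer": "buffer B"},
    {"compound": "cmpd-2", "target": "enzyme-X", "ic50_nm": 50.0, "buffer": "buffer A"},
    {"compound": "cmpd-2", "target": "enzyme-X", "ic50_nm": 60.0, "buffer": "buffer B"},
]

flagged = flag_inconsistent(records)
print(flagged)
```

With these invented values, only cmpd-1 is flagged: its two entries differ by two log units, while the cmpd-2 entries agree closely despite the different buffers. A flagged pair would not necessarily be discarded; it could instead be kept with the assay conditions attached as an explicit annotation.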
Abstract
Combining bioactivity data of assays against the same target, which are obtained from different sources, was recently shown to lead to considerable noise for training data sets of machine learning (ML) models. In this Viewpoint, we address the profound impact originating from often overlooked changes to an assay protocol relating to the buffer composition and experimental setup. We cover two examples of protein targets that undergo conformational changes driven by extrinsic factors: enzymes as catalytically active proteins, and viral surface proteins as structural targets. We discuss strategies to tackle this challenge for the case of enzyme inhibitors/binders, the utility of models based on deep learning (DL), and current limitations of computational studies assessing protein–ligand interactions. In an interview with an expert in the field of large language models (LLMs) and agentic…
Taxonomy
Topics: Computational Drug Discovery Methods · Vaccines and Immunoinformatics Approaches · Machine Learning in Bioinformatics
