# Decoding of Inconsistent Biological Data: A Critical Step toward Enhanced AI Predictivity in Drug Discovery

**Authors:** Mira A. M. Behnam, Andrea Cavalli, Diana Lousa, Cláudio M. Soares, Christian D. Klein

PMC · DOI: 10.1021/acsptsci.5c00677 · 2025-12-14

## TL;DR

Pooling bioactivity data for the same target from different sources adds considerable noise to ML training sets. This Viewpoint traces that noise to often-overlooked differences in assay buffer composition and experimental setup, and discusses ways to address it.

## Contribution

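The paper illustrates the impact of assay protocol changes using two classes of protein targets whose conformation depends on extrinsic factors (enzymes and viral surface proteins), and discusses strategies for handling the resulting data inconsistencies in the case of enzyme inhibitors/binders.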

## Key findings

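- Differences in buffer composition and experimental setup between assays against the same target introduce considerable noise into ML training data.
- Enzymes and viral surface proteins undergo conformational changes driven by extrinsic factors, so measured activity can depend on assay conditions.
- An interview with an expert on LLMs and agentic AI explores how recent developments in these areas can support drug discovery.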

## Abstract

Combining bioactivity data of assays against the same target, which are obtained from different sources, was recently shown to lead to considerable noise for training data sets of machine learning (ML) models. In this Viewpoint, we address the profound impact originating from often overlooked changes to an assay protocol relating to the buffer composition and experimental setup. We cover two examples of protein targets that undergo conformational changes driven by extrinsic factors: enzymes as catalytically active proteins, and viral surface proteins as structural targets. We discuss strategies to tackle this challenge for the case of enzyme inhibitors/binders, the utility of models based on deep learning (DL), and current limitations of computational studies assessing protein–ligand interactions. In an interview with an expert in the field of large language models (LLMs) and agentic AI, we explore how the latest developments in these areas can be leveraged to support drug discovery efforts.
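
The pooling problem described in the abstract can be made concrete with a minimal Python sketch. It is not the authors' method. The record layout (`compound`, `source`, `buffer`, `pic50`), the sample values, and the 0.5 log-unit threshold are all hypothetical and chosen only for illustration. The sketch flags compounds whose reported potency differs widely across sources, and averages values only among measurements that share the same buffer.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; every value here is made up for illustration.
records = [
    {"compound": "cmpd-1", "source": "lab-A", "buffer": "HEPES", "pic50": 6.1},
    {"compound": "cmpd-1", "source": "lab-B", "buffer": "phosphate", "pic50": 7.2},
    {"compound": "cmpd-1", "source": "lab-C", "buffer": "HEPES", "pic50": 6.3},
    {"compound": "cmpd-2", "source": "lab-A", "buffer": "HEPES", "pic50": 5.0},
    {"compound": "cmpd-2", "source": "lab-B", "buffer": "phosphate", "pic50": 5.2},
]


def spread_by_compound(rows):
    """Return (max - min) pIC50 per compound across all sources."""
    values = defaultdict(list)
    for row in rows:
        values[row["compound"]].append(row["pic50"])
    return {name: max(v) - min(v) for name, v in values.items()}


def flag_inconsistent(rows, threshold=0.5):
    """List compounds whose pIC50 spread exceeds the (assumed) threshold."""
    spreads = spread_by_compound(rows)
    return sorted(name for name, s in spreads.items() if s > threshold)


def pool_within_condition(rows):
    """Average pIC50 only among records sharing compound and buffer."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["compound"], row["buffer"])].append(row["pic50"])
    return {key: mean(v) for key, v in groups.items()}


flagged = flag_inconsistent(records)
pooled = pool_within_condition(records)
print(flagged)
print(pooled[("cmpd-1", "HEPES")])
```

Grouping by buffer alone is a simplification. A real curation step would likely also key on pH, additives, enzyme construct, and readout, as far as the source reports them.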

## Full-text entities

- **Genes:** E (envelope protein) [NCBI Gene 43740570], SPR (sepiapterin reductase) [NCBI Gene 6697] {aka SDR38C1}, RBBP9 (RB binding protein 9, serine hydrolase) [NCBI Gene 10741] {aka BOG, RBBP10}, Mpro [NCBI Gene 8673700], S (surface glycoprotein) [NCBI Gene 43740568] {aka spike glycoprotein}
- **Chemicals:** phosphate (MESH:D010710), glycerol (MESH:D005990), NaCl (MESH:D012965), HEPES (MESH:D006531), 2-aminobenzoic acid (MESH:C031385), valine (MESH:D014633), ethylene glycol (MESH:D019855), Boceprevir (MESH:C512204), tryptophan (MESH:D014364), CHAPS (MESH:C028213), boronic acid (MESH:D001897), Bz-nKRR-AMC (-), B (MESH:D001895), citrate (MESH:D019343)
- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Lassa virus [taxon 11620], Tick-borne encephalitis virus (no rank) [taxon 11084], Dengue virus (no rank) [taxon 12637], West Nile virus (no rank) [taxon 11082], Zika virus (no rank) [taxon 64320], Yellow fever virus (no rank) [taxon 11089], Orthomyxoviridae (family) [taxon 11308]
- **Mutations:** arginine-arginine-7, serine-glycine-2
- **Cell lines:** His6 — Cricetulus griseus (Chinese hamster), Hybrid cell line (CVCL_2875)

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12797157/full.md

---
Source: https://tomesphere.com/paper/PMC12797157