Defining the Data set Defines the QSAR Claim
Manal A. Nael, Laxman M. Alakonda, Khaled M. Elokely

TL;DR
This paper introduces data set contracts to improve transparency and reliability in QSAR modeling by clearly defining data processing and evaluation rules.
Contribution
The paper proposes data set contracts: executable, auditable documents that standardize and record the data processing and evaluation choices behind a QSAR claim.
Findings
Inconsistent standardization and hidden leakage often inflate QSAR model performance.
Data set contracts can make QSAR claims more transparent and reproducible.
These contracts are feasible using current open-source tools.
Abstract
Machine learning has greatly expanded QSAR modeling, but predictive claims still depend on choices that are rarely documented: how chemicals are represented, how end points are defined, and how evaluations are designed. In the era of benchmarks and foundation models, inconsistent standardization, unclear rules for combining measurements, and hidden information leakage routinely inflate reported performance while obscuring weaknesses that matter for real-world applications. We propose data set contracts: executable, auditable documents that explicitly declare chemical processing rules, end point definitions, aggregation logic, data splits, and leakage diagnostics for the intended prediction scenario. These contracts are feasible with current open-source tools and would shift the field from architecture-centric comparisons toward claims that are transparent, reproducible, and trustworthy.
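The abstract lists what a contract should declare (chemical processing rules, end point definition, aggregation logic, data splits, leakage diagnostics) but this summary does not show a concrete format. The following is a minimal sketch of what such a contract could look like in Python, using only the standard library. The class name `DatasetContract`, its field names, the median aggregation rule, and the identifier-overlap check are all illustrative assumptions, not the authors' specification; in practice the compound identifiers would come from a declared structure standardization step.

```python
from dataclasses import dataclass, asdict
from statistics import median
import hashlib
import json


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical contract; field names are illustrative, not from the paper."""
    structure_standardization: str  # name/version of the chemical processing rules
    endpoint: str                   # end point definition, including units
    aggregation: str                # rule for combining replicate measurements
    split: str                      # split strategy for the intended prediction scenario
    leakage_checks: tuple           # diagnostics to run before reporting performance

    def fingerprint(self) -> str:
        """Stable SHA-256 hash so a reported result can cite the exact contract."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def aggregate_replicates(records):
    """Combine replicate (compound_id, value) measurements by per-compound median.

    Median is an assumed rule chosen for illustration; a real contract would
    state and justify its own aggregation logic.
    """
    grouped = {}
    for compound_id, value in records:
        grouped.setdefault(compound_id, []).append(value)
    return {cid: median(values) for cid, values in grouped.items()}


def shared_compounds(train_ids, test_ids):
    """Simplest leakage diagnostic: identifiers present in both train and test."""
    return sorted(set(train_ids) & set(test_ids))


contract = DatasetContract(
    structure_standardization="example-protocol-v1",  # placeholder value
    endpoint="pIC50 (unitless, -log10 of molar IC50)",
    aggregation="median of replicates",
    split="scaffold-grouped holdout",
    leakage_checks=("exact identifier overlap",),
)
```

Publishing the contract's fingerprint alongside a reported metric would let a reader confirm that two results were computed under the same declared rules. Exact identifier overlap is only the weakest leakage check; a contract for a realistic scenario would likely add checks for near-duplicate structures and shared assay sources.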
Figure 1
Figure 2
Taxonomy
Topics: Pharmacovigilance and Adverse Drug Reactions · Biomedical Text Mining and Ontologies · Cardiac, Anesthesia and Surgical Outcomes
