# Defining the Data Set Defines the QSAR Claim

**Authors:** Manal A. Nael, Laxman M. Alakonda, Khaled M. Elokely

PMC · DOI: 10.1021/acs.jcim.6c00514 · 2026-02-27

## TL;DR

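This paper proposes data set contracts: executable, auditable documents that declare chemical processing rules, end point definitions, aggregation logic, data splits, and leakage diagnostics, so that QSAR performance claims become transparent and reproducible.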

## Contribution

The paper proposes data set contracts as a way to standardize and document the choices behind a QSAR claim: how chemicals are represented, how end points are defined, and how evaluations are designed.

## Key findings

- Inconsistent standardization, unclear rules for combining measurements, and hidden information leakage routinely inflate reported QSAR performance.
- Data set contracts can make QSAR claims more transparent and reproducible.
- These contracts are feasible using current open-source tools.
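
To make the leakage finding concrete, here is a minimal sketch of one possible diagnostic: checking whether test-set structures reappear in the training set once a standardization rule is applied. It is not taken from the paper. The function names (`strip_counterions`, `leakage_report`) are hypothetical, and the "keep the longest fragment" rule is a deliberately crude stand-in for a real standardizer from a cheminformatics toolkit.

```python
def strip_counterions(smiles: str) -> str:
    """Toy standardizer (assumption): keep the longest '.'-separated fragment.

    A real pipeline would use a cheminformatics toolkit; this only
    illustrates why the standardization rule must be declared.
    """
    return max(smiles.split("."), key=len)


def leakage_report(train, test, standardize=strip_counterions):
    """Report test structures whose standardized form also occurs in train."""
    train_keys = {standardize(s) for s in train}
    leaked = sorted({s for s in test if standardize(s) in train_keys})
    return {"n_test": len(test), "n_leaked": len(leaked), "leaked": leaked}


# Sodium acetate in train, the bare acetate anion in test.
train = ["CC(=O)[O-].[Na+]", "c1ccccc1O"]
test = ["CC(=O)[O-]", "CCN"]

raw = leakage_report(train, test, standardize=lambda s: s)  # no standardization
std = leakage_report(train, test)                           # with the toy rule

print(raw["n_leaked"], std["n_leaked"])
```

Comparing raw strings finds no overlap here, while the standardized comparison flags the acetate entry. The same split can therefore look clean or leaky depending on a rule that usually goes undocumented.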

## Abstract

Machine learning has greatly expanded QSAR modeling, but predictive claims still depend on choices that are rarely documented: how chemicals are represented, how end points are defined, and how evaluations are designed. In the era of benchmarks and foundation models, inconsistent standardization, unclear rules for combining measurements, and hidden information leakage routinely inflate reported performance while obscuring weaknesses that matter for real-world applications. We propose data set contracts: executable, auditable documents that explicitly declare chemical processing rules, end point definitions, aggregation logic, data splits, and leakage diagnostics for the intended prediction scenario. These contracts are feasible with current open-source tools and would shift the field from architecture-centric comparisons toward claims that are transparent, reproducible, and trustworthy.
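
The abstract does not prescribe a file format for a contract, so the following is only an illustrative sketch of what "executable and auditable" could mean in practice. Every name in it (`DatasetContract`, its fields, `aggregate`, `AGGREGATORS`) is an assumption made for this example, not the authors' schema. The contract declares its rules as data, exposes a content hash that can be reported alongside results, and drives the aggregation of replicate measurements.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from statistics import median


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical contract: each field declares one modeling choice."""
    standardization: tuple  # ordered chemical processing rules
    endpoint: str           # what is being predicted
    units: str
    aggregation: str        # how replicate measurements are combined
    split: str              # evaluation design
    leakage_checks: tuple   # diagnostics that must be run

    def fingerprint(self) -> str:
        # Hash of a canonical JSON form, so any rule change is detectable.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


AGGREGATORS = {"median": median, "mean": lambda xs: sum(xs) / len(xs)}


def aggregate(contract, measurements):
    """Combine (compound_id, value) pairs using the contract's declared rule."""
    combine = AGGREGATORS[contract.aggregation]
    grouped = {}
    for compound_id, value in measurements:
        grouped.setdefault(compound_id, []).append(value)
    return {cid: combine(values) for cid, values in grouped.items()}


contract = DatasetContract(
    standardization=("strip_salts", "neutralize"),
    endpoint="pIC50",
    units="-log10(M)",
    aggregation="median",
    split="scaffold",
    leakage_checks=("duplicate_structures",),
)

labels = aggregate(
    contract,
    [("cpd-1", 6.0), ("cpd-1", 6.4), ("cpd-1", 7.1), ("cpd-2", 5.2)],
)

# Changing a single declared rule yields a different fingerprint.
mean_contract = replace(contract, aggregation="mean")
print(contract.fingerprint() != mean_contract.fingerprint())
```

Because the contract is frozen and hashed, two reports that cite the same fingerprint were produced under the same declared rules, which is one way to read the abstract's "auditable" requirement.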

## Full-text entities

- **Chemicals:** Salt (MESH:D012492)

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13014448/full.md

---
Source: https://tomesphere.com/paper/PMC13014448