Improving reproducibility of cheminformatics workflows with chembl-downloader
Charles Tapley Hoyt

TL;DR
This paper introduces chembl-downloader, a Python package that enables reproducible, up-to-date access to ChEMBL data, improving transparency and consistency in cheminformatics workflows.
Contribution
The paper presents a new Python tool that simplifies reproducible access to current and specific versions of ChEMBL data for cheminformatics research.
Findings
Provides reproducible data acquisition from ChEMBL
Enables access to latest or specific ChEMBL versions
Makes derived datasets more transparent and easier to keep up to date
Abstract
Many modern cheminformatics workflows derive datasets from ChEMBL, but few of these datasets are published with accompanying code for their generation. Consequently, their methodologies (e.g., selection, filtering, aggregation) are opaque, reproduction is difficult, and interpretation of results therefore lacks important context. Further, such static datasets quickly become out-of-date. For example, the current version of ChEMBL is v35 (as of December 2024), but ExCAPE-DB uses v20, Deep Confidence uses v23, the consensus dataset from Isigkeit _et al._ (2022) uses v28, and Papyrus uses v30. Therefore, there is a need for tools that provide reproducible bulk access to the latest (or a given) version of ChEMBL in order to enable researchers to make their derived datasets more transparent, updatable, and trustworthy. This article introduces `chembl-downloader`, a Python package for the…
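The reproducibility argument in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It does not use or reproduce the `chembl-downloader` API. The in-memory table `toy_activities`, the rows in `TOY_ROWS`, and the helper names `build_toy_database` and `derive_dataset` are hypothetical stand-ins for a version-pinned local copy of ChEMBL. The idea shown is that a derived dataset is published together with the ChEMBL version and the exact query that produced it.

```python
import hashlib
import json
import sqlite3

# Hypothetical toy rows standing in for a local ChEMBL SQLite dump;
# the real database is far larger and has a different, richer schema.
TOY_ROWS = [
    ("CHEMBL25", "IC50", 6.2),
    ("CHEMBL25", "Ki", 7.1),
    ("CHEMBL192", "IC50", 8.4),
    ("CHEMBL521", "IC50", 4.9),
]

# Selection and filtering criteria are written down as an explicit query,
# so they can be published alongside the derived dataset.
SQL = """
SELECT chembl_id, pchembl_value
FROM toy_activities
WHERE standard_type = 'IC50' AND pchembl_value >= 6.0
ORDER BY chembl_id
"""


def build_toy_database() -> sqlite3.Connection:
    """Create an in-memory database with the hypothetical toy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE toy_activities "
        "(chembl_id TEXT, standard_type TEXT, pchembl_value REAL)"
    )
    conn.executemany("INSERT INTO toy_activities VALUES (?, ?, ?)", TOY_ROWS)
    return conn


def derive_dataset(conn: sqlite3.Connection, version: str, sql: str):
    """Run the query and return the rows plus a provenance manifest."""
    rows = conn.execute(sql).fetchall()
    manifest = {
        "chembl_version": version,  # the pinned release the data came from
        "sql": sql.strip(),  # the exact derivation logic
        "sql_sha256": hashlib.sha256(sql.strip().encode("utf-8")).hexdigest(),
        "n_rows": len(rows),
    }
    return rows, manifest


conn = build_toy_database()
rows, manifest = derive_dataset(conn, version="35", sql=SQL)
print(json.dumps(manifest, indent=2))
```

Under these assumptions, regenerating the dataset against a newer release would only require changing the `version` argument and pointing the connection at that release, while the manifest keeps the selection logic visible to readers.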
Taxonomy
Topics: Scientific Computing and Data Management · Big Data and Business Intelligence · Data Mining Algorithms and Applications
