Clever Hans in Chemistry: Chemist Style Signals Confound Activity Prediction on Public Benchmarks
Andrew D. Blevins, Ian K. Quigley

TL;DR
This paper shows that machine learning models for chemical activity prediction can exploit chemist-specific style signals instead of learning true structure-activity relationships, exposing a confound in public benchmark datasets.
Contribution
It demonstrates a 'Clever Hans' bias in chemical activity prediction models and proposes mitigating this confound through dataset splitting practices.
Findings
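A 1,815-class classifier predicts publication authors from molecular fingerprints with 60% top-5 accuracy under scaffold-based splitting.
An author-only activity model, given just a protein identifier and a structure-derived author-probability vector, achieves predictive power comparable to a simple baseline that has access to structure.
Activity predictions can be driven largely by inferred chemist goals and favorite targets, not only by molecular structure.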
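To make the "top-5 accuracy" figure concrete, here is a minimal sketch of how a top-k metric is typically computed for a multi-class classifier. It is not the authors' evaluation code: the function name `top_k_accuracy`, the tie-breaking rule, and the toy probabilities are all assumptions made for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 5,
) -> float:
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    hits = 0
    for row, label in zip(probs, labels):
        # Indices of the k largest scores; ties go to the lower class index
        # (an arbitrary choice made for this sketch).
        top_k = sorted(range(len(row)), key=lambda c: (-row[c], c))[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 4 hypothetical author classes, evaluated at k=2.
toy_probs = [
    [0.10, 0.60, 0.25, 0.05],  # top-2 classes: 1, 2
    [0.40, 0.30, 0.20, 0.10],  # top-2 classes: 0, 1
    [0.05, 0.15, 0.30, 0.50],  # top-2 classes: 3, 2
]
toy_labels = [2, 3, 3]
score = top_k_accuracy(toy_probs, toy_labels, k=2)  # 2 of 3 rows are hits
```

With 1,815 classes, a uniform random guesser would score about 5/1,815 (roughly 0.3%) on top-5 accuracy, which gives a sense of scale for the reported 60%.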
Abstract
Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking ChEMBL assays to publication authors and training a 1,815-class classifier to predict authors from molecular fingerprints, achieving 60% top-5 accuracy under scaffold-based splitting. We then train an activity model that receives only a protein identifier and an author-probability vector derived from structure, with no direct access to molecular descriptors. This author-only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a "Clever Hans" failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets without requiring a lab-independent understanding of…
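The scaffold-based splitting mentioned in the abstract can be sketched as a group-aware split: every molecule sharing a scaffold lands on the same side of the train/test boundary. The sketch below assumes scaffold keys have already been computed elsewhere (for example as Bemis-Murcko scaffold strings); the function `group_split`, its parameters, and the toy scaffold list are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def group_split(
    groups: Sequence[Hashable],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Split row indices so that no group appears in both train and test."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, key in enumerate(groups):
        by_group[key].append(idx)

    keys = sorted(by_group, key=str)  # fixed order before the seeded shuffle
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(groups)
    train: List[int] = []
    test: List[int] = []
    for key in keys:
        # Assign whole groups to test until it reaches the target size,
        # then send every remaining group to train.
        bucket = test if len(test) < target else train
        bucket.extend(by_group[key])
    return sorted(train), sorted(test)


# Toy data: 8 molecules over 4 hypothetical scaffold keys, 2 molecules each.
scaffolds = [
    "c1ccccc1", "c1ccccc1",
    "c1ccncc1", "C1CCCCC1",
    "C1CCCCC1", "c1ccncc1",
    "c1ccc2ccccc2c1", "c1ccc2ccccc2c1",
]
train_idx, test_idx = group_split(scaffolds, test_fraction=0.25, seed=0)
```

The same function could group by publication author or by document instead of by scaffold, which is one way to probe whether a model is leaning on chemist identity; whether that matches the splitting practices the paper recommends is not stated in this summary.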
Taxonomy
Topics: Computational Drug Discovery Methods · Biomedical Text Mining and Ontologies · Machine Learning in Materials Science
