Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling

Thor Klamt; Wolfgang Nejdl; Ming Tang

arXiv:2605.11764·cs.LG·May 13, 2026

TL;DR

This paper decomposes the gap between random-split and leave-one-target-out performance in PROTAC activity prediction, identifies inter-laboratory measurement variance as the dominant component, and proposes methods to mitigate it.

Contribution

The paper introduces a variance-decomposition framework for understanding generalization gaps and demonstrates how measurement variance limits the predictive performance of PROTAC activity models.

Findings

01

Inter-laboratory measurement variance dominates the generalization gap.

02

Hyperparameter tuning cannot surpass the performance ceiling set by measurement variance.

03

Few-shot learning and calibration improve target-specific AUROC scores.

Abstract

Machine-learning predictors of biochemical activity often exhibit large random-split-to-leave-one-target-out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component,…
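
The difference between the two evaluation protocols named in the abstract can be illustrated with a minimal sketch. It is not the authors' code or the protocol implementation of Ribes et al.; the helper names `random_split` and `leave_one_target_out` and the toy target labels are assumptions introduced only for illustration. The sketch shows why a random split lets the same target appear in both train and test sets, while LOTO holds out every record of one target at a time.

```python
import random
from collections import defaultdict


def random_split(n_items, test_fraction=0.2, seed=0):
    """Shuffle record indices and cut off a test set.

    Targets are ignored, so a target can land on both sides of the split.
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(n_items * test_fraction)))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def leave_one_target_out(targets):
    """Yield (held_out_target, train_idx, test_idx), one fold per distinct target."""
    by_target = defaultdict(list)
    for i, t in enumerate(targets):
        by_target[t].append(i)
    for held_out in sorted(by_target):
        test_idx = by_target[held_out]
        train_idx = [i for i, t in enumerate(targets) if t != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical toy data: the protein target attached to each PROTAC record.
targets = ["BRD4", "BRD4", "BRD4", "AR", "AR", "BTK", "BTK", "BTK", "ER", "ER"]

# Random split: report which targets occur on both sides.
train, test = random_split(len(targets), test_fraction=0.3, seed=0)
shared = {targets[i] for i in train} & {targets[i] for i in test}
print("random split, targets in both train and test:", sorted(shared))

# LOTO: the held-out target never appears in the training fold.
for held_out, train_idx, test_idx in leave_one_target_out(targets):
    assert held_out not in {targets[i] for i in train_idx}
    print(held_out, "test size:", len(test_idx), "train size:", len(train_idx))
```

In practice a model would be fitted on each training fold and scored (for example by AUROC) on the matching test fold. The sketch covers only the splitting step, which is where the two protocols differ.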
