Diagnosing Heteroskedasticity and Resolving Multicollinearity Paradoxes in Physicochemical Property Prediction
Malikussaid, Septian Caesar Floresko, Ade Romadhony, Isman Kurniawan, Warih Maharani, Hilal Hudan Nuha

TL;DR
This study shows that linear models predicting lipophilicity (logP) suffer from severe heteroskedasticity, that tree-based ensemble models are more robust to it, and that SHAP analysis clarifies how multicollinearity affects feature importance in QSAR.
Contribution
The paper demonstrates the limitations of linear regression for logP prediction, advocates ensemble methods instead, and provides a new interpretative framework for feature importance in QSAR.
Findings
Linear models exhibit severe heteroskedasticity in lipophilicity prediction.
Tree-based ensemble models outperform linear models in predictive accuracy.
SHAP analysis uncovers hidden multicollinearity effects affecting feature importance.
Abstract
Lipophilicity (logP) prediction remains central to drug discovery, yet linear regression models for this task frequently violate statistical assumptions in ways that invalidate their reported performance metrics. We analyzed 426,850 bioactive molecules from a rigorously curated intersection of PubChem, ChEMBL, and eMolecules databases, revealing severe heteroskedasticity in linear models predicting computed logP values (XLOGP3): residual variance increases 4.2-fold in lipophilic regions (logP greater than 5) compared to balanced regions (logP 2 to 4). Classical remediation strategies (Weighted Least Squares and Box-Cox transformation) failed to resolve this violation (Breusch-Pagan p-value less than 0.0001 for all variants). Tree-based ensemble methods (Random Forest R-squared of 0.764, XGBoost R-squared of 0.765) proved inherently robust to heteroskedasticity while delivering superior…
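The two diagnostics named in the abstract, a residual-variance ratio between regions and the Breusch-Pagan test, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic rather than the paper's 426,850 molecules, there is a single made-up predictor, the noise model and all function names (`simulate`, `ols_simple`, `breusch_pagan_1d`, `diagnose`) are hypothetical, and the test is the studentized (Koenker) form of Breusch-Pagan, which may differ from the variant the authors used.

```python
import math
import random


def simulate(n=2000, seed=0):
    """Synthetic data (assumption): noise spread grows with the predictor x."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 7.0) for _ in range(n)]
    y = [xi + rng.gauss(0.0, 0.1 + 0.1 * xi) for xi in x]
    return x, y


def ols_simple(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, residuals)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, resid


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def breusch_pagan_1d(x, resid):
    """Studentized Breusch-Pagan test for one regressor.

    LM = n * R^2 from regressing squared residuals on x; under
    homoskedasticity LM is approximately chi-squared with 1 degree of freedom.
    """
    n = len(x)
    u2 = [r * r for r in resid]
    _, _, aux_resid = ols_simple(x, u2)
    mean_u2 = sum(u2) / n
    ss_tot = sum((u - mean_u2) ** 2 for u in u2)
    ss_res = sum(e * e for e in aux_resid)
    lm = max(n * (1.0 - ss_res / ss_tot), 0.0)
    # Survival function of chi-squared(1): P(Z^2 > lm) = erfc(sqrt(lm / 2)).
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


def diagnose(x, y):
    """Return (variance ratio high/mid region, LM statistic, p-value)."""
    _, _, resid = ols_simple(x, y)
    mid = [r for xi, r in zip(x, resid) if 2.0 <= xi <= 4.0]
    high = [r for xi, r in zip(x, resid) if xi > 5.0]
    ratio = variance(high) / variance(mid)
    lm, p_value = breusch_pagan_1d(x, resid)
    return ratio, lm, p_value


if __name__ == "__main__":
    x, y = simulate()
    ratio, lm, p_value = diagnose(x, y)
    print(f"variance ratio (x > 5 vs 2 <= x <= 4): {ratio:.2f}")
    print(f"Breusch-Pagan LM: {lm:.1f}, p-value: {p_value:.3g}")
```

A ratio well above 1 together with a very small p-value is the pattern the abstract describes for linear models. In this sketch the regions are defined on the synthetic predictor, whereas the abstract defines them on logP itself.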
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks
