Exploring the efficacy of molecular fragments of different complexity in computational SAR modeling
Albrecht Zimmermann, Björn Bringmann, Luc De Raedt

TL;DR
This paper compares simple sequence-based molecular fragments with complex graph-based fragments in SAR modeling, finding that simpler fragments often outperform complex ones due to lower correlation and better class distinction.
Contribution
The study challenges the assumption that more complex fragments are always better for SAR modeling, demonstrating the effectiveness of low-complexity sequences.
Findings
Low-complexity fragments (sequences) perform better than equally large sets of more complex fragments.
Pairwise correlation among the fragments in a set helps explain this difference.
Using significance thresholds reduces features with minimal performance loss.
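A minimal sketch of how pairwise correlation and a significance threshold might be computed for binary fragment features. The summary does not state which statistics the authors used, so the phi coefficient, the chi-squared test, the 3.841 cutoff (the 5% critical value for one degree of freedom), and all names and toy data below are assumptions made for illustration.

```python
from math import sqrt

def phi(x, y):
    """Phi coefficient (Pearson correlation) of two equal-length 0/1 vectors."""
    n = len(x)
    n11 = sum(a * b for a, b in zip(x, y))   # both present
    nx, ny = sum(x), sum(y)                  # marginal counts of ones
    denom = sqrt(nx * (n - nx) * ny * (n - ny))
    if denom == 0:                           # a constant vector carries no signal
        return 0.0
    return (n * n11 - nx * ny) / denom

def chi2(feature, labels):
    """Chi-squared statistic of a 2x2 table; equals n * phi^2."""
    return len(feature) * phi(feature, labels) ** 2

def select(features, labels, threshold=3.841):
    """Keep the names of fragments whose chi-squared score reaches the threshold."""
    return [name for name, col in features.items()
            if chi2(col, labels) >= threshold]

# Hypothetical toy data: 8 compounds, 3 fragments, binary activity labels.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "f1": [1, 1, 1, 1, 0, 0, 0, 0],
    "f2": [1, 1, 1, 0, 0, 0, 0, 0],
    "f3": [1, 0, 1, 0, 1, 0, 1, 0],
}

selected = select(features, labels)
redundancy = phi(features["f1"], features["f2"])  # correlation between kept fragments
```

Here `f3` is independent of the class and is dropped, while `f1` and `f2` both pass the threshold but are strongly correlated with each other, which is the kind of redundancy a correlation check between fragments is meant to expose.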
Abstract
An important first step in computational SAR modeling is to transform the compounds into a representation that can be processed by predictive modeling techniques. This is typically a feature vector where each feature indicates the presence or absence of a molecular fragment. While the traditional approach to SAR modeling employed size-restricted fingerprints derived from path fragments, much research in recent years has focused on mining more complex graph-based fragments. Today, there seems to be a growing consensus in the data mining community that these more expressive fragments should be more useful. We question this consensus and show experimentally that fragments of low complexity, i.e. sequences, perform better than equally large sets of more complex ones, an effect we explain by pairwise correlation among fragments and the ability of a fragment set to encode compounds from…
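The presence/absence encoding described in the abstract can be illustrated with a small sketch. It assumes a toy molecule representation (a list of atom labels plus a list of bonds between heavy atoms) and enumerates label sequences along simple paths of bounded length. The function names, the `max_len` parameter, and the two example compounds are hypothetical, and this is not the authors' mining procedure.

```python
def path_fragments(atoms, bonds, max_len=3):
    """Return the label sequences of all simple paths with at most max_len atoms."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    found = set()

    def extend(path):
        labels = tuple(atoms[i] for i in path)
        # A path and its reverse are the same fragment, so keep one canonical form.
        found.add(min(labels, labels[::-1]))
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return found

def encode(compounds, max_len=3):
    """Build a shared fragment vocabulary and one 0/1 vector per compound."""
    frags = {name: path_fragments(atoms, bonds, max_len)
             for name, (atoms, bonds) in compounds.items()}
    vocab = sorted(set().union(*frags.values()))
    vectors = {name: [int(f in fs) for f in vocab] for name, fs in frags.items()}
    return vocab, vectors

# Hypothetical toy compounds, heavy atoms only.
compounds = {
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "methylamine": (["C", "N"], [(0, 1)]),
}

vocab, vectors = encode(compounds)
```

Each position in a vector corresponds to one entry of `vocab`, so the resulting rows can be passed directly to a standard classifier as binary features.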
Taxonomy
Topics: Machine Learning in Bioinformatics · Computational Drug Discovery Methods · Advanced Proteomics Techniques and Applications
