Addendum: Data splitting against information leakage with DataSAIL
Roman Joeres, David B. Blumenthal, Olga V. Kalinina

Addendum to: Nature Communications 10.1038/s41467-025-58606-8, published online 8 April 2025
In our recently published manuscript, we presented DataSAIL, a tool that automatically splits datasets used for training machine learning models while minimizing information leakage between the splits (e.g., training, validation, and test sets). In evaluating our method, we used several benchmark datasets and compared DataSAIL to existing tools and splitting algorithms. While the publication focused on comparing automated tools for data splitting, many of the datasets used in our study were published together with data splits created by their authors. In this addendum, we extend our analysis by comparing DataSAIL splits to these predefined data splits. We do so by measuring the amount of information leaked between splits using the scaled version of the leakage score L(π), as defined in Eqs. (2) and (20) of our original article. In addition to the three datasets used in our paper (MoleculeNet^1^, LP-PDBBind^2^, and PLINDER^3^), we extend our analysis to PINDER^4^ and the human gold-standard dataset for protein-protein interaction (PPI) prediction by Bernett et al.^5^. We include these two datasets for completeness and because of their popularity in the field of PPI prediction.
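To make the idea of a scaled leakage score concrete, the following minimal sketch computes the share of total pairwise similarity that crosses split boundaries. This normalization is an assumption made for illustration only; the exact definition of the scaled L(π) is given in Eqs. (2) and (20) of the original article and may differ. The function name `scaled_leakage` and the toy data are hypothetical.

```python
from itertools import combinations


def scaled_leakage(similarity, assignment):
    """Share of total pairwise similarity that crosses split boundaries.

    similarity: dict mapping frozenset({a, b}) -> similarity in [0, 1]
    assignment: dict mapping sample id -> split name
    Assumption: this ratio stands in for the scaled L(pi) of the paper.
    """
    total = 0.0
    leaked = 0.0
    for a, b in combinations(sorted(assignment), 2):
        sim = similarity.get(frozenset((a, b)), 0.0)  # missing pairs count as 0
        total += sim
        if assignment[a] != assignment[b]:
            leaked += sim  # similarity shared across two different splits
    return leaked / total if total > 0 else 0.0


# Toy data: a/b are near-duplicates, and so are c/d.
toy_similarity = {
    frozenset(("a", "b")): 0.9,
    frozenset(("c", "d")): 0.8,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "d")): 0.2,
}
# Keeping near-duplicates together leaks little; separating them leaks a lot.
clustered = {"a": "train", "b": "train", "c": "test", "d": "test"}
scattered = {"a": "train", "b": "test", "c": "train", "d": "test"}

leak_clustered = scaled_leakage(toy_similarity, clustered)
leak_scattered = scaled_leakage(toy_similarity, scattered)
```

In this toy example, the assignment that keeps near-duplicates in the same split yields the lower score, which is the behavior a leakage-minimizing splitter aims for.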
MoleculeNet
The authors of MoleculeNet provide recommended splits for each dataset. For Table 1, these splits were computed using the DeepChem package^6^; most of them are random splits. Exceptions include QM7, which features a stratified split, and HIV, BACE, and BBBP, which have scaffold splits. Each dataset is split into 80% training data, 10% validation data, and 10% test data. In our original work, we used DataSAIL’s similarity-based splitting to achieve an 80:20 separation between the training and test data. To match the 80:10:10 ratios, we recomputed the DataSAIL splits, which leads to slightly different leakage measurements compared with Table 1 in the original publication. In all cases, the DataSAIL splits exhibit less data leakage.

Table 1. DataSAIL on MoleculeNet

| Dataset | MoleculeNet recommended technique | MoleculeNet scaled L(π) | DataSAIL S1 scaled L(π) |
|---|---|---|---|
| QM7 | stratified | 0.3425 | **0.2680** |
| QM8 | random | 0.3300 | **0.2918** |
| QM9 | random | 0.3306 | **0.2727** |
| ESOL | random | 0.3069 | **0.1808** |
| Freesolv | random | 0.3231 | **0.1410** |
| Lipophilicity | random | 0.3343 | **0.3027** |
| MUV | scaffold | 0.3349 | **0.3143** |
| HIV | scaffold | 0.3306 | **0.3071** |
| BACE | random | 0.3309 | **0.3036** |
| BBBP | scaffold | 0.3366 | **0.2866** |
| Tox21 | random | 0.3333 | **0.2224** |
| ToxCast | random | 0.3355 | **0.2220** |
| SIDER | random | 0.3513 | **0.2345** |
| ClinTox | random | 0.3317 | **0.2303** |

Comparison of the scaled L(π) of the recommended splits by MoleculeNet to the DataSAIL S1 split. The minimal data leakage is highlighted in bold.
LP-PDBBind
The authors of the Leak-Proof PDBBind (LP-PDBBind) dataset^2^ also provide a data split that reduces data leakage between the training, validation, and test sets (Table 2). Their approach guarantees a maximum protein sequence identity of 50% between the training split and any other split, and a maximum protein sequence identity of 90% between the validation and test data. Furthermore, the maximum ligand similarity between any two splits is 0.99. The protein similarity is computed as the percentage of matching residues after a Needleman-Wunsch alignment, while the ligand similarity is measured as the Dice similarity between Morgan fingerprints.

Table 2. DataSAIL on LP-PDBBind

| Split method | Scaled L(π) |
|---|---|
| LP-PDBBind | 0.4484 |
| DataSAIL Ligand S1 | 0.6330 |
| DataSAIL Protein S1 | 0.5446 |
| DataSAIL S2 | **0.4277** |

Comparison of the scaled L(π) of the provided split by Li et al. to three DataSAIL splits. The minimal data leakage is highlighted in bold.
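The ligand-side constraint described for LP-PDBBind can be illustrated with a short sketch that computes the Dice similarity between two fingerprints and checks the largest similarity found across two splits. Fingerprints are represented here as plain sets of on-bit indices, which is a simplifying assumption; real Morgan fingerprints would come from a cheminformatics toolkit. The function names and toy ligands are hypothetical and do not reproduce the LP-PDBBind pipeline.

```python
def dice_similarity(bits_a, bits_b):
    """Dice coefficient 2|A & B| / (|A| + |B|) for two sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


def max_cross_split_similarity(fingerprints, assignment, split_a, split_b):
    """Largest Dice similarity between any item in split_a and any item in split_b."""
    in_a = [name for name, split in assignment.items() if split == split_a]
    in_b = [name for name, split in assignment.items() if split == split_b]
    return max(
        (dice_similarity(fingerprints[x], fingerprints[y]) for x in in_a for y in in_b),
        default=0.0,  # no cross-split pairs means no measurable similarity
    )


# Hypothetical toy fingerprints (sets of on-bit indices).
toy_fingerprints = {
    "lig1": {1, 4, 7, 9},
    "lig2": {1, 4, 7, 12},
    "lig3": {1, 3, 5},
    "lig4": {3, 5, 8},
}
toy_assignment = {"lig1": "train", "lig2": "train", "lig3": "test", "lig4": "test"}

worst = max_cross_split_similarity(toy_fingerprints, toy_assignment, "train", "test")
within_limit = worst <= 0.99  # the ligand similarity limit quoted in the text
```

A threshold check of this kind bounds the single worst pair across splits, whereas a leakage score sums contributions over all cross-split pairs.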
Imposing maximum similarity limits between splits differs from DataSAIL’s approach of minimizing the total amount of leaked information, which we also describe in the section “The (k, R, C)-DataSAIL problem” and Eq. (1) in the original publication. While Eq. (1) is derived from a different publication (Elangovan et al.^7^), it describes the concept of minimizing the maximum single leak. Because LP-PDBBind is split into approximately 61% training data, 26% validation data, and 13% test data, we had to recalculate the DataSAIL splits here as well, which explains the deviation of the values reported here from those in the main manuscript. The DataSAIL S2 split offers the smallest data leakage even though it was generated by an automated procedure, whereas the LP-PDBBind split was explicitly designed for this task.
PLINDER
The Protein Ligand INteraction Dataset and Evaluation Resource (PLINDER) is a dataset containing protein-ligand interactions extracted from the PDB. The authors provide three splits, among which the PLINDER-PL50 is the most complex. It combines four similarity metrics for protein-ligand complexes: (i) sequence identity of the proteins, (ii) pocket-level Jaccard similarity using pharmacophores, (iii) interaction-level similarity using PLIP features, and (iv) ligand similarities using the Tanimoto coefficient between ECFP4 fingerprints. The algorithm then identifies clusters of similar protein-ligand systems. Finally, the test set is constructed to contain systems from clusters that have no or minimal similarity to any systems in the training or validation set. Along with this, there are two simpler splits: PLINDER-TIME, which is a time-based split, and PLINDER-ECOD, which is based on ECOD topologies.
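As a rough illustration of similarity-based clustering, the sketch below computes the Tanimoto coefficient on fingerprints given as sets of on-bit indices and groups systems into single-linkage clusters (connected components of the graph that links every pair at or above a threshold). This is an assumption-laden simplification: PLINDER combines four similarity metrics and its actual clustering procedure is not reproduced here. The threshold of 0.5, the function names, and the toy systems are all hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient |A & B| / |A | B| for two sets of on-bits."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0


def cluster_by_threshold(items, similarity, threshold):
    """Single-linkage clustering: connect every pair with similarity >= threshold."""
    parent = {name: name for name in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = sorted(items)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(items[a], items[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)


# Hypothetical toy systems, each reduced to one ligand fingerprint.
toy_systems = {
    "sys1": {1, 2, 3, 4},
    "sys2": {1, 2, 3, 5},
    "sys3": {10, 11, 12},
    "sys4": {10, 11, 13},
    "sys5": {20, 21},
}
clusters = cluster_by_threshold(toy_systems, tanimoto, threshold=0.5)
```

Once clusters exist, a test set can be drawn from whole clusters so that no test system has a close neighbor in the training or validation data.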
In Supplementary Table 2 of the original publication, we already presented this comparison of DataSAIL’s splits to the PLINDER splits; it was added at the suggestion of a reviewer. We include it here as Table 3 for completeness of this addendum. As with the LP-PDBBind dataset, the DataSAIL S2 split reduces data leakage compared to the PLINDER splits.

Table 3. DataSAIL on PLINDER

| Split | Scaled L(π) |
|---|---|
| PLINDER-PL50 | 0.0678 |
| PLINDER-ECOD | 0.3601 |
| PLINDER-TIME | 0.3682 |
| DataSAIL Ligand S1 | 0.2307 |
| DataSAIL Protein S1 | 0.4008 |
| DataSAIL S2 | **0.0252** |

Comparison of the scaled L(π) of the different splits of the PLINDER-NR dataset to three DataSAIL splits. The minimal data leakage is highlighted in bold.
PINDER
The Protein INteraction Dataset and Evaluation Resource (PINDER) contains curated and well-annotated PPIs obtained from the RCSB NextGen database^8^. After data cleaning and preprocessing, PINDER provides a split with minimized data leakage. To measure the leakage between two systems (interacting protein-protein pairs), the authors employed FoldSeek^9^ and MMseqs^10^ to compare and cluster protein sequences and structures. In Table 4, we compare DataSAIL to version 1 of PINDER, released in November 2023.

Table 4. DataSAIL on PINDER

| Split | Scaled L(π) |
|---|---|
| PINDER | **0.0068** |
| DataSAIL | 0.0140 |

Comparison of the scaled L(π) of the main PINDER split to the described DataSAIL split. The minimal data leakage is highlighted in bold.
Unlike for the LP-PDBBind dataset, we can define a single similarity metric that covers both dimensions of this two-dimensional dataset of interacting proteins. Therefore, we did not use DataSAIL’s S2 splitting module directly but rather the S1 module with all protein sequences from both dimensions, each weighted by the number of interactions the protein participates in. From the resulting assignment, we assigned an interaction to a split if and only if both of its proteins were assigned to that same split. Here, the PINDER split exhibits less data leakage than the DataSAIL split.
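The two bookkeeping steps in this procedure, weighting proteins by their interaction counts and mapping a per-protein assignment back to interactions, can be sketched as follows. The per-protein split is hand-written here and stands in for the output of a one-dimensional splitter; the function names and toy data are hypothetical, and collecting cross-split interactions in a separate list is our reading of the "if and only if" rule rather than a documented DataSAIL behavior.

```python
from collections import Counter


def interaction_weights(interactions):
    """Weight each protein by the number of interactions it participates in."""
    weights = Counter()
    for protein_a, protein_b in interactions:
        weights[protein_a] += 1
        weights[protein_b] += 1
    return weights


def assign_interactions(interactions, protein_split):
    """Assign an interaction to a split only if both proteins share that split."""
    kept = {}
    unassigned = []
    for pair in interactions:
        split_a = protein_split[pair[0]]
        split_b = protein_split[pair[1]]
        if split_a == split_b:
            kept.setdefault(split_a, []).append(pair)
        else:
            unassigned.append(pair)  # the two proteins landed in different splits
    return kept, unassigned


# Hypothetical toy interactions and a hand-written per-protein assignment.
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P3", "P4"), ("P2", "P4"), ("P1", "P4")]
toy_protein_split = {"P1": "train", "P2": "train", "P3": "test", "P4": "test"}

weights = interaction_weights(toy_interactions)
kept, unassigned = assign_interactions(toy_interactions, toy_protein_split)
```

In this toy example, three of the five interactions span two splits and remain unassigned, which shows why a per-protein split can reduce the number of usable interactions.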
“Gold standard” human PPI dataset
In their work, Bernett et al.^5^ demonstrate that the performance of many sequence-based models for PPI prediction relies solely on a bias in the data, and that the models memorize the data rather than generalizing from it. Most models therefore perform well only because of information leakage rather than because they capture the underlying biology. The authors also provide a dataset to facilitate the evaluation of future models. This dataset is called the “gold standard dataset”^5^ (Table 5).

Table 5. DataSAIL on the “gold standard” PPI dataset

| Split | Scaled L(π) |
|---|---|
| Gold standard dataset | 0.3642 |
| DataSAIL | **0.0465** |

Comparison of the scaled L(π) of the provided split by Bernett et al. to the described DataSAIL split. The minimal data leakage is highlighted in bold.
To produce the gold standard dataset, the authors used KaHIP^11^ to split the whole human proteome into three partitions, P0, P1, and P2. This was done using the all-against-all protein sequence similarity matrix obtained from SIMAP2^12^ with length-normalized bitscores. Based on these partitions, the PPIs from the HIPPIE database v2.3^13^ were assigned to the three partitions if and only if both interacting proteins were in the respective partition. Negative interactions were sampled randomly while preserving the node degree distribution of the respective partition. Lastly, CD-HIT was employed to enforce a sequence similarity threshold of 40% and to remove sequence redundancy. The final dataset has 274,500 interactions.
With DataSAIL, we followed the same procedure as for splitting the PINDER dataset, preserving the split ratios and utilizing stratification to maintain a balance between positive and negative interactions.
References
- 2. Li, J. et al. Leak proof PDBBind: a reorganized dataset of protein-ligand complexes for more generalizable binding affinity prediction. Preprint at https://arxiv.org/abs/2308.09639 (2023). doi:10.1021/acs.jpcb.5c08598
- 3. Durairaj, J. et al. PLINDER: the protein-ligand interactions dataset and evaluation resource. Preprint at https://www.biorxiv.org/content/10.1101/2024.07.17.603955 (2024).
- 4. Kovtun, D. et al. PINDER: the protein interaction dataset and evaluation resource. Preprint at https://www.biorxiv.org/content/10.1101/2024.07.17.603980 (2024).
- 5. Bernett, J., Blumenthal, D. B. & List, M. Cracking the black box of deep sequence-based protein–protein interaction prediction. Brief. Bioinform. 25, bbae076 (2023). doi:10.1093/bib/bbae076
- 6. Ramsundar, B., Eastman, P., Walters, P., Pande, V., Leswing, K. & Wu, Z. Deep Learning for the Life Sciences (O’Reilly Media, 2019).
- 8. Choudhary, P. et al. PDB NextGen Archive: centralizing access to integrated annotations and enriched structural information by the Worldwide Protein Data Bank. Database 2024, baae041 (2024). doi:10.1093/database/baae041
- 11. Sanders, P. & Schulz, C. Think Locally, Act Globally: Highly Balanced Graph Partitioning. In Experimental Algorithms. SEA 2013 (eds Bonifaci, V., Demetrescu, C. & Marchetti-Spaccamela, A.) (Springer, 2013).
- 13. Alanis-Lobato, G., Andrade-Navarro, M. A. & Schaefer, M. H. HIPPIE v2.0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic Acids Res. 45, D408–D414 (2016). doi:10.1093/nar/gkw985
