# Towards a more accurate and reliable evaluation of machine learning protein–protein interaction prediction model performance in the presence of unavoidable dataset biases

**Authors:** Alba Nogueira-Rodríguez, Daniel Glez-Peña, Cristina P. Vieira, Jorge Vieira, Hugo López-Fernández

PMC · DOI: 10.1515/jib-2024-0054 · Journal of Integrative Bioinformatics · 2025-04-02

## TL;DR

This paper introduces a per-protein metric, pp_MCC, for evaluating machine learning models that predict protein–protein interactions, and shows that measured performance drops once dataset biases are taken into account, even under a random split.

## Contribution

The novel per-protein utility metric, pp_MCC, provides a more realistic performance estimation in the presence of dataset biases.
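To make the idea of a per-protein score concrete, here is a minimal sketch. It rests on an assumption, since this summary does not give the paper's exact definition of pp_MCC: MCC is computed separately for each protein over the pairs that involve it, then averaged, and an undefined MCC (only one class present) is scored as 0. The names `mcc` and `per_protein_mcc` are hypothetical helpers, not the authors' code.

```python
import math
from collections import defaultdict


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: an undefined MCC (zero denominator) is scored as 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def per_protein_mcc(pairs, y_true, y_pred):
    """Mean of per-protein MCC values (an assumed reading of pp_MCC)."""
    by_protein = defaultdict(lambda: ([], []))
    for (a, b), t, p in zip(pairs, y_true, y_pred):
        for prot in {a, b}:  # a set, so a self-pair is counted once
            by_protein[prot][0].append(t)
            by_protein[prot][1].append(p)
    scores = {prot: mcc(ts, ps) for prot, (ts, ps) in by_protein.items()}
    return sum(scores.values()) / len(scores), scores


# Toy data: protein A appears only in positive pairs, E only in negatives.
pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("E", "F"), ("E", "G"), ("E", "H")]
y_true = [1, 1, 1, 0, 0, 0]
# A "model" that has only memorised that A interacts with everything.
y_pred = [1 if "A" in pair else 0 for pair in pairs]

global_score = mcc(y_true, y_pred)
mean_score, per_protein = per_protein_mcc(pairs, y_true, y_pred)
print(global_score)  # 1.0
print(mean_score)    # 0.0
```

In this toy case the global MCC is perfect while every per-protein score is 0, because no single protein has both positive and negative pairs for the model to tell apart. That is the kind of protein-level bias a per-protein metric is meant to expose.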

## Key findings

- The pp_MCC metric reveals reduced model performance in random and unseen-protein splits.
- Models that use only primary sequence information reach a low adjusted performance, suggesting that complementary protein data are needed.
- The proposed metric allows realistic evaluation while still using random splits.
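
The difference between a random split and an unseen-protein split can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: a fraction of proteins is held out, test pairs are those in which both proteins are held out, and pairs that mix a held-out and a training protein are discarded. The function name `unseen_protein_split` and its parameters are hypothetical.

```python
import random
from itertools import combinations


def unseen_protein_split(pairs, test_fraction=0.2, seed=0):
    """Split pairs so that no test protein ever appears in training."""
    proteins = sorted({p for pair in pairs for p in pair})
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(proteins)
    n_test = max(1, round(len(proteins) * test_fraction))
    held_out = set(proteins[:n_test])

    # Training pairs contain no held-out protein at all.
    train = [pair for pair in pairs if not (set(pair) & held_out)]
    # Assumption: test pairs require BOTH proteins to be held out;
    # mixed pairs are dropped rather than assigned to either side.
    test = [pair for pair in pairs if set(pair) <= held_out]
    return train, test, held_out


# Toy data: all pairs among 10 placeholder proteins P0..P9.
all_pairs = list(combinations([f"P{i}" for i in range(10)], 2))
train, test, held_out = unseen_protein_split(all_pairs, test_fraction=0.3)

train_proteins = {p for pair in train for p in pair}
test_proteins = {p for pair in test for p in pair}
print(len(train), len(test))                   # 21 3
print(train_proteins.isdisjoint(test_proteins))  # True
```

A random split would instead shuffle the pairs themselves, so the same protein would usually appear on both sides. The price of the stricter split is visible in the toy numbers: 21 of the 45 pairs are discarded because they mix held-out and training proteins.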

## Abstract

The characterization of protein–protein interactions (PPIs) is fundamental to understanding cellular functions. Machine learning (ML) methods for this task, including those that use only raw protein sequences, have historically reported prediction accuracies of up to 95%. However, it has been pointed out that these figures may be overestimated because of the use of random splits and of metrics that do not take into account potential biases in the datasets. Here, we propose a per-protein utility metric, pp_MCC, which reveals a drop in performance in both random and unseen-protein split scenarios. We tested ML models based on sequence embeddings. The pp_MCC metric shows reduced performance even in a random split, reaching levels similar to those of the raw MCC metric computed over an unseen-protein split, and it drops even further when pp_MCC is used in an unseen-protein split scenario. Thus, the metric gives a more realistic performance estimate while still allowing the use of random splits, which could be of interest for more protein-centric studies. Given the low adjusted performance obtained, there seems to be room for improvement when using only primary sequence information, which suggests the need to include complementary protein data together with the use of the pp_MCC metric.

## Full-text entities

- **Chemicals:** acid (MESH:D000143), water (MESH:D014867), amino acid (MESH:D000596)
- **Species:** Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12569588/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12569588/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC12569588/full.md

---
Source: https://tomesphere.com/paper/PMC12569588