# Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics

**Authors:** Hao Xu, Zhichao Wang, Shengqi Sang, Pisit Wajanasara, Nuno Bandeira

arXiv: 2508.21076 · 2025-09-01

## TL;DR

This paper introduces Pep2Prob, a comprehensive dataset and benchmark for predicting peptide-specific fragment ion probabilities in MS$^2$ proteomics, demonstrating that machine learning models leveraging peptide-specific data outperform traditional global statistics.

## Contribution

The paper presents the first dataset and benchmark for peptide-specific fragment ion probability prediction, highlighting the importance of peptide-specific information for accurate MS$^2$ analysis.

## Key findings

- Models that use peptide-specific information outperform methods based only on global fragmentation statistics.
- Performance improves as model capacity increases across the benchmark models.
- The peptide-fragmentation relationship is complex and nonlinear.

## Abstract

Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS$^2$) is the major analytical technique in proteomics: it identifies peptides by ionizing them, fragmenting them, and using the resulting mass spectra to identify and quantify proteins in biological samples. In MS$^2$ analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global statistics of fragmentation, which assume that a fragment's probability is uniform across all peptides. However, this assumption is oversimplified from a biochemical standpoint and limits prediction accuracy. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution, HCD MS$^2$ spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods that use only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.
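
The contrast between peptide-specific and global fragment statistics can be illustrated with a small sketch. It is not the authors' pipeline: the function names, the input layout (one record per annotated spectrum, holding a `(sequence, charge)` precursor and a set of observed fragment labels), and the toy data are all assumptions made for illustration. In particular, pooling every spectrum into one probability per fragment label is a deliberately crude stand-in for a "global statistics" baseline.

```python
from collections import defaultdict

def fragment_probabilities(spectra):
    """Peptide-specific estimate: for each precursor, the fraction of its
    spectra in which each fragment label was observed.
    `spectra` is a list of (precursor, observed_fragments) pairs (assumed layout)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for precursor, observed in spectra:
        totals[precursor] += 1
        for frag in set(observed):  # count each fragment once per spectrum
            counts[precursor][frag] += 1
    return {
        p: {f: c / totals[p] for f, c in counts[p].items()}
        for p in totals
    }

def global_probabilities(spectra):
    """Simplified global baseline: one probability per fragment label,
    pooled over all spectra regardless of precursor."""
    counts = defaultdict(int)
    n = 0
    for _, observed in spectra:
        n += 1
        for frag in set(observed):
            counts[frag] += 1
    return {f: c / n for f, c in counts.items()} if n else {}

# Toy data (hypothetical): fragments are labeled by ion type and position.
spectra = [
    (("PEPTIDE", 2), {"b2", "y3"}),
    (("PEPTIDE", 2), {"b2"}),
    (("PEPTIDE", 2), {"b2", "y3"}),
    (("PEPTIDE", 2), set()),
    (("ACDK", 2), {"y3"}),
    (("ACDK", 2), {"y3"}),
]

per_precursor = fragment_probabilities(spectra)
global_stats = global_probabilities(spectra)
```

In this toy set the pooled estimate for `b2` is 0.5, while the per-precursor estimates are 0.75 for `("PEPTIDE", 2)` and absent (never observed) for `("ACDK", 2)`. A single global value hides that kind of difference between peptides, which is the gap a peptide-specific predictor is meant to close.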

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21076/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21076/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/2508.21076/full.md

---
Source: https://tomesphere.com/paper/2508.21076