# Better Inputs, Better Learning: A Peptide Embedding Tutorial for Proteomic Mass Spectrometry

**Authors:** Luke Squires, Jose Humberto Giraldez Chavez, Alfred Nilsson, Lukas Käll, Samuel H Payne

PMC · DOI: 10.1021/acs.jproteome.5c00563 · Journal of Proteome Research · 2026-01-13

## TL;DR

This paper introduces educational tools to teach how to convert peptides into numeric formats for deep learning in proteomics.

## Contribution

Provides free, hands-on tutorials for creating and comparing peptide embeddings in proteomics.

## Key findings

- Five embedding strategies are demonstrated, ranging from simplistic single-number encodings to state-of-the-art pretrained embeddings.
- The final notebook benchmarks the performance of each embedding method.
- The tutorials aim to bridge the gap between proteomics and machine learning.

## Abstract

Mass spectrometry proteomics creates complex data representing
the peptide/protein contents of biological samples. Various types
of machine learning have been central to computational methods used
to identify peptides from tandem mass spectra and numerous other aspects
of the data analysis process. As deep learning has emerged as a powerful
machine learning method for modeling and interpreting data, computational
proteomics researchers have leveraged large publicly available data
sets to train machine learning models to predict peptide fragmentation
spectra and liquid chromatography retention time. Resources like proteomicsML
offer extensive demonstrative tutorials for these learning tasks and
are closing the gap between the proteomics and machine learning communities.
However, in these and other educational materials on deep learning,
the critical step of preparing data for learning is frequently omitted.
Prior to learning, peptide strings must be converted into a numeric
format: an embedding. There are many different peptide embeddings,
and some vastly outperform others. Yet the process for creating an
embedding, and also the rationale for choosing a specific embedding,
is rarely discussed in our proteomics literature. In this technical
note, we introduce four Google Colab notebooks to teach peptide embeddings.
The series walks users through five different peptide-embedding strategies
from simplistic single-number encodings to state-of-the-art pretrained
embeddings through both code examples and narrative descriptions.
The final notebook compares the five embeddings in a head-to-head
benchmark. By making these notebooks free, we hope to lower the barrier
for researchers who want to bring modern deep learning into their
proteomics workflows.
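
To make the idea of converting a peptide string into a numeric format concrete, here is a minimal sketch of two simple encodings: a per-residue integer code and a one-hot matrix. It is not taken from the notebooks; the abstract does not name the five strategies beyond the two ends of the range, so the choice of these two schemes, the alphabetical residue ordering, the use of 0 for padding, and the names `AMINO_ACIDS`, `integer_encode`, and `one_hot_encode` are all illustrative assumptions.

```python
# Illustrative sketch only; not the code from the tutorial notebooks.
# Assumption: the 20 standard residues, ordered alphabetically by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: index 0 is reserved for padding, so residues map to 1..20.
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def integer_encode(peptide, max_len):
    """Map each residue to an integer and right-pad with 0 up to max_len."""
    if len(peptide) > max_len:
        raise ValueError(f"peptide is longer than max_len={max_len}")
    # Raises KeyError for modified or non-standard residues.
    codes = [AA_TO_INDEX[aa] for aa in peptide]
    return codes + [0] * (max_len - len(codes))


def one_hot_encode(peptide, max_len):
    """Return a max_len x 20 matrix of 0/1 values; padding rows are all zeros."""
    matrix = []
    for code in integer_encode(peptide, max_len):
        row = [0] * len(AMINO_ACIDS)
        if code > 0:
            row[code - 1] = 1
        matrix.append(row)
    return matrix


codes = integer_encode("PEPTIDE", max_len=10)
one_hot = one_hot_encode("PEPTIDE", max_len=10)
print(codes)  # [13, 4, 13, 17, 8, 3, 4, 0, 0, 0]
```

Padding to a fixed length is one common way to batch peptides of different lengths. The integer code imposes an arbitrary ordering on the residues, which the one-hot form avoids at the cost of a larger input.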

## Full-text entities

- **Chemicals:** acid (MESH:D000143), Amino Acid (MESH:D000596)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12888018/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12888018/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12888018/full.md

---
Source: https://tomesphere.com/paper/PMC12888018