# PepFoundry: A Pipeline for Building Machine-Learning Ready Representations of Nonstandard Peptides Containing Cycles, Non-natural Residues, Polymer Units, and More

**Authors:** Daniel Garzon Otero, Omid Akbari, Aneesh Mandapati, Camille Bilodeau

PMC · DOI: 10.1021/acs.jcim.5c02629 · Journal of Chemical Information and Modeling · 2026-01-13

## TL;DR

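PepFoundry is a Python package that converts peptides with nonstandard features (cycles, noncanonical residues, polymer units) into atom-level, machine-learning-ready representations. On the dataset studied, these outperform sequence-level representations for property prediction.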

## Contribution

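PepFoundry introduces a pipeline that uses SMILES strings in the CHUCKLES format to convert nonstandard peptide sequences into atom-mapped RDKit molecule objects, from which atom-level ML features such as Morgan fingerprints and graph representations can be extracted.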

## Key findings

- PepFoundry generates atom-mapped RDKit objects from nonstandard peptides using SMILES strings.
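- For peptides containing noncanonical amino acids, atomic-level representations consistently outperform sequence-level ones for property prediction, regardless of model type.
- Latent space visualization shows that models given atomic-level information can learn relationships between analogous sequences of L-peptides, D-peptides, and peptoids.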

## Abstract

Peptides featuring synthetic modifications, such as noncanonical
amino acids, backbone modifications, cyclic structures, and polymer
units, have become central to modern drug design due to their enhanced
stability and functional diversity. However, current machine learning
(ML) approaches are restricted by challenges associated with transforming
peptide sequences into atom-level representations, leading ML efforts
to focus largely on datasets containing linear peptides composed
of standard residues. Here, we present PepFoundry, a Python package
that handles peptide sequences beyond canonical amino acids and linear
topologies by using SMILES strings in the CHUCKLES format. PepFoundry
generates atom-mapped RDKit molecule objects, enabling the extraction
of atom-level features, such as Morgan fingerprints and graph representations.
We demonstrate its utility by processing a dataset of peptide sequences
containing noncanonical amino acids and generating atomic-level features
for downstream property prediction. We show that atomic-level representations
of peptides containing noncanonical amino acids consistently outperform
sequence-level representations, regardless of model type. We additionally
explore the representation of noncanonical peptides through latent
space visualization and show that models with atomic-level information
can effectively learn relationships between analogous sequences of L-peptides, D-peptides, and peptoids. This framework
allows for the flexible incorporation of new amino acid chemistries,
enabling existing ML methods to be straightforwardly applied to datasets
of peptides containing nonstandard features. It also facilitates the
rapid construction of customized peptide libraries and provides a
scalable platform to accelerate ML-driven peptide discovery and optimization.
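
To make the CHUCKLES idea in the abstract concrete, here is a minimal sketch of how a residue sequence can be turned into a SMILES string by concatenating per-monomer fragments. It is not PepFoundry's API: the monomer table, the residue codes, and the function name `chuckles_to_smiles` are hypothetical and chosen only for illustration. The sketch assumes each fragment is written N-terminus first and ends at the backbone carbonyl, so that joining fragments forms the amide bonds implicitly.

```python
# Hypothetical monomer table (an assumption, not taken from PepFoundry).
# Each fragment starts at the backbone nitrogen and ends at the backbone
# carbonyl, so plain string concatenation yields a valid peptide SMILES.
MONOMERS = {
    "G": "NCC(=O)",            # glycine
    "A": "N[C@@H](C)C(=O)",    # L-alanine
    "a": "N[C@H](C)C(=O)",     # D-alanine (opposite chirality tag)
    "Nme": "N(C)CC(=O)",       # sarcosine (N-methylglycine), a simple peptoid unit
}


def chuckles_to_smiles(residues):
    """Join monomer fragments and cap the C-terminus as a carboxylic acid."""
    unknown = [r for r in residues if r not in MONOMERS]
    if unknown:
        raise KeyError(f"no SMILES fragment for: {unknown}")
    # The trailing "O" completes the terminal carboxylic acid group.
    return "".join(MONOMERS[r] for r in residues) + "O"


# A linear tripeptide mixing a standard residue, an L-residue, and a peptoid unit.
print(chuckles_to_smiles(["G", "A", "Nme"]))
```

A string built this way could then be parsed with RDKit (for example with `Chem.MolFromSmiles`) to obtain a molecule object for atom-level featurization. Supporting a new chemistry in this sketch only requires adding a fragment to the table. Cyclization and atom mapping, which the abstract attributes to PepFoundry, are outside the scope of this sketch.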

## Full-text entities

- **Chemicals:** N-acetyl-l-proline (MESH:C586914), Amino Acid (MESH:D000596), AMP (MESH:D000089882), l-leucine (MESH:D007930), nitrogen (MESH:D009584), phenylalanine (MESH:D010649), dipeptide (MESH:D004151), glycine (MESH:D005998), peptide (MESH:D010455), CAAs (-), l-Arginine (MESH:D001120), acid (MESH:D000143), oxygen (MESH:D010100), polymer (MESH:D011108), peptoid (MESH:D034444), carbon (MESH:D002244), Proline (MESH:D011392), hydrogen (MESH:D006859)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12848965/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12848965/full.md

## References

57 references — full list in the complete paper: https://tomesphere.com/paper/PMC12848965/full.md

---
Source: https://tomesphere.com/paper/PMC12848965