# Tracking Down the Evolution of Microorganisms by Exhaustive Bottom-Up Analysis of Proteomes

**Authors:** Dmitrii O. Kostenko, Natalya S. Bogatyreva, Alexey N. Fedorov

PMC · DOI: 10.3390/ijms27010109 · International Journal of Molecular Sciences · 2025-12-22

## TL;DR

This study introduces a bottom-up method that analyzes whole proteomes through short amino acid k-mer (k = 1, 2, 3) frequencies. Across 86 bacterial proteomes, the relationships derived from these frequencies correlate strongly with evolutionary relationships.

## Contribution

The study introduces a bottom-up proteomic analysis method using k-mer frequencies to trace microbial evolution.

## Key findings

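- The k-mer frequency vector coevolves unambiguously with the entire proteome, a pattern not observed within specific protein groups such as ribosomal or nucleotide-binding proteins.
- Even a simple analysis of tripeptide frequencies can precisely position proteomes in k-mer space.
- Relationships derived from k-mer comparisons reach up to a 99% match with the reference classification of proteomes within major bacterial clades.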

## Abstract

Proteomes are typically analyzed at the level of individual proteins or protein families. In this study, we introduce a bottom-up approach that treats proteomes as holistic entities by examining the properties of k-mers within entire proteomes and protein groups. We performed a comprehensive analysis of short amino acid k-mer (k = 1, 2, 3) distributions across all proteins in a given proteome. Using 86 bacterial proteomes representing 18 clades, we evaluated whether k-mer frequencies uniquely characterize the analyzed organisms. Remarkably, in a post hoc analysis, we found that the k-mer frequency vector unambiguously coevolves with the entire proteome, a pattern not observed even within specific protein groups, such as conserved ribosomal proteins or more variable nucleotide-binding proteins. This finding holds regardless of the k-mer calculation parameters or the distance metrics employed. Our results show that even a simple analysis based on tripeptide frequencies can precisely position proteomes within the k-mer space. Moreover, relationships derived from k-mer comparisons correlate strongly with evolutionary relationships derived from phylogenetic trees, reaching up to a 99% match with the reference classification of the proteomes within major bacterial clades. These findings establish k-mer-based proteomic analysis as an additional robust and powerful feature for characterizing evolutionary relationships, opening new pathways in phylogenetics and evolutionary genomics.
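
To make the idea of a proteome-wide k-mer frequency vector concrete, the following is a minimal sketch and not the authors' pipeline. It assumes the 20 standard amino acids as the alphabet, skips any window containing another symbol, normalizes counts to relative frequencies, and uses Euclidean distance as one possible metric. The function names `kmer_frequencies` and `euclidean` and the toy sequences are hypothetical; this summary does not specify the paper's actual parameters or metrics.

```python
from collections import Counter
from itertools import product
import math

# Assumption: restrict the alphabet to the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AMINO_SET = set(AMINO_ACIDS)


def kmer_frequencies(proteins, k=3):
    """Relative k-mer frequencies pooled over all proteins of one proteome.

    The vector has 20**k entries in a fixed lexicographic order, so vectors
    from different proteomes are directly comparable.
    """
    counts = Counter()
    for seq in proteins:
        seq = seq.upper()
        # Slide a window of width k along each protein sequence.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows with non-standard or ambiguous residues.
            if all(residue in AMINO_SET for residue in kmer):
                counts[kmer] += 1
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(keys)
    return [counts[key] / total for key in keys]


def euclidean(u, v):
    """Euclidean distance between two equally long frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy "proteomes" (hypothetical sequences, for illustration only).
proteome_a = ["MKVLAAGIV", "MKVLSAGIV"]
proteome_b = ["MKVLAAGIA"]
proteome_c = ["WWWWHHHHCC"]

va = kmer_frequencies(proteome_a, k=3)
vb = kmer_frequencies(proteome_b, k=3)
vc = kmer_frequencies(proteome_c, k=3)

# Proteomes sharing many tripeptides end up closer in k-mer space.
print(euclidean(va, vb) < euclidean(va, vc))
```

In this toy setting, `proteome_a` and `proteome_b` share most of their tripeptides, so they lie closer together than `proteome_a` and `proteome_c`. With real data, each proteome would be read from a FASTA file, and a pairwise distance matrix over all vectors could then be compared against a reference classification.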

## Full-text entities

- **Diseases:** injury to (MESH:D014947), COVID-19 (MESH:D000086382)
- **Chemicals:** Valine (MESH:D014633), Tyrosine (MESH:D014443), Leucine (MESH:D007930), Glutamic acid (MESH:D018698), dipeptides (MESH:D004151), Glycine (MESH:D005998), Asparagine (MESH:D001216), Proline (MESH:D011392), Methionine (MESH:D008715), pyrrolysine (MESH:C456839), Isoleucine (MESH:D007532), Alanine (MESH:D000409), Glutamine (MESH:D005973), Cysteine (MESH:D003545), Tryptophan (MESH:D014364), selenocysteine (MESH:D017279), Threonine (MESH:D013912), Aromatic (-), Amino acid (MESH:D000596), Aspartic acid (MESH:D001224), Serine (MESH:D012694), Phenylalanine (MESH:D010649)
- **Species:** Parabacteroides distasonis (species) [taxon 823], Aquifex aeolicus (species) [taxon 63363], Thermotogota (phylum) [taxon 200918], Escherichia coli (E. coli, species) [taxon 562], Planctopirus limnophila (species) [taxon 120], Azorhizobium caulinodans (species) [taxon 7], Geobacter sulfurreducens (species) [taxon 35554], Fervidobacterium nodosum (species) [taxon 2424], Dehalococcoides mccartyi (species) [taxon 61435], Herbaspirillum seropedicae (species) [taxon 964], Nitrospira defluvii (species) [taxon 330214], Coraliomargarita akajimensis (species) [taxon 395922], Granulicella tundricola (species) [taxon 940615], Cetobacterium ceti (species) [taxon 180163], Bacteria Latreille et al. 1825 (Bacteria stick insect, genus) [taxon 629395], Chloroflexia (class) [taxon 32061], Homo sapiens (human, species) [taxon 9606], Chloroflexus aurantiacus (species) [taxon 1108]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12785394/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12785394/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC12785394/full.md

---
Source: https://tomesphere.com/paper/PMC12785394