# Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference

**Authors:** Frixos Papadopoulos, Tilman Sanchez-Elsner, Mahesan Niranjan, Ashley I. Heinson

PMC · DOI: 10.1371/journal.pone.0325531 · PLOS One · 2025-08-06

## TL;DR

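This paper compares simple bag-of-words histograms of protein sequences with representations learnt by self-supervised pre-training on a large protein database, and finds that the simpler histograms perform better on sequence similarity and protein inference tasks.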

## Contribution

The study shows that baseline bag-of-words histogram representations outperform sequence representations learnt by self-supervised pre-training on multiple protein inference tasks.

## Key findings

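- Bag-of-words histograms outperform representations learnt by self-supervised pre-training on sequence similarity and protein inference tasks.
- Feature selection shows that the top discriminant features help bag-of-words capture information that matters for data-driven function prediction.
- The results suggest that alternative pre-training schemes may be needed to learn representations that capture more meaningful biological information from the sequence alone.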

## Abstract

Inferring protein function is a fundamental and long-standing problem in biology. Laboratory experiments in this field are often expensive, and therefore large-scale computational protein inference from readily available amino acid sequences is needed to understand in more detail the mechanisms underlying biological processes in living organisms. Recently, studies have utilised mathematical ideas from natural language processing and self-supervised learning, to derive features based on protein sequence information. In the area of language modelling, it has been shown that learnt representations from self-supervised pre-training can capture the semantic information of words well for downstream applications. In this study, we tested the ability of sequence-based protein representations learnt using self-supervised pre-training on a large protein database, on multiple protein inference tasks. We show that simple baseline representations in the form of bag-of-words histograms perform better than those based on self-supervised learning, on sequence similarity and protein inference tasks. By feature selection we show that the top discriminant features help bag-of-words capture important information for data-driven function prediction. These findings could have important implications for self-supervised learning models on protein sequences, and might encourage the consideration of alternative pre-training schemes for learning representations that capture more meaningful biological information from the sequence alone.
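To make the two kinds of representation named in the title concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' pipeline: the "words" are assumed to be overlapping k-mers over the 20 standard amino acids, and the function names (`bag_of_words`, `sum_of_embeddings`), the choice of `k`, and the toy embedding table are all hypothetical. The summary above does not state the word size or the pre-trained model used in the paper.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acid one-letter codes (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vocabulary(k):
    """All possible k-mers over the alphabet, in a fixed lexicographic order."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]


def kmers(sequence, k):
    """Overlapping k-mers of a sequence, read left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_words(sequence, k=1, normalise=True):
    """Histogram of k-mer counts; word order in the sequence is discarded."""
    counts = Counter(kmers(sequence, k))
    hist = [counts.get(word, 0) for word in kmer_vocabulary(k)]
    total = sum(hist)
    if normalise and total > 0:
        hist = [c / total for c in hist]
    return hist


def sum_of_embeddings(sequence, embedding, k=1):
    """Sum the embedding vectors of every k-mer found in the embedding table.

    `embedding` is a hypothetical dict mapping k-mer -> list of floats; in
    practice it would come from a pre-trained model. Unknown k-mers are skipped.
    """
    dim = len(next(iter(embedding.values())))
    total = [0.0] * dim
    for word in kmers(sequence, k):
        vector = embedding.get(word)
        if vector is None:
            continue
        for j, value in enumerate(vector):
            total[j] += value
    return total


# Toy usage with a made-up 2-dimensional embedding for two residues.
toy_embedding = {"A": [1.0, 0.0], "C": [0.0, 2.0]}
histogram = bag_of_words("AAC", k=1, normalise=False)
summed = sum_of_embeddings("AAC", toy_embedding, k=1)
```

One point the sketch makes visible: a sum of per-word embeddings is a linear function of the unnormalised word-count histogram (each count multiplies that word's vector), so in this simplified setting the summed representation cannot carry more information than the histogram it is derived from.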

## Full-text entities

- **Chemicals:** amino acid (MESH:D000596)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12327643/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12327643/full.md

## References

73 references — full list in the complete paper: https://tomesphere.com/paper/PMC12327643/full.md

---
Source: https://tomesphere.com/paper/PMC12327643