# Identification of Small Open Reading Frame-encoded Proteins in the Human Genome

**Authors:** Hitesh Kore, Satomi Okano, Keshava K Datta, Jackson Thorp, Parthiban Periasamy, Mayur Divate, Upekha Liyanage, Gunter Hartel, Shivashankar H Nagaraj, Harsha Gowda

PMC · DOI: 10.1093/gpbjnl/qzaf004 · Genomics, Proteomics & Bioinformatics · 2025-02-07

## TL;DR

This study predicts translation from 4008 previously overlooked small open reading frames (sORFs) in the human genome and identifies 825 sORF-encoded proteins (SEPs) from proteomic data, expanding the catalog of protein-coding genes.

## Contribution

The study introduces an integrated proteogenomics workflow to identify reliable small open reading frame-encoded proteins (SEPs) in the human genome.

## Key findings

- 4008 sORFs showed recurrent ribosome occupancy signals across samples, indicating potential protein translation.
- 825 SEPs were identified using proteomic data, some located in GWAS loci linked to traits and diseases.
- Peptides from SEPs are presented by MHC-I, similar to canonical proteins, suggesting immune relevance.

## Abstract

One of the main goals of the Human Genome Project is to identify all protein-coding genes. There are ~20,500 protein-coding genes annotated in the human reference databases. However, in the last few years, proteogenomics studies have predicted thousands of novel protein-coding regions, including low-molecular-weight proteins encoded by small open reading frames (sORFs) in untranslated regions of messenger RNAs and non-coding RNAs. Most of these predictions are based on bioinformatics analyses and ribosome footprint data. The validity of some of these sORF-encoded proteins (SEPs) has been established through functional characterization. With the growing number of predicted novel proteins, a strategy to identify reliable candidates that warrant further studies is needed. In this study, we developed an integrated proteogenomics workflow to identify a reliable set of novel protein-coding regions in the human genome based on their recurrent observations across multiple samples. Publicly available ribosome profiling and global proteomic datasets were used to establish protein-coding evidence. We predicted protein translation from 4008 sORFs based on recurrent ribosome occupancy signals across samples. In addition, we identified 825 SEPs based on proteomic data. Some of the novel protein-coding regions identified were located in genome-wide association study (GWAS) loci associated with various traits and disease phenotypes. Peptides from SEPs are also presented by major histocompatibility complex class I (MHC-I), similar to canonical proteins. Novel protein-coding regions reported in this study expand the current catalog of protein-coding genes and warrant experimental studies to elucidate their cellular functions and potential roles in human diseases.
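The recurrence idea in the abstract (keeping only sORFs observed across multiple samples) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `recurrent_sorfs`, the sample and sORF identifiers, and the `min_samples` threshold are all hypothetical, and the actual recurrence criteria are described only in the full paper.

```python
from collections import defaultdict


def recurrent_sorfs(calls_by_sample, min_samples=2):
    """Return {sorf_id: number of supporting samples} for sORFs seen
    in at least `min_samples` distinct samples.

    `calls_by_sample` maps a sample name to the set of sORF IDs called
    as translated in that sample. The threshold is an illustrative
    assumption, not a value taken from the paper.
    """
    support = defaultdict(set)
    for sample, sorf_ids in calls_by_sample.items():
        for sorf_id in sorf_ids:
            support[sorf_id].add(sample)
    return {
        sorf_id: len(samples)
        for sorf_id, samples in support.items()
        if len(samples) >= min_samples
    }


# Hypothetical toy input: per-sample translation calls from ribosome profiling.
ribo_calls = {
    "sample_A": {"sorf_1", "sorf_2"},
    "sample_B": {"sorf_1", "sorf_3"},
    "sample_C": {"sorf_1", "sorf_2"},
}

recurrent = recurrent_sorfs(ribo_calls, min_samples=2)
```

Here `sorf_3` is dropped because only one sample supports it. Raising `min_samples` trades sensitivity for reliability, which is the kind of trade-off a recurrence filter is meant to expose.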


## Linked entities

- **Species:** Homo sapiens (taxon 9606)

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12236067/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12236067/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/PMC12236067/full.md

---
Source: https://tomesphere.com/paper/PMC12236067