# Protein-coding genes in humans and model mammals (mouse, rat and pig): gene identifiers and disambiguation of gene nomenclature retrieved from the Ensembl genome browser

**Authors:** Grzegorz R. Juszczak, Chandra S. Pareek, Urszula Czarnik, Mariusz Pierzchała

PMC · DOI: 10.1186/s12864-025-12329-8 · BMC Genomics · 2025-12-17

## TL;DR

This paper analyzes gene symbols in humans and model mammals to highlight issues with ambiguity and proposes a solution using stable IDs and an R script.

## Contribution

The paper introduces an R script (REgeness) to update gene symbols and identify ambiguities using Ensembl data.

## Key findings

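- Gene nomenclature is incomplete (some genes have no assigned symbol, especially in pigs) and ambiguous: symbols that map to more than one gene may complicate identification of about 10% of rat and mouse genes and 18% of human protein-coding genes.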
- Stable gene IDs from databases like Ensembl and NCBI can resolve ambiguity in gene identification.
- An R script was developed to integrate and update gene symbols with annotations on ambiguity.
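
The kind of ambiguity described in these findings can be illustrated with a minimal Python sketch. It is not the authors' script and does not query Ensembl. It assumes a small in-memory table of records, and every stable ID, symbol and synonym in it is an invented placeholder. A symbol counts as ambiguous here when it is used, as an official symbol or as a synonym, by more than one stable ID.

```python
from collections import defaultdict

# Toy records: (stable_id, official_symbol, synonyms).
# All values are invented placeholders, not real Ensembl entries.
TOY_GENES = [
    ("ID_0001", "AAA1", ["XYZ", "P10"]),
    ("ID_0002", "BBB2", ["XYZ"]),
    ("ID_0003", "CCC3", []),
    ("ID_0004", "P10", ["DDD"]),
    ("ID_0005", None, []),  # a gene without an assigned symbol
]


def build_symbol_index(genes):
    """Map every symbol (official or synonym) to the set of stable IDs using it."""
    index = defaultdict(set)
    for stable_id, official, synonyms in genes:
        names = ([official] if official else []) + list(synonyms)
        for name in names:
            index[name].add(stable_id)
    return index


def ambiguous_symbols(index):
    """Keep only the symbols that map to more than one stable ID."""
    return {symbol: ids for symbol, ids in index.items() if len(ids) > 1}


def affected_gene_fraction(genes, index):
    """Fraction of genes that carry at least one ambiguous symbol."""
    affected = set()
    for ids in ambiguous_symbols(index).values():
        affected |= ids
    return len(affected) / len(genes)


index = build_symbol_index(TOY_GENES)
ambiguous = ambiguous_symbols(index)
unnamed = [stable_id for stable_id, official, _ in TOY_GENES if official is None]

for symbol in sorted(ambiguous):
    print(symbol, sorted(ambiguous[symbol]))
print("genes without a symbol:", unnamed)
print("fraction of genes affected:", affected_gene_fraction(TOY_GENES, index))
```

Matching is exact and case-sensitive in this sketch, which is a simplifying assumption. The stable IDs stay unique throughout, which is why the paper recommends reporting them alongside symbols.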

## Abstract

Gene nomenclature contains current official symbols and various numbers of synonyms, which pose a challenge to integrating genomic data and increase the probability that different genes share the same symbol. Therefore, we retrieved identifiers assigned to all protein-coding genes in human, mouse, rat and pig genomes that are available in the Ensembl genome browser (release 113) to assess the number of genes, compare species and identify ambiguous symbols.

Results: Our analysis revealed that the total number of symbols, both official symbols and synonyms, used to identify protein-coding genes ranges from 16,600 in pigs to 64,580 in mice. Furthermore, the gene nomenclature is not complete because there are also genes without an assigned symbol, which indicates gaps in understanding protein-coding genes, especially in pigs. We also found a large number of gene symbols that map to more than one gene. These symbols might complicate the identification of about 10% of rat and mouse genes and 18% of human protein-coding genes. A simple solution for this problem is the usage of stable gene IDs assigned by scientific institutions and committees (Ensembl, NCBI, RGD, HGNC and VGNC) provided that the genomic information associated with these IDs is retrieved directly from proprietary databases containing the most accurate data. Finally, although gene symbols may pose a problem with unequivocal identification of genes, there are instances when no other identifiers are available in the literature. Therefore, we have developed an R script performing search of the Ensembl database and integrating data to provide a single list of updated symbols with annotation about their ambiguity.

Conclusions: Gene symbols are not always reliable and should be reported together with stable IDs to enable unequivocal identification of genes. Therefore, data containing only gene symbols should be used cautiously to avoid misidentification of genes. A solution for this problem is our R script REgeness that performs a gene symbol update to current official versions combined with identification of ambiguous symbols and retrieval of other IDs from the Ensembl database.
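
The symbol update with ambiguity annotation mentioned in the abstract could look roughly like the following Python sketch. This is not REgeness and makes no claim about how that script works internally. It assumes a hypothetical in-memory lookup table in place of an Ensembl query, and the `Gene` class, the `update_symbol` function and all IDs and symbols are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    stable_id: str          # placeholder ID, not a real Ensembl ID
    official: str           # current official symbol
    synonyms: tuple = ()    # older or alternative symbols


# Hypothetical lookup table standing in for data retrieved from a database.
TOY_TABLE = [
    Gene("ID_0001", "NEW1", ("OLD1", "SHARED")),
    Gene("ID_0002", "NEW2", ("SHARED",)),
    Gene("ID_0003", "NEW3"),
]


def update_symbol(symbol, table):
    """Return the current official symbol for `symbol`, annotated with its status."""
    hits = [g for g in table if symbol == g.official or symbol in g.synonyms]
    if not hits:
        status, updated = "not_found", None
    elif len(hits) == 1:
        status, updated = "unique", hits[0].official
    else:
        # More than one gene uses this symbol, so no single update is safe.
        status, updated = "ambiguous", None
    return {
        "input": symbol,
        "updated": updated,
        "status": status,
        "stable_ids": [g.stable_id for g in hits],
    }


report = [update_symbol(s, TOY_TABLE) for s in ["OLD1", "NEW3", "SHARED", "NOPE"]]
for row in report:
    print(row)
```

Leaving `updated` empty for an ambiguous symbol, while still listing every candidate stable ID, is a design choice of this sketch: the decision goes back to the user rather than being guessed.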

The online version contains supplementary material available at 10.1186/s12864-025-12329-8.

## Linked entities

- **Species:** Homo sapiens (taxon 9606), Mus musculus (taxon 10090), Rattus norvegicus (taxon 10116)

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606], Rattus norvegicus (brown rat, species) [taxon 10116], Sus scrofa (pig, species) [taxon 9823], Mus musculus (house mouse, species) [taxon 10090]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12822150/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12822150/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC12822150/full.md

---
Source: https://tomesphere.com/paper/PMC12822150