# How far are we from the era of big data in transcriptomics? Lessons from the bacterial data in GEO

**Authors:** A S Escobedo-Muñoz, Diego Carmona-Campos, Armando G G Trapaga, Julio A Freyre-González

PMC · DOI: 10.1093/bib/bbaf560 · Briefings in Bioinformatics · 2025-10-23

## TL;DR

The paper assesses the state of bacterial microarray and RNA-seq data and metadata in GEO, and finds that inconsistent metadata and non-standard data formats hinder large-scale reuse.

## Contribution

The study identifies specific obstacles to reusing bacterial microarray data in GEO and proposes guidelines to improve the Findability, Accessibility, Interoperability, and Reusability (FAIR) of the repository.

## Key findings

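- Microarray data still account for ~48% of bacterial transcriptomic entries in GEO, despite the accelerated growth of RNA-seq experiments.
- The lack of standard formats restricts the reusability potential to at least 44% of the ~45 000 bacterial microarray entries.
- Inconsistencies in both the GEO metadata documentation and community usage limit automated access to the biological context needed to interpret high-throughput analyses.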

## Abstract

The Gene Expression Omnibus (GEO) is the largest functional genomics repository, including ~5 million entries related to the main transcriptomic technologies: microarrays and RNA-seq. This amount of data has the potential to be reused in large-scale meta-analyses, such as those in bacterial systems biology, where the landscape of biological conditions is wider and more diverse than any individual experiment alone. Notwithstanding the accelerated growth in RNA-seq experiments, microarrays still account for ~48% of bacterial transcriptomic entries in GEO, highlighting the need to revalue this data. Therefore, in this work, we assess the current state of bacterial microarray and RNA-seq data and metadata. We report diverse inconsistencies in both the GEO metadata documentation and community usage, limiting the automated access to biological context essential for high-throughput analysis interpretation. Additionally, while access to and analysis of RNA-seq data are topics widely discussed by the community, microarray data processing and normalization present challenges that need to be addressed for the proper data integration into large-scale reanalysis. Thus, we delve into the availability and processability of bacterial microarray data in GEO, showing a complex panorama where the lack of standard formats limits our reusability potential to at least 44% of the ~45 000 microarray entries. We conclude that GEO transcriptomic data and metadata should be viewed as valuable resources that require ongoing revision and maintenance. Finally, we propose a series of guidelines to enhance the Findability, Accessibility, Interoperability, and Reusability of GEO, thereby taking a step forward into the era of big data.
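To make the metadata problem concrete, the following is a minimal sketch of the kind of automated check that inconsistent metadata defeats. It is not the authors' pipeline. It assumes GEO's SOFT text layout (entity lines starting with `^`, attribute lines starting with `!`) and the submission convention that sample characteristics are written as `tag: value` pairs. The accessions and field values in `EXAMPLE_SOFT` are hypothetical placeholders, and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical SOFT-style snippet; accessions and values are placeholders.
EXAMPLE_SOFT = """\
^SERIES = GSE_EXAMPLE
!Series_title = Example bacterial expression series
^SAMPLE = GSM_EXAMPLE_1
!Sample_organism_ch1 = Escherichia coli
!Sample_characteristics_ch1 = strain: K-12
!Sample_characteristics_ch1 = growth phase: exponential
^SAMPLE = GSM_EXAMPLE_2
!Sample_organism_ch1 = Escherichia coli
!Sample_characteristics_ch1 = treated with antibiotic
"""


def parse_soft(text):
    """Parse SOFT-style text into a list of (entity_type, accession, attributes)."""
    entities = []
    attrs = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("^"):
            # A caret line opens a new entity, e.g. "^SAMPLE = GSM...".
            etype, _, accession = line[1:].partition("=")
            attrs = defaultdict(list)
            entities.append((etype.strip(), accession.strip(), attrs))
        elif line.startswith("!") and attrs is not None:
            # Attribute lines may repeat, so values are collected in lists.
            key, sep, value = line[1:].partition("=")
            if sep:
                attrs[key.strip()].append(value.strip())
    return entities


def untagged_characteristics(entities):
    """Map each sample accession to characteristics lacking a 'tag: value' form."""
    problems = {}
    for etype, accession, attrs in entities:
        if etype != "SAMPLE":
            continue
        bad = [
            value
            for key, values in attrs.items()
            if key.startswith("Sample_characteristics")
            for value in values
            if ":" not in value
        ]
        if bad:
            problems[accession] = bad
    return problems


if __name__ == "__main__":
    print(untagged_characteristics(parse_soft(EXAMPLE_SOFT)))
```

Free-text values such as the one in the second placeholder sample cannot be mapped to a condition without manual curation, which is the kind of obstacle to large-scale reanalysis the abstract describes.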

## Full-text entities

- **Species:** Bacillus subtilis (species) [taxon 1423], Staphylococcus aureus (species) [taxon 1280], Escherichia coli (E. coli, species) [taxon 562], Homo sapiens (human, species) [taxon 9606], Pseudomonas aeruginosa (species) [taxon 287], Mycobacterium tuberculosis (species) [taxon 1773]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12548026/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12548026/full.md

## References

74 references — full list in the complete paper: https://tomesphere.com/paper/PMC12548026/full.md

---
Source: https://tomesphere.com/paper/PMC12548026