# Using semantic search to find publicly available gene-expression datasets

**Authors:** Grace S Brown, James Wengler, Aaron Joyce S Fabelico, Abigail Muir, Anna Tubbs, Amanda Warren, Alexandra N Millett, Xinrui Xiang Yu, Paul Pavlidis, Sanja Rogic, Stephen R Piccolo

PMC · DOI: 10.1093/bioinformatics/btag053 · Bioinformatics · 2026-02-02

## TL;DR

This paper tests whether language-model embeddings of dataset descriptions can improve the search for gene-expression datasets in public repositories such as Gene Expression Omnibus (GEO), starting from a few datasets already known to be relevant.

## Contribution

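The study applies semantic search to dataset discovery in GEO: it embeds dataset descriptions with 30 language models, retrieves the datasets whose descriptions are most similar to those of datasets curators had already linked to a condition, and compares the results against GEO's search engine.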

## Key findings

- For six human medical conditions, language models were often, but not always, more effective than GEO's search engine at finding relevant datasets.
- The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings.
- A web-based tool was developed to implement this methodology and is publicly available.

## Abstract

Millions of high-throughput molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge is first among the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing hundreds of thousands of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing the results of such queries is time-consuming and tedious, and the queries often miss relevant datasets.

We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO’s search engine. The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.
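The retrieval step described above can be illustrated with a minimal sketch. It assumes that embeddings have already been computed by some language model, and it ranks candidates by their mean cosine similarity to the known relevant ("seed") datasets. The aggregation by mean, the function names, the toy three-dimensional vectors, and the placeholder identifiers (`GSE_A`, etc.) are all illustrative assumptions and are not taken from the authors' code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_candidates(seed_embeddings, candidate_embeddings):
    """Rank candidate datasets by mean similarity to the seed datasets.

    seed_embeddings: list of vectors for datasets already known to be relevant.
    candidate_embeddings: dict mapping a dataset identifier to its vector.
    Averaging over seeds is an assumption made for this sketch.
    """
    scores = {}
    for series_id, embedding in candidate_embeddings.items():
        sims = [cosine_similarity(embedding, seed) for seed in seed_embeddings]
        scores[series_id] = sum(sims) / len(sims)
    # Highest mean similarity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, hand-written vectors standing in for real model embeddings.
seeds = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
candidates = {
    "GSE_A": [0.85, 0.15, 0.05],  # placeholder identifiers, not real series
    "GSE_B": [0.0, 0.1, 0.95],
    "GSE_C": [0.5, 0.5, 0.1],
}

ranking = rank_candidates(seeds, candidates)
for series_id, score in ranking:
    print(f"{series_id}\t{score:.3f}")
```

In practice the vectors would come from one of the evaluated language models applied to each dataset's description, and a researcher would review only the top-ranked candidates.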

Our analysis code and a Web-based tool that enables others to use our methodology are available from https://github.com/srp33/GEO_NLP and https://github.com/srp33/GEOfinder3.0, respectively.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12952291/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12952291/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/PMC12952291/full.md

---
Source: https://tomesphere.com/paper/PMC12952291