Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models
José A. Pardo, Alicia Gómez-Pascual, José T. Palma, Juan A. Botía

TL;DR
This paper introduces NeuroEmbed, a novel approach that uses ontology-augmented embedding models to improve the curation, indexing, and retrieval of neurodegenerative disease cohorts and samples from large omics datasets.
Contribution
NeuroEmbed combines ontology-based normalization, semantic embedding, and QA fine-tuning to enhance cohort discovery and metadata enrichment in neurodegeneration research.
Findings
Indexed 2,801 repositories and 150,924 samples using NeuroEmbed.
Normalized over 1,700 tissue labels into 326 ontology-aligned concepts.
Improved retrieval precision from 0.277 to 0.866 after fine-tuning.
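As a rough illustration of what collapsing many free-text tissue labels into a smaller set of ontology-aligned concepts can look like, here is a minimal sketch. It is not the paper's method, which relies on biomedical ontologies and clustering in the embedding space; this stand-in uses plain string cleaning plus fuzzy matching from the Python standard library. The concept IDs (`EX:0001` and so on), the synonym lists, and the function names `clean` and `normalize` are all hypothetical.

```python
import difflib
import re

# Hypothetical concept table: IDs and synonyms are illustrative only,
# not taken from the paper or from any real ontology release.
CONCEPTS = {
    "EX:0001": ["brain", "whole brain"],
    "EX:0002": ["frontal cortex", "frontal lobe cortex"],
    "EX:0003": ["blood", "whole blood", "peripheral blood"],
}

# Reverse index: cleaned synonym -> concept ID.
SYNONYM_TO_ID = {syn: cid for cid, syns in CONCEPTS.items() for syn in syns}


def clean(label):
    """Lowercase the label and turn runs of non-alphanumerics into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def normalize(label, cutoff=0.8):
    """Map a raw label to a concept ID, or None if nothing is close enough.

    Exact synonym matches are tried first; otherwise the closest synonym
    by difflib similarity is accepted when it scores at least `cutoff`.
    """
    key = clean(label)
    if key in SYNONYM_TO_ID:
        return SYNONYM_TO_ID[key]
    close = difflib.get_close_matches(key, list(SYNONYM_TO_ID), n=1, cutoff=cutoff)
    return SYNONYM_TO_ID[close[0]] if close else None


if __name__ == "__main__":
    for raw in ["Whole_Blood", "Frontal Cortex ", "frontal cortx", "liver"]:
        print(repr(raw), "->", normalize(raw))
```

Labels that return `None` would, in a semi-automated workflow, be the ones handed to a curator for manual review.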
Abstract
The growing volume of omics and clinical data generated for neurodegenerative diseases (NDs) requires new approaches to curation so that the data are ready to use in bioinformatics. NeuroEmbed is an approach for engineering semantically accurate embedding spaces that represent cohorts and samples. The NeuroEmbed method comprises four stages: (1) extraction of ND cohorts from public repositories; (2) semi-automated normalization and augmentation of cohort and sample metadata using biomedical ontologies and clustering on the embedding space; (3) automated generation of a natural language question-answering (QA) dataset for cohorts and samples based on randomized combinations of standardized metadata dimensions; and (4) fine-tuning of a domain-specific embedder to optimize queries. We illustrate the approach using the GEO repository and the PubMedBERT pretrained embedder.…
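Stage (3) of the abstract, generating QA pairs from randomized combinations of standardized metadata dimensions, can be sketched roughly as follows. This is an assumption-laden toy, not the authors' generator: the cohort record, the field names, the phrase templates in `PHRASES`, and the function `make_question` are all hypothetical, and the real question wording and dimensions are not given in the text above.

```python
import random

# Hypothetical standardized metadata for one cohort (illustrative values only).
cohort = {
    "id": "COHORT_A",
    "disease": "Alzheimer's disease",
    "tissue": "frontal cortex",
    "assay": "RNA-seq",
    "organism": "Homo sapiens",
}

# Hypothetical phrase template per metadata dimension.
PHRASES = {
    "disease": "a diagnosis of {}",
    "tissue": "{} tissue",
    "assay": "{} data",
    "organism": "{} samples",
}


def make_question(record, rng, min_dims=1, max_dims=3):
    """Build one QA pair from a random subset of the record's metadata dimensions."""
    dims = [d for d in PHRASES if d in record]
    k = rng.randint(min_dims, min(max_dims, len(dims)))
    chosen = rng.sample(dims, k)
    parts = [PHRASES[d].format(record[d]) for d in chosen]
    question = "Which cohorts include " + " and ".join(parts) + "?"
    # The answer is the cohort the question was generated from.
    return {"question": question, "answer": record["id"], "dimensions": sorted(chosen)}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is reproducible
    for _ in range(3):
        print(make_question(cohort, rng))
```

Pairs of this shape (a question plus the identifier of the matching cohort) are the kind of input typically used to fine-tune an embedder for retrieval, since each question has a known positive document.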
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Bioinformatics and Genomic Networks · Microbial Metabolic Engineering and Bioproduction
