# SciLinker: a large-scale text mining framework for mapping associations among biological entities

**Authors:** Dongyu Liu, Cora Ames, Shameer Khader, Franck Rapaport

PMC · DOI: 10.3389/frai.2025.1528562 · Frontiers in Artificial Intelligence · 2025-03-19

## TL;DR

SciLinker is a text mining tool that extracts and quantifies relationships between genes, diseases, and other biological entities from biomedical literature.

## Contribution

SciLinker introduces a novel framework for mapping associations among biological entities using co-occurrence analysis and relationship extraction from PubMed abstracts.

## Key findings

- Over 30 million association sentences were identified, including more than 11 million gene-disease co-occurrence sentences.
- SciLinker revealed more than 1.25 million unique gene-disease associations; osteoporosis serves as a case study for extracting specific gene-disease relationships.
- Clinically validated targets are enriched among SciLinker-derived disease-associated genes, which supports target identification, and the co-occurrence data can be used to construct disease-specific networks.

## Abstract

The biomedical literature is the primary source of information on relationships between biological entities, including genes, diseases, cell types, and drugs, but the rapid pace of publication makes exhaustive manual exploration impossible. To explore an up-to-date repository of millions of abstracts, we constructed an efficient, modular natural language processing pipeline and applied it to the entire PubMed abstract corpus.

We developed SciLinker using open-source libraries and pre-trained named entity recognition models to identify human genes, diseases, cell types, and drugs, normalizing these biological entities to the Unified Medical Language System (UMLS). We implemented a scoring scheme to quantify the statistical significance of entity co-occurrences and applied a fine-tuned PubMedBERT model for gene-disease relationship extraction.
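To make the idea of scoring co-occurrence significance concrete, the sketch below counts sentence-level gene-disease co-occurrences and scores each pair with a one-sided hypergeometric tail probability (equivalent to a one-sided Fisher's exact test). This summary does not state which statistic SciLinker actually uses, so the test choice, the function names `cooccurrence_pvalue` and `score_pairs`, and the placeholder entities (`GENE_A`, `DISEASE_X`, and so on) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import comb


def cooccurrence_pvalue(n_both, n_gene, n_disease, n_total):
    """P(X >= n_both) under a hypergeometric null.

    Assumed statistic (not taken from the paper): the chance of seeing at
    least n_both co-mentions if the gene and disease were independent.
    """
    upper = min(n_gene, n_disease)
    denom = comb(n_total, n_disease)
    tail = sum(
        comb(n_gene, k) * comb(n_total - n_gene, n_disease - k)
        for k in range(n_both, upper + 1)
    )
    return tail / denom


def score_pairs(sentences):
    """Score every gene-disease pair seen together in at least one sentence.

    `sentences` is a list of (genes, diseases) tuples, each a set of
    normalized entity identifiers found in one sentence.
    """
    gene_counts, disease_counts, pair_counts = Counter(), Counter(), Counter()
    for genes, diseases in sentences:
        gene_counts.update(genes)
        disease_counts.update(diseases)
        pair_counts.update(product(genes, diseases))
    n_total = len(sentences)
    return {
        (g, d): cooccurrence_pvalue(n, gene_counts[g], disease_counts[d], n_total)
        for (g, d), n in pair_counts.items()
    }


# Toy input with placeholder entities; not real SciLinker output.
toy_sentences = [
    ({"GENE_A"}, {"osteoporosis"}),
    ({"GENE_A"}, {"osteoporosis"}),
    ({"GENE_A", "GENE_B"}, {"osteoporosis"}),
    ({"GENE_C"}, {"DISEASE_X"}),
    ({"GENE_C"}, {"DISEASE_X"}),
    ({"GENE_C"}, {"osteoporosis"}),
    ({"GENE_B"}, set()),
    (set(), {"DISEASE_X"}),
]
scores = score_pairs(toy_sentences)
for pair, p in sorted(scores.items(), key=lambda item: item[1]):
    print(pair, round(p, 4))
```

In this toy corpus, `GENE_A` appears only in osteoporosis sentences and therefore receives a lower p-value than `GENE_C`, which co-occurs with osteoporosis once. At PubMed scale, a multiple-testing correction would normally be applied on top of the raw p-values.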

We identified and analyzed over 30 million association sentences, including more than 11 million gene-disease co-occurrence sentences, revealing more than 1.25 million unique gene-disease associations. We demonstrate SciLinker’s ability to extract specific gene-disease relationships using osteoporosis as a case study. We show that such an analysis benefits target identification, because clinically validated targets are enriched among SciLinker-derived disease-associated genes. Moreover, these co-occurrence data can be used to construct disease-specific networks, providing insights into significant relationships among biological entities in the scientific literature.
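One way to picture the disease-specific networks mentioned above is as a filtered neighborhood graph: keep only significant associations, collect the entities linked to the disease of interest, and retain the edges among them. The sketch below does this with plain dictionaries. The function name `disease_network`, the `max_p` cutoff of 0.05, the placeholder entities, and the p-values are all hypothetical and chosen only for illustration; the paper's actual construction may differ.

```python
from collections import defaultdict


def disease_network(associations, disease, max_p=0.05):
    """Build an undirected neighborhood graph around `disease`.

    `associations` is an iterable of (entity_a, entity_b, p_value).
    Returns {node: {neighbor: p_value}} restricted to the disease and
    the entities significantly associated with it.
    """
    significant = [(a, b, p) for a, b, p in associations if p <= max_p]
    # Entities directly linked to the disease by a significant edge.
    neighbors = {
        b if a == disease else a
        for a, b, _ in significant
        if disease in (a, b)
    }
    nodes = neighbors | {disease}
    graph = defaultdict(dict)
    for a, b, p in significant:
        if a in nodes and b in nodes:
            graph[a][b] = p
            graph[b][a] = p
    return dict(graph)


# Made-up associations with placeholder entities and p-values.
toy_associations = [
    ("GENE_A", "osteoporosis", 0.001),
    ("GENE_B", "osteoporosis", 0.01),
    ("GENE_A", "GENE_B", 0.02),
    ("GENE_C", "osteoporosis", 0.6),   # not significant, dropped
    ("GENE_C", "DISEASE_X", 0.001),    # significant, but outside the neighborhood
]
network = disease_network(toy_associations, "osteoporosis")
for node, edges in network.items():
    print(node, "->", sorted(edges))
```

The resulting dictionary can be passed to a graph library for layout or centrality analysis if one is available; plain dictionaries are used here only to keep the sketch dependency-free.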

SciLinker represents a novel text mining approach that extracts and quantifies associations between biomedical entities through co-occurrence analysis and relationship extraction from PubMed abstracts. Its modular design enables expansion to additional entities and text corpora, making it a versatile tool for transforming unstructured biomedical data into actionable insights for drug discovery.

## Linked entities

- **Diseases:** osteoporosis (MONDO:0005298)

## Full-text entities

- **Diseases:** osteoporosis (MESH:D010024)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11983328/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11983328/full.md

## References

50 references — full list in the complete paper: https://tomesphere.com/paper/PMC11983328/full.md

---
Source: https://tomesphere.com/paper/PMC11983328