Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG
Jaafer Klila, Sondes Bannour Souihi, Rahma Boujelben, Nasredine Semmar, and Lamia Hadrich Belguith

TL;DR
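This paper compares two ways of injecting structured biomedical knowledge from UMLS into language models: continual pretraining, which stores the knowledge in model parameters, and graph retrieval-augmented generation (GraphRAG), which retrieves it at inference time. Both are evaluated on biomedical language understanding tasks and question answering benchmarks.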
Contribution
It builds a large-scale UMLS-based knowledge graph, continually pretrains two models (BERTUMLS and BioBERTUMLS) on a corpus derived from that graph, and develops a GraphRAG pipeline that retrieves knowledge from the graph without retraining.
Findings
BERTUMLS outperforms BERT on knowledge-intensive tasks.
GraphRAG improves QA accuracy by roughly 3-5 points without retraining.
Effects vary depending on the base model's existing knowledge.
Abstract
The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) continual pretraining, which embeds knowledge into model parameters, and (ii) Graph Retrieval-Augmented Generation (GraphRAG), which consults a knowledge graph at inference time. We first construct a large-scale biomedical knowledge graph from UMLS (3.4 million concepts and 34.2 million relations), stored in Neo4j for efficient querying. We then derive a ~100-million-token textual corpus from this graph to continually pretrain two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT). We evaluate these models on six BLURB (Biomedical Language Understanding and Reasoning…
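The two strategies named in the abstract can be illustrated with a small sketch. It is not the authors' pipeline: the triples, the sentence template, the substring-based concept matching, and every function name below are hypothetical, and an in-memory dictionary stands in for the Neo4j graph. The same triples are used in two ways: verbalized into sentences (the kind of text a continual-pretraining corpus could contain) and retrieved at question time to build an augmented prompt (the GraphRAG idea).

```python
from collections import defaultdict

# Hypothetical toy triples (head, relation, tail); not taken from the paper's graph.
TRIPLES = [
    ("Aspirin", "may_treat", "Headache"),
    ("Aspirin", "is_a", "Nonsteroidal anti-inflammatory agent"),
    ("Headache", "is_a", "Pain"),
]

def verbalize(head, relation, tail):
    """Turn one triple into a plain sentence using an assumed template."""
    return f"{head} {relation.replace('_', ' ')} {tail}."

def build_corpus(triples):
    """Strategy (i): sentences that could feed continual pretraining."""
    return [verbalize(*t) for t in triples]

def build_index(triples):
    """Map each lowercased concept name to the triples it takes part in."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head.lower()].append((head, relation, tail))
        index[tail.lower()].append((head, relation, tail))
    return index

def retrieve_facts(question, index):
    """Strategy (ii): collect facts for concepts mentioned in the question.

    Concept matching here is naive substring search, an assumption made
    only to keep the sketch self-contained.
    """
    q = question.lower()
    facts = []
    for name, triples in index.items():
        if name in q:
            for t in triples:
                sentence = verbalize(*t)
                if sentence not in facts:
                    facts.append(sentence)
    return facts

def build_prompt(question, index):
    """Prepend retrieved facts to the question before calling a model."""
    context = "\n".join(f"- {fact}" for fact in retrieve_facts(question, index))
    return f"Known facts:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Calling `build_prompt("Does aspirin help with a headache?", build_index(TRIPLES))` returns a prompt whose context lists the facts touching "Aspirin" and "Headache". In a real system the index lookup would be replaced by a graph query and the prompt passed to a language model.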
