# Liver cancer knowledge graph construction based on dynamic entity replacement and masking strategies RoBERTa-wwm-large-BiLSTM-CRF model with clinical Chinese EMRs

**Authors:** Yichi Zhang, Xiaojun Hu, Hailing Wang, Ke Liu, Yongbin Gao, Xiaoyan Jiang, Yingfang Fan, Zhijun Fang

PMC · DOI: 10.3389/frai.2025.1663877 · 2025-10-17

## TL;DR

This paper introduces a new framework to build a liver cancer knowledge graph using real-world Chinese electronic medical records, improving entity recognition and integration of clinical data.

## Contribution

The study proposes a novel DERM-based NER model and constructs the first Chinese liver cancer knowledge graph from real-world clinical data.

## Key findings

- The proposed NER model achieved an F1 score of 93.96% on the RLC-EMRs dataset.
- The liver cancer knowledge graph contains 46,364 entities and 296,655 semantic relationships.
- A KG-based retrieval system was developed for querying clinical information like complications and medications.

## Abstract

Liver cancer is a leading cause of cancer-related mortality worldwide, necessitating advanced tools for diagnosis and management. Knowledge graphs (KGs) are crucial for advancing smart healthcare, but existing liver cancer-specific KGs are mostly derived from literature or public databases and lack integration with real-world clinical data such as electronic medical records (EMRs). Furthermore, there are currently no publicly available KGs specifically for liver cancer, leaving a significant gap in structured clinical knowledge resources.

This study proposes a novel framework to construct the first Chinese liver cancer KG from Real-World Liver Cancer Electronic Medical Records (RLC-EMRs). A new named entity recognition (NER) model, DERM-RoBERTa-wwm-large-BiLSTM-CRF, was developed; it uses a Dynamic Entity Replacement and Masking (DERM) strategy to address data scarcity. Knowledge fusion was performed using the TF-IDF algorithm to standardize and integrate entities from clinical records, the professional medical website www.XYWY.com, and the CCMT-2019 terminology standard.
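The abstract names TF-IDF as the knowledge fusion algorithm but does not describe how it is applied. The following is a minimal sketch of one plausible reading, in which a mention extracted from an EMR is mapped to the closest term in a standard vocabulary by cosine similarity over character-bigram TF-IDF vectors. The function names, the bigram tokenization, the smoothed IDF formula, the threshold, and the example terms are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split a term into overlapping character n-grams (assumed tokenization)."""
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_idf(terms, n=2):
    """Smoothed inverse document frequency, treating each standard term as a document."""
    df = Counter()
    for term in terms:
        df.update(set(char_ngrams(term, n)))
    total = len(terms)
    return {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(term, idf, n=2):
    """TF-IDF weights for one term; n-grams absent from the vocabulary are dropped."""
    tf = Counter(char_ngrams(term, n))
    return {g: count * idf[g] for g, count in tf.items() if g in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def normalize(mention, standard_terms, threshold=0.5):
    """Return (best standard term, score), or (None, score) if below the threshold."""
    idf = build_idf(standard_terms)
    query = tfidf_vector(mention, idf)
    best_term, best_score = None, 0.0
    for term in standard_terms:
        score = cosine(query, tfidf_vector(term, idf))
        if score > best_score:
            best_term, best_score = term, score
    if best_score < threshold:
        return None, best_score
    return best_term, best_score


# Hypothetical standard vocabulary, for illustration only.
STANDARD_TERMS = ["原发性肝癌", "肝硬化", "乙型病毒性肝炎"]
```

Calling `normalize("肝癌", STANDARD_TERMS, threshold=0.3)` maps the short mention to `原发性肝癌`, while a mention that shares no bigram with the vocabulary returns `None` and would be kept as a new entity. A production pipeline would likely use a tuned threshold and a library vectorizer rather than this hand-rolled version.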

The final constructed liver cancer KG contained 46,364 entities and 296,655 semantic relationships. The proposed NER model achieved a state-of-the-art F1 score of 68.84% on the public CMeEE-v2 dataset. On the proprietary RLC-EMRs dataset, the model demonstrated high effectiveness with a precision of 93.23%, recall of 94.69%, and an F1 score of 93.96%. In addition, a KG-based retrieval system was successfully developed to query for complications, medications, and other related information.
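As a quick consistency check on the RLC-EMRs figures, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported RLC-EMRs precision and recall, in percent.
f1 = f1_score(93.23, 94.69)
print(f"{f1:.2f}")
```

The result is about 93.95, which agrees with the reported 93.96% to within the rounding of the published precision and recall.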

The findings demonstrated the effectiveness of the proposed framework in constructing a comprehensive and clinically relevant liver cancer KG. The novel DERM-based NER model significantly improved entity extraction from complex medical texts. By integrating real-world clinical data, this study addresses a critical gap in existing liver cancer-specific KGs, which are mostly derived from literature or public databases.
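The abstract does not spell out how DERM works beyond its name (Dynamic Entity Replacement and Masking) and its purpose (addressing data scarcity). The sketch below shows one generic way such an augmentation step could look for character-level BIO tagging: each annotated entity is either swapped for another entity of the same type, replaced by mask tokens, or left unchanged, and the labels are rebuilt to match. The probabilities, the `entity_bank` structure, the function name, and the idea of re-sampling on every call are assumptions, not the authors' implementation.

```python
import random


def derm_augment(tokens, spans, entity_bank, rng,
                 replace_prob=0.5, mask_prob=0.15, mask_token="[MASK]"):
    """Return an augmented (tokens, BIO labels) pair.

    tokens:      list of characters in the sentence
    spans:       non-overlapping, non-empty (start, end, type) entity spans, end exclusive
    entity_bank: hypothetical dict mapping entity type -> list of non-empty surface strings
    rng:         random.Random instance; calling again re-samples, hence "dynamic"
    """
    out_tokens, out_labels = [], []
    cursor = 0
    for start, end, etype in sorted(spans):
        # Copy the non-entity text before this span unchanged.
        out_tokens.extend(tokens[cursor:start])
        out_labels.extend(["O"] * (start - cursor))

        surface = tokens[start:end]
        draw = rng.random()
        if draw < replace_prob and entity_bank.get(etype):
            # Replacement: substitute another entity of the same type.
            surface = list(rng.choice(entity_bank[etype]))
        elif draw < replace_prob + mask_prob:
            # Masking: hide the entity characters but keep the span length.
            surface = [mask_token] * len(surface)

        out_tokens.extend(surface)
        out_labels.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(surface) - 1))
        cursor = end

    # Copy any trailing non-entity text.
    out_tokens.extend(tokens[cursor:])
    out_labels.extend(["O"] * (len(tokens) - cursor))
    return out_tokens, out_labels
```

Applied once per training epoch with a fresh random draw, a step like this would expose the tagger to varied entity surfaces in the same clinical contexts, which is one common way to stretch a small annotated corpus.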

## Linked entities

- **Diseases:** liver cancer (MONDO:0002691)

## Full-text entities

- **Diseases:** cancer (MESH:D009369), Liver Cancer (MESH:D006528)

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12575302/full.md

---
Source: https://tomesphere.com/paper/PMC12575302