Knowledge-Guided Explainable Recommendation Tool for Cancer Risk Prediction Models Using Retrieval-Augmented Large Language Models: Development and Validation Study

Shumin Ren; Xin Zheng; Jing Zhao; Jiale Du; Yuxin Zhang; Cheng Bi; Jie Song; Jinyi Zhang; Hongmei Lang; Fan Zhang; Bairong Shen

PMC · DOI: 10.2196/78519 · March 9, 2026

Open Access

TL;DR

CanRisk-RAG is a new system that helps find cancer risk prediction models more accurately and transparently than existing tools.

Contribution

Development of a retrieval-augmented, knowledge-guided system for recommending cancer risk prediction models using LLMs and structured metadata.

Findings

1. CanRisk-RAG outperformed baseline tools in relevance and reliability scores for cancer risk model queries.

2. The system provides structured, accurate recommendations based on validated evidence and multifactor ranking.

3. Experts rated CanRisk-RAG higher than PubMed, ChatGPT-4o, ScholarAI, and Gemini 1.5 Flash.

Abstract

Cancer risk prediction models are vital for precision prevention, enabling individualized assessment of cancer susceptibility based on genetic, clinical, environmental, and lifestyle factors. However, the practical use of these models is hindered by fragmented resources, heterogeneous reporting, and the absence of transparent, structured systems for systematic discovery and comparison. This study aimed to develop a retrieval-augmented, knowledge-guided system that provides accurate recommendations for cancer risk prediction models. We developed CanRisk-RAG, a recommendation platform underpinned by a precisely constructed knowledge base comprising more than 800 peer-reviewed cancer risk prediction models spanning diverse cancer types, modeling approaches, and predictive variables. The system integrates (1) large language model (LLM)–based semantic tag extraction, (2) embedding…
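The retrieve-then-rank idea behind such a system can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract is cut off before the pipeline is fully described, so the bag-of-words "embedding", the scoring weights, the `ModelRecord` fields, and every entry in the toy knowledge base are hypothetical placeholders chosen only for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class ModelRecord:
    """Hypothetical knowledge-base entry; field names are assumptions."""
    name: str                   # placeholder label, not a real published model
    tags: list                  # semantic tags (the paper extracts these with an LLM)
    externally_validated: bool  # assumed ranking factor
    auc: float                  # made-up discrimination value, for illustration only


def embed(tokens):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(t.lower() for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_score(similarity, rec, w_sim=0.6, w_val=0.2, w_auc=0.2):
    """Weighted sum of several factors; the weights are arbitrary assumptions."""
    return (w_sim * similarity
            + w_val * float(rec.externally_validated)
            + w_auc * rec.auc)


def recommend(query_tags, kb, k=2):
    """Retrieve records sharing at least one tag, then rank them by combined score."""
    q = embed(query_tags)
    scored = [(rec, cosine(q, embed(rec.tags))) for rec in kb]
    candidates = [(rec, sim) for rec, sim in scored if sim > 0]
    ranked = sorted(candidates,
                    key=lambda pair: combined_score(pair[1], pair[0]),
                    reverse=True)
    return [rec.name for rec, _ in ranked[:k]]


# Toy knowledge base with invented entries.
KB = [
    ModelRecord("model_A", ["breast", "genetic", "logistic"], True, 0.70),
    ModelRecord("model_B", ["breast", "lifestyle"], False, 0.65),
    ModelRecord("model_C", ["lung", "smoking", "cox"], True, 0.80),
]

top = recommend(["breast", "genetic"], KB)
print(top)
```

In this sketch the similarity filter runs before ranking, so a record with strong validation evidence but no topical overlap (here `model_C`) never reaches the ranked list. A real system would presumably replace `embed` with a dense text embedding model and pass the ranked records to an LLM as context for generating the explanation.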

Figures (4)


Taxonomy

Topics: Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education