Extracting Knowledge From Scientific Texts on Patient-Derived Cancer Models Using Large Language Models: Algorithm Development and Validation Study

Jiarui Yao; Zinaida Perova; Tushar Mandloi; Elizabeth Lewis; Helen Parkinson; Guergana Savova

PMC · DOI:10.2196/70706 · June 30, 2025


Open Access

TL;DR

This study explores how large language models can extract information about patient-derived cancer models from scientific texts, comparing different prompting techniques.

Contribution

The study introduces and evaluates soft prompting as a novel method for entity extraction from scientific literature on cancer models.

Findings

01

GPT-4o with direct prompting achieved F1-scores of 50.48 (exact match) and 71.36 (overlapping match).

02

LLaMA3 with soft prompting outperformed direct prompting, reaching F1-scores of 46.68 (exact) and 71.80 (overlapping).

03

Soft prompting on open models can approach (exact match) or match (overlapping match) the performance of proprietary large language models.
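The findings above report F1 under two matching regimes. The sketch below illustrates one common way to score entity spans under exact and overlapping matching. It is not the authors' evaluation code: the span representation, the greedy one-to-one matching, the requirement that entity types agree, and the entity-type names are all assumptions made for illustration. It returns F1 on a 0 to 1 scale, whereas the page reports scores on a 0 to 100 scale.

```python
from typing import List, Tuple

# Hypothetical span format: (start_offset, end_offset_exclusive, entity_type)
Span = Tuple[int, int, str]


def _matches(pred: Span, gold: Span, mode: str) -> bool:
    """Return True if a predicted span counts as a hit for a gold span."""
    if pred[2] != gold[2]:  # assumption: entity types must agree
        return False
    if mode == "exact":
        return pred[0] == gold[0] and pred[1] == gold[1]
    # Overlapping match: the two character ranges share at least one position.
    return pred[0] < gold[1] and gold[0] < pred[1]


def span_f1(preds: List[Span], golds: List[Span], mode: str = "exact") -> float:
    """Span-level F1 with greedy one-to-one matching of predictions to gold."""
    used_gold = set()
    true_positives = 0
    for pred in preds:
        for i, gold in enumerate(golds):
            if i not in used_gold and _matches(pred, gold, mode):
                used_gold.add(i)
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up spans: the second prediction overlaps its gold span but has a
# different start offset, so it only counts under overlapping matching.
golds = [(0, 5, "model_type"), (10, 20, "cancer_type")]
preds = [(0, 5, "model_type"), (12, 20, "cancer_type")]
exact_f1 = span_f1(preds, golds, mode="exact")
overlap_f1 = span_f1(preds, golds, mode="overlap")
```

Because overlapping matching accepts partially correct boundaries, it can never score lower than exact matching on the same predictions, which is consistent with the gap between the two reported figures.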

Abstract

Patient-derived cancer models (PDCMs) have become essential tools in cancer research and preclinical studies. Consequently, the number of publications on PDCMs has increased significantly over the past decade. Advances in artificial intelligence, particularly in large language models (LLMs), offer promising solutions for extracting knowledge from scientific literature at scale. This study aims to investigate LLM-based systems, focusing specifically on prompting techniques for the automated extraction of PDCM-related entities from scientific texts. We explore 2 LLM-prompting approaches. The classic method, direct prompting, involves manually designing a prompt. Our direct prompt consists of an instruction, entity-type definitions, gold examples, and a query. In addition, we experiment with a novel and underexplored prompting strategy—soft prompting. Unlike direct prompting, soft…
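The abstract lists four components of the direct prompt: an instruction, entity-type definitions, gold examples, and a query. As a minimal sketch of how such a prompt might be assembled, the function below concatenates those parts in that order. The function name, the text layout, the entity-type names, and the sample sentences are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical gold example format: (text, [(entity_type, mention), ...])
GoldExample = Tuple[str, List[Tuple[str, str]]]


def build_direct_prompt(
    instruction: str,
    entity_definitions: Dict[str, str],
    gold_examples: List[GoldExample],
    query: str,
) -> str:
    """Assemble instruction, definitions, gold examples, and query into one prompt."""
    lines = [instruction.strip(), "", "Entity types:"]
    for name, definition in entity_definitions.items():
        lines.append(f"- {name}: {definition}")
    lines += ["", "Examples:"]
    for text, entities in gold_examples:
        lines.append(f"Text: {text}")
        formatted = "; ".join(f"{etype}={mention}" for etype, mention in entities)
        lines.append(f"Entities: {formatted}")
    # The query goes last, with an open "Entities:" slot for the model to fill.
    lines += ["", f"Text: {query}", "Entities:"]
    return "\n".join(lines)


# Illustrative inputs only; none of these strings come from the study.
prompt = build_direct_prompt(
    instruction="Extract all entities of the listed types from the text.",
    entity_definitions={"model_type": "The kind of patient-derived model used."},
    gold_examples=[("We built a xenograft model.", [("model_type", "xenograft")])],
    query="Organoids were derived from tumor tissue.",
)
```

The resulting string would then be sent to the model as a single user message; how the study actually formats or orders these parts is not stated in the text above.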

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species (1)

Homo sapiens (human · species)

Diseases (1)

cancer



Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Biomedical Text Mining and Ontologies