Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization

Ebrahim Rasromani; Stella K. Kang; Yanqi Xu; Beisong Liu; Garvit Luhadia; Wan Fung Chui; Felicia L. Pasadyn; Yu Chih Hung; Julie Y. An; Edwin Mathieu; Zehui Gu; Carlos Fernandez-Granda; Ammar A. Javed; Greg D. Sacks; Tamas Gonda; Chenchan Huang; Yiqiu Shen

arXiv:2507.19973 · cs.AI · July 29, 2025

TL;DR

This study develops and evaluates fine-tuned open-source large language models that automatically extract pancreatic cystic lesion (PCL) features from radiology reports and assign guideline-based risk categories. The models perform comparably to GPT-4o, which could make large-scale PCL studies practical.

Contribution

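The paper fine-tunes two open-source LLMs with QLoRA on chain-of-thought (CoT) data generated by GPT-4o, so that the models extract PCL and main pancreatic duct features from MRI/CT reports. The extracted features are then mapped to risk categories under an institutional guideline based on the 2017 ACR White Paper.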

Findings

1. Fine-tuned models achieved over 97% feature extraction accuracy.

2. Risk categorization F1 scores exceeded 0.94, comparable to GPT-4o.

3. Radiologist agreement levels were maintained, demonstrating clinical reliability.
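
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how exact-match accuracy and an F1 score can be computed from paired reference and predicted labels. The averaging scheme (macro) and the category names `A`, `B`, `C` are assumptions made for illustration; this summary does not state which F1 variant the paper reports, and the toy labels are not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels with hypothetical category names, not data from the paper.
reference = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "C", "C", "C"]
print(f"accuracy = {accuracy(reference, predicted):.3f}")
print(f"macro F1 = {macro_f1(reference, predicted):.3f}")
```

Macro averaging weights every category equally, which matters when some risk categories are rare; a micro or weighted average would give different numbers on imbalanced data.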

Abstract

Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were…
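
The abstract describes a two-stage design: an LLM extracts structured features, and a separate rule-based step maps those features to a risk category. The sketch below illustrates only the shape of that second step. Everything in it is hypothetical: the `CystFeatures` fields, the category names, and especially the thresholds are placeholders chosen for illustration and are not taken from the paper, the institutional guideline, or the 2017 ACR White Paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CystFeatures:
    """Hypothetical schema for features an LLM might extract from one report."""
    size_mm: Optional[float] = None
    mural_nodule: Optional[bool] = None
    main_duct_mm: Optional[float] = None


Rule = Tuple[str, Callable[[CystFeatures], bool]]

# PLACEHOLDER rules, checked in order (first match wins).
# The thresholds below are invented for illustration and have no clinical meaning.
PLACEHOLDER_RULES: List[Rule] = [
    ("category_3", lambda f: bool(f.mural_nodule)),
    ("category_3", lambda f: f.main_duct_mm is not None and f.main_duct_mm >= 10.0),
    ("category_2", lambda f: f.size_mm is not None and f.size_mm >= 25.0),
    ("category_1", lambda f: f.size_mm is not None),
]


def categorize(features: CystFeatures,
               rules: List[Rule] = PLACEHOLDER_RULES,
               default: str = "indeterminate") -> str:
    """Return the name of the first rule whose predicate matches the features."""
    for name, predicate in rules:
        if predicate(features):
            return name
    return default


print(categorize(CystFeatures(size_mm=10.0)))
print(categorize(CystFeatures(size_mm=5.0, mural_nodule=True)))
print(categorize(CystFeatures()))
```

Keeping the mapping as explicit, ordered rules rather than asking the model for a category directly means each assigned category can be traced back to the specific extracted feature that triggered it, which is one plausible reading of the "interpretable" in the paper's title.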
