Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports
Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, Evan Calabrese

TL;DR
This study develops an automated system that uses large language models and retrieval augmented generation to extract structured clinical data from unstructured radiology and pathology reports, with the best models exceeding 98% accuracy on radiology reports and 90% on pathology reports.
Contribution
It introduces a systematic evaluation of LLM and RAG configurations for clinical data extraction, highlighting the importance of model choice, prompt engineering, and semi-automated optimization.
Findings
The best-performing models achieved over 98% accuracy in extracting BT-RADS scores from radiology reports.
Models achieved over 90% accuracy in extracting IDH mutation status from pathology reports.
Domain fine-tuned models outperform older and smaller models.
Abstract
Purpose: To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weights large language models (LMs) and retrieval augmented generation (RAG), and to assess the effects of model configuration variables on extraction performance. Methods and Materials: The study utilized two datasets: 7,294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2,154 pathology reports annotated for isocitrate dehydrogenase (IDH) mutation status. An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations. The impact of model size, quantization, prompting strategies, output formatting, and inference parameters was systematically evaluated. Results: The best performing models achieved over 98% accuracy in extracting BT-RADS scores…
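The benchmarking idea in the abstract (run every configuration over the annotated reports and compare accuracy) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the toy report texts, the `stub_extract` function, and the `GRID` keys (`prompt_style`, `temperature`) are all hypothetical, and a regular expression stands in for the language model call so the example runs offline.

```python
import re
from itertools import product

# Hypothetical toy data: (report text, gold BT-RADS score). Not from the study.
REPORTS = [
    ("Findings are stable. BT-RADS score: 2.", "2"),
    ("Worsening enhancement. BT-RADS score: 3b.", "3b"),
    ("Improved imaging findings. BT-RADS 1a.", "1a"),
]

# Hypothetical configuration grid; the study varied model size, quantization,
# prompting, output formatting, and inference parameters.
GRID = {
    "prompt_style": ["strict", "lenient"],
    "temperature": [0.0, 0.7],
}


def stub_extract(report: str, config: dict) -> str:
    """Stand-in for an LM call. Only prompt_style changes its behaviour;
    temperature is ignored because this stub is deterministic."""
    if config["prompt_style"] == "strict":
        pattern = r"BT-RADS score:\s*([0-4][a-c]?)"
    else:
        pattern = r"BT-RADS(?: score:)?\s*([0-4][a-c]?)"
    match = re.search(pattern, report)
    return match.group(1) if match else "unknown"


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def benchmark(extract, reports, grid):
    """Evaluate every configuration in the grid, best accuracy first."""
    keys = sorted(grid)
    gold = [label for _, label in reports]
    results = []
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        preds = [extract(text, config) for text, _ in reports]
        results.append((config, accuracy(preds, gold)))
    return sorted(results, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    for config, acc in benchmark(stub_extract, REPORTS, GRID):
        print(config, f"accuracy={acc:.2f}")
```

In a real setting, `stub_extract` would be replaced by a call to the model under test, and exact-match accuracy could be swapped for whatever metric suits the annotation scheme.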
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies
Methods: Attention Is All You Need · Attention Dropout · WordPiece · Dense Connections · Residual Connection · Linear Layer · Multi-Head Attention · Linear Warmup With Linear Decay · Adam
