Extracting structured data from unstructured breast imaging reports with transformer-based models
Mikel Carrilero-Mardones, Jorge Pérez-Martín, Francisco Javier Díez, Iñigo Bermejo Delgado

TL;DR
This paper compares transformer-based models for converting unstructured breast imaging reports into structured data, finding BioGPT to be the most effective.
Contribution
The study applies generative models such as BioGPT to multi-task extraction from medical reports, a novel approach relative to the BERT-based models traditionally used.
Findings
BioGPT outperformed the BERT-based models on the classification tasks, reaching 96.10% accuracy and a macro F1 score of 90.30%.
BioGPT could perform classification and extractive question answering simultaneously, a unique capability.
Generative models show potential for efficient clinical data curation and integration into research workflows.
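The findings report both accuracy and macro F1, and the two can diverge when categories are imbalanced. A minimal sketch of how each is computed, assuming single-label predictions for one categorical variable, follows. The labels are hypothetical BI-RADS-style values invented for illustration, not data from the paper, and the summary does not say how the authors aggregated scores across the 19 variables.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for a single variable (illustrative only).
y_true = ["2", "2", "2", "2", "4", "5"]
y_pred = ["2", "2", "2", "2", "4", "4"]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={mf1:.4f}")
```

In this toy case a single error on the rarest class leaves accuracy high while macro F1 drops sharply, which is why reporting both gives a fuller picture of classifier quality.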
Abstract
Structured clinical data is essential for research and informed decision-making, yet medical reports are frequently stored as unstructured free text. This study compared the performance of BERT-based and generative language models in converting unstructured breast imaging reports into structured, tabular data suitable for clinical and research applications. A dataset of 286 anonymised breast imaging reports in Spanish was translated into English and used to evaluate five transformer-based models pre-trained on medical data: BlueBERT, BioBERT, BioMedBERT, BioGPT and ClinicalT5. Two natural language processing approaches were explored: classification of 19 categorical variables (e.g. diagnostic technique, report type, family history, BI-RADS category, tumour shape and margin) and extractive question answering of four entities (patient age, patient history, parenchymal distortion or…
Taxonomy
Topics: AI in cancer detection · Radiomics and Machine Learning in Medical Imaging · Machine Learning in Healthcare
