# Agent-based large language model system for extracting structured data from breast cancer synoptic reports: a dual-validation study

**Authors:** Steven N Hart, Teya S Bergamaschi

PMC · DOI: 10.1093/jamiaopen/ooag016 · JAMIA Open · 2026-02-17

## TL;DR

This study developed and tested an agent-based system that uses large language models to extract structured data from breast cancer synoptic pathology reports. Real-world field extraction recall (61.8%-87.7%) fell well below accuracy on synthetic test cases (93.8%-99.0%).

## Contribution

A dual-validation approach using synthetic and real-world data to assess LLM performance in extracting structured data from pathology reports.

## Key findings

- Synthetic validation showed high accuracy (93.8%-99.0%), but real-world recall was much lower (61.8%-87.7%).
- The gemini-2.5-pro model achieved the highest real-world recall at 87.7%.
- Model size did not predict real-world recall: the 14B-parameter deepseek-r1 (77.6%) outperformed its 70B-parameter counterpart (70.4%).
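
The "reality gap" can be reproduced from the reported ranges with simple arithmetic. The sketch below is illustrative only. It assumes the gap is obtained by pairing the highest synthetic accuracy with the highest real-world recall, and the lowest with the lowest. That pairing is consistent with the 11-32 percentage point range in the abstract, but the summary does not give per-model synthetic scores, so it remains an assumption. The function names are hypothetical.

```python
# Figures reported in the abstract, in percent: (lowest, highest) across models.
SYNTHETIC_ACCURACY = (93.8, 99.0)
REAL_WORLD_RECALL = (61.8, 87.7)


def gap_points(synthetic_pct: float, real_pct: float) -> float:
    """Drop from synthetic to real-world performance, in percentage points."""
    return round(synthetic_pct - real_pct, 1)


def missed_pct(recall_pct: float) -> float:
    """Share of annotated fields not recovered, given recall in percent."""
    return round(100 - recall_pct, 1)


# Assumption: pair highest with highest and lowest with lowest.
smallest_gap = gap_points(SYNTHETIC_ACCURACY[1], REAL_WORLD_RECALL[1])
largest_gap = gap_points(SYNTHETIC_ACCURACY[0], REAL_WORLD_RECALL[0])

print(f"gap: {smallest_gap} to {largest_gap} percentage points")
print(f"missed fields: {missed_pct(REAL_WORLD_RECALL[1])}% to "
      f"{missed_pct(REAL_WORLD_RECALL[0])}%")
```

Note that the comparison sets synthetic accuracy against real-world recall, as the abstract does; the two are different metrics, so the gap is an indicative figure rather than a like-for-like difference.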

## Abstract

This study aimed to develop and validate an agent-based large language model (LLM) system for extracting structured data from breast cancer synoptic pathology reports, and to assess the performance gap between synthetic and real-world validation.

We developed a modular artificial intelligence (AI) agent-based framework employing sequential specialized LLMs for parsing pathology reports and extracting structured data. We normalized College of American Pathologists (CAP) cancer protocols into 8 sections, 86 subsections, and 229 discrete fields. Seven leading LLMs (gemini-2.5-pro, llama3.3-70b, phi4-14b, deepseek-r1 14B/70B, gemma3-27b, gemini-2.0-flash-lite) were validated using dual evaluation: synthetic validation (864 controlled test cases) and real-world ground truth (6651 annotated fields from 90 pathology reports).
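
To make the real-world metric concrete, here is a minimal sketch of field-level recall against annotated ground truth. It is not the authors' evaluation code. It assumes each report is flattened into `(section, subsection, field)` keys, and that a field counts as recovered when the extracted value matches the annotation after case and whitespace normalization. The matching rule, the function names, and the example field names are all hypothetical and are not taken from the CAP protocols.

```python
from typing import Dict, Tuple

# Hypothetical flat key: (section, subsection, field).
FieldKey = Tuple[str, str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as misses."""
    return " ".join(value.lower().split())


def field_recall(ground_truth: Dict[FieldKey, str],
                 extracted: Dict[FieldKey, str]) -> float:
    """Fraction of annotated fields that were extracted with a matching value."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one annotated field")
    hits = sum(
        1
        for key, expected in ground_truth.items()
        if key in extracted and normalize(extracted[key]) == normalize(expected)
    )
    return hits / len(ground_truth)


# Invented example data; the field names are placeholders.
truth = {
    ("Specimen", "Procedure", "Procedure"): "Excision",
    ("Tumor", "Histologic Type", "Histologic Type"): "Invasive carcinoma",
    ("Tumor", "Tumor Size", "Greatest Dimension (mm)"): "14",
    ("Margins", "Margin Status", "Status"): "Negative",
}
model_output = {
    ("Specimen", "Procedure", "Procedure"): "excision",
    ("Tumor", "Histologic Type", "Histologic Type"): "Invasive  carcinoma",
    ("Tumor", "Tumor Size", "Greatest Dimension (mm)"): "14",
    # The margin status field is missing, so it counts against recall.
}

recall = field_recall(truth, model_output)
print(f"recall = {recall:.2f}")
```

Recall here penalizes only annotated fields that are absent or wrong in the output; fields the model produces that were never annotated would need a separate precision measure.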

Synthetic validation demonstrated strong performance (accuracy: 93.8%-99.0%). Real-world evaluation revealed field extraction recall ranging from 61.8% to 87.7%, demonstrating a substantial “reality gap” with performance drops of 11-32 percentage points. The gemini-2.5-pro model achieved the highest real-world recall (87.7%). Model size did not predict performance: the 14B-parameter deepseek-r1 (77.6%) outperformed its 70B-parameter counterpart (70.4%).

The substantial performance degradation from synthetic to real-world data underscores the complexity of authentic clinical documentation. Smaller models can achieve competitive or superior recall, reducing computational costs. Because the models missed 12%-38% of annotated fields, and even the best performer missed about 12%, mandatory human verification is essential for clinical deployment.

While LLM-based extraction systems show promise for pathology data extraction, synthetic validation alone provides false confidence. Rigorous real-world ground truth evaluation with expert annotation is essential before clinical deployment. These systems are best positioned as screening tools with mandatory human oversight rather than autonomous decision-making systems.

## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Diseases:** Phyllodes tumors (MESH:D003557), DCIS (MESH:D002285), cancer (MESH:D009369), metastasis (MESH:D009362), Breast cancer (MESH:D001943), infiltrating ductal carcinoma (MESH:D044584), invasive carcinoma (MESH:D009361)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12932940/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12932940/full.md

## References

10 references — full list in the complete paper: https://tomesphere.com/paper/PMC12932940/full.md

---
Source: https://tomesphere.com/paper/PMC12932940