Who Fails Where? LLM and Human Error Patterns in Endometriosis Ultrasound Report Extraction

Haiyi Li; Yutong Li; Yiheng Chi; Alison Deslandes; Mathew Leonardi; Shay Freger; Yuan Zhang; Jodie Avery; M. Louise Hull; and Hsiang-Ting Chen

arXiv:2601.09053 · cs.HC · January 27, 2026

Open Access

TL;DR

This study evaluates locally deployed large language models (LLMs) for converting unstructured endometriosis ultrasound reports into structured data. It finds that LLM and human error patterns are complementary, which supports human-AI collaboration in clinical workflows.

Contribution

The paper shows that a 20B-parameter LLM substantially outperforms 7B/8B models at structuring clinical reports, and that LLMs and human experts play complementary roles in medical data extraction.

Findings

01. The 20B-parameter LLM achieved a mean accuracy of 86.02%.

02. LLMs excel at syntactic consistency tasks such as date and numeric formatting.

03. Human experts outperform LLMs in semantic and contextual interpretation.

Abstract

In this study, we evaluate a locally deployed large language model (LLM) to convert unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (7B/8B and a 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming smaller models and confirming the importance of scale in handling complex clinical text. Crucially, we identified a highly complementary error profile: the LLM excelled at syntactic consistency (e.g., date/numeric formatting) where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors were fundamental limitations that could not be mitigated by simple prompt engineering. These findings strongly support a…
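The abstract reports a mean accuracy across reports but this excerpt does not say how it was scored. As a minimal sketch, assuming exact-match scoring per extracted field and an unweighted mean over reports, the metric could be computed as below. The field names and values are hypothetical illustrations, not taken from the paper.

```python
from statistics import mean


def report_accuracy(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields whose extracted value matches exactly.

    Assumption: exact string match per field; the paper's scoring rule
    is not stated in this excerpt.
    """
    if not reference:
        raise ValueError("reference must contain at least one field")
    correct = sum(
        1
        for field, value in reference.items()
        if field in predicted and predicted[field] == value
    )
    return correct / len(reference)


def mean_accuracy(pairs) -> float:
    """Unweighted mean of per-report accuracy over (predicted, reference) pairs."""
    return mean(report_accuracy(pred, ref) for pred, ref in pairs)


# Hypothetical toy report: field names and values are illustrative only.
reference = {
    "scan_date": "2024-03-01",
    "uterus_position": "anteverted",
    "left_ovary_mobile": "yes",
    "endometrioma_mm": "23",
}
# One field disagrees with the reference (left_ovary_mobile).
predicted = dict(reference, left_ovary_mobile="no")

per_report = report_accuracy(predicted, reference)
overall = mean_accuracy([(predicted, reference), (reference, reference)])
```

Scoring each report separately before averaging keeps a report with many fields from dominating the result; pooling all fields into one ratio would be an equally plausible alternative.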

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Radiology practices and education