Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records

Judit M Wulcan; Kevin L Jacques; Mary Ann Lee; Samantha L Kovacs; Nicole Dausend; Lauren E Prince; Jonatan Wulcan; Sina Marsilio; Stefan M Keller

arXiv:2409.13727 · cs.CL · January 29, 2025

TL;DR

This study evaluates GPT-4 omni's ability to extract clinical signs from veterinary EHRs. It reports high accuracy and reproducibility, better performance than GPT-3.5 Turbo, and stable results across temperature settings.

Contribution

It provides a comprehensive comparison of GPT-4 omni and GPT-3.5 Turbo for veterinary EHR extraction, highlighting GPT-4 omni's superior performance and stability regardless of temperature adjustments.

Findings

01

At temperature 0, GPT-4 omni achieved 96.9% sensitivity and 97.6% specificity relative to the majority opinion of human respondents.

02

GPT-4 omni outperformed GPT-3.5 Turbo, especially in sensitivity.

03

Reproducibility of GPT-4 omni was higher than human interobserver agreement.

Abstract

Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of temperature settings, and the influence of text ambiguity have not previously been evaluated. This study addresses these gaps by comparing the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and by investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with feline chronic enteropathy in 250 EHRs from a veterinary referral hospital. At temperature 0, GPT-4o, compared with the majority opinion of human respondents, achieved 96.9% sensitivity (interquartile range [IQR] 92.9-99.3%), 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5%…
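The evaluation described in the abstract, scoring model output against the majority opinion of five human respondents, can be illustrated with a short sketch. This is not the authors' code (the official repository listed under Code & Models is the authoritative source). The function names `majority_vote` and `classification_metrics` are hypothetical, the labels are assumed to be binary per record and per clinical sign, and the example votes are made up for illustration.

```python
from collections import Counter
from math import nan


def majority_vote(votes):
    """Return the most common label among the raters.

    Assumes binary labels and an odd number of raters (five in the study),
    so a tie cannot occur.
    """
    return Counter(votes).most_common(1)[0][0]


def classification_metrics(predicted, reference):
    """Compute sensitivity, specificity, PPV and NPV for binary labels."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a class is absent.
        return numerator / denominator if denominator else nan

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical data: one row per record, five human votes on whether
    # a given clinical sign is present.
    human_votes = [
        [True, True, True, False, True],
        [False, False, True, False, False],
        [True, True, False, True, False],
        [False, False, False, False, False],
    ]
    model_labels = [True, True, True, False]  # hypothetical model output

    reference = [majority_vote(v) for v in human_votes]
    print(classification_metrics(model_labels, reference))
```

A high negative predictive value alongside a lower positive predictive value, as in the abstract, is the pattern expected when a sign is absent in most records: even a small false-positive rate then produces a noticeable share of the positive calls.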

Code & Models

Repositories

ucdavis/llm_vet_records
Official

Taxonomy

Topics: Time Series Analysis and Forecasting

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Position-Wise Feed-Forward Layer · Label Smoothing · Linear Warmup With Cosine Annealing