Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation

Won Seok Jang; Sharmin Sultana; Zonghai Yao; Hieu Tran; Zhichao Yang; Sunjae Kwon; Hong Yu

arXiv:2502.16022·cs.CL·May 8, 2026

TL;DR

This study evaluates various strategies including prompting, fine-tuning, and data augmentation to enhance large language models' ability to identify and prioritize key medical terms in electronic health record notes, especially in low-resource settings.

Contribution

It demonstrates that fine-tuning and data augmentation significantly improve model performance in extracting medical jargon, with open-source models outperforming closed-source counterparts when properly augmented.

Findings

01 Fine-tuning and data augmentation improved F1 and MRR scores.

02 Open-source models with augmentation outperformed closed-source models.

03 Few-shot prompting outperformed zero-shot prompting in vanilla models.

Abstract

OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Our results show that fine-tuning and data augmentation improved performance over other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while…
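The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the textbook definitions (set-based F1 over extracted terms, and reciprocal rank of the first gold term in a ranked list) and exact matching after lowercasing and whitespace cleanup; the paper's actual matching rules and MRR variant are not stated in this summary and may differ. All function names and the sample terms are hypothetical.

```python
from typing import Iterable, Sequence


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(term.lower().split())


def f1_score(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between predicted terms and gold terms for one note."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in gold}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked: Sequence[str], gold: Iterable[str]) -> float:
    """1 / position of the first gold term in the ranked list, else 0."""
    ref = {normalize(t) for t in gold}
    for position, term in enumerate(ranked, start=1):
        if normalize(term) in ref:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    ranked_lists: Sequence[Sequence[str]],
    gold_lists: Sequence[Iterable[str]],
) -> float:
    """Average reciprocal rank over notes."""
    pairs = list(zip(ranked_lists, gold_lists))
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in pairs) / len(pairs)


# Hypothetical model outputs (ranked by importance) and gold annotations.
ranked_predictions = [
    ["hypertension", "patient", "metformin"],
    ["follow-up", "atrial fibrillation"],
]
gold_terms = [
    ["Hypertension", "metformin", "neuropathy"],
    ["atrial fibrillation"],
]

note_f1 = f1_score(ranked_predictions[0], gold_terms[0])
mrr = mean_reciprocal_rank(ranked_predictions, gold_terms)
```

In this toy input the first note has two of three predicted terms correct and two of three gold terms recovered, and the first gold term appears at rank 1 in the first note and rank 2 in the second.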
