README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP

Zonghai Yao; Nandyala Siddharth Kantu; Guanghao Wei; Hieu Tran; Zhangqi Duan; Sunjae Kwon; Zhichao Yang; README annotation team; Hong Yu

arXiv:2312.15561·cs.CL·October 28, 2024·1 cites


TL;DR

This paper introduces a large dataset and a data-centric NLP pipeline to automatically generate patient-friendly lay definitions of medical terms, improving understanding and supporting patient education.

Contribution

The paper creates the README dataset of over 50,000 unique (medical term, lay definition) pairs and develops a retrieval-augmented generation approach to improve model accuracy and reduce hallucinations.
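
The retrieval-augmented idea can be illustrated with a minimal sketch: look up reference text for a term, then place it in the prompt so the model grounds its lay definition in that text. Everything below is an assumption for illustration only: the tiny `KNOWLEDGE_BASE`, the word-overlap (Jaccard) scoring, and the helper names `retrieve` and `build_prompt` are hypothetical and are not taken from the paper or its repository. No model is called; the sketch stops at prompt construction.

```python
import re

# Hypothetical mini knowledge base; entries are illustrative, not from README.
KNOWLEDGE_BASE = {
    "hypertension": "Abnormally high blood pressure in the arteries.",
    "myocardial infarction": "Death of heart muscle caused by loss of blood supply.",
    "dyspnea": "Difficult or labored breathing.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(term, context, k=1):
    """Return the k entries with the highest Jaccard overlap with term + context."""
    query = tokenize(term + " " + context)
    scored = []
    for name, definition in KNOWLEDGE_BASE.items():
        candidate = tokenize(name + " " + definition)
        union = query | candidate
        score = len(query & candidate) / len(union) if union else 0.0
        scored.append((score, name, definition))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, definition) for _, name, definition in scored[:k]]


def build_prompt(term, context, k=1):
    """Assemble a generation prompt that includes the retrieved reference text."""
    lines = [
        "Explain the medical term in plain language for a patient.",
        f"Term: {term}",
        f"Context: {context}",
    ]
    for name, definition in retrieve(term, context, k):
        lines.append(f"Reference ({name}): {definition}")
    lines.append("Lay definition:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("dyspnea", "Patient reports dyspnea on exertion."))
```

A real system would likely swap the word-overlap scorer for a stronger retriever and pass the prompt to a fine-tuned model; the sketch only shows where retrieved text enters the input.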

Findings

01 Models fine-tuned on high-quality data outperform some large language models.

02 The dataset enables effective automatic and human evaluation of lay definitions.

03 Open-source models can match or surpass proprietary models in patient education tasks.

Abstract

The advancement in healthcare has shifted focus toward patient-centric approaches, particularly in self-care and patient education, facilitated by access to Electronic Health Records (EHR). However, medical jargon in EHRs poses significant challenges in patient comprehension. To address this, we introduce a new task of automatically generating lay definitions, aiming to simplify complex medical terms into patient-friendly lay language. We first created the README dataset, an extensive collection of over 50,000 unique (medical term, lay definition) pairs and 300,000 mentions, each offering context-aware lay definitions manually annotated by domain experts. We have also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality. We then used README as the training data for models and leveraged a Retrieval-Augmented…

Code & Models

Repositories

seasonyao/noteaid-readme (official)

Datasets

bio-nlp-umass/NoteAid-README
dataset · 17 downloads

Videos

README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Text Readability and Simplification
