Sample Size in Natural Language Processing within Healthcare Research

Jaya Chaturvedi; Diana Shamsutdinova; Felix Zimmer; Sumithra Velupillai; Daniel Stahl; Robert Stewart; Angus Roberts

arXiv:2309.02237·cs.LG·September 6, 2023

TL;DR

This study provides guidelines for selecting appropriate sample sizes in healthcare NLP classification tasks, demonstrating how different classifiers perform with varying sample sizes and offering recommendations for future research.

Contribution

It introduces a methodology for estimating sample sizes in healthcare NLP studies, validated through simulations on real medical datasets with different classifiers.

Findings

01

Larger sample sizes (>1000) generally improve classifier performance.

02

Support vector machines and BERT benefit from bigger samples, while K-nearest neighbors performs well with smaller samples.

03

Guidelines help predict performance and determine suitable sample sizes for healthcare text classification.
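The relationship between training sample size and classifier performance described in these findings is usually inspected with a learning curve. The following is a minimal sketch of that idea only, not the paper's pipeline: it assumes synthetic one-dimensional Gaussian data and a nearest-centroid classifier as a stand-in for the classifiers named above, and the helper names (`make_data`, `fit_centroids`, `learning_curve`) are hypothetical.

```python
import random
import statistics


def make_data(n_per_class, rng, shift=2.0):
    """Draw balanced synthetic data: class 0 ~ N(0, 1), class 1 ~ N(shift, 1)."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            data.append((rng.gauss(label * shift, 1.0), label))
    return data


def fit_centroids(train):
    """'Train' a nearest-centroid classifier: one feature mean per class."""
    return {
        label: statistics.fmean([x for x, y in train if y == label])
        for label in (0, 1)
    }


def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid carries the true label."""
    correct = 0
    for x, y in test:
        pred = min(centroids, key=lambda label: abs(x - centroids[label]))
        correct += pred == y
    return correct / len(test)


def learning_curve(sizes_per_class, n_test_per_class=1000, repeats=20, seed=0):
    """Mean test accuracy for each training size, averaged over fresh training draws."""
    rng = random.Random(seed)
    test = make_data(n_test_per_class, rng)  # one fixed held-out set
    curve = []
    for n in sizes_per_class:
        scores = [
            accuracy(fit_centroids(make_data(n, rng)), test)
            for _ in range(repeats)
        ]
        curve.append((2 * n, statistics.fmean(scores)))
    return curve


curve = learning_curve([5, 50, 500])
for n_train, acc in curve:
    print(f"n_train={n_train:5d}  mean accuracy={acc:.3f}")
```

With real clinical text, the synthetic generator would be replaced by repeated subsampling of a labelled corpus, and the stand-in classifier by the model under study; the averaging over repeats is what makes small-sample estimates comparable to large-sample ones.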

Abstract

Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension,…
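The abstract's point that sample size determines the precision of estimates can be illustrated with the standard normal-approximation (Wald) interval for a proportion, such as a classifier's accuracy on a held-out set. This is a generic textbook calculation offered as a sketch, not the method used in the paper; the function names and the 95% default (`z = 1.96`) are assumptions made for the example.

```python
import math


def ci_half_width(p, n, z=1.96):
    """Half-width of the Wald interval for a proportion p estimated from n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return z * math.sqrt(p * (1.0 - p) / n)


def min_sample_size(p, half_width, z=1.96):
    """Smallest n for which the Wald half-width is at most `half_width`."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return math.ceil(z * z * p * (1.0 - p) / half_width ** 2)


# Precision improves with the square root of n: 100x the data, 10x narrower.
for n in (100, 1000, 10000):
    print(n, round(ci_half_width(0.8, n), 4))

# Samples needed to estimate an accuracy near 0.8 to within +/- 0.05.
print(min_sample_size(0.8, 0.05))
```

The Wald interval is known to be unreliable when the proportion is close to 0 or 1 or when n is small, so the numbers are best read as rough planning figures.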

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Dense Connections · Attention Dropout · Residual Connection · Adam · Weight Decay