Determinants of Training Corpus Size for Clinical Text Classification
Jaya Chaturvedi, Saniya Deshpande, Chenkai Ma, Robert Cobb, Angus Roberts, Robert Stewart, Daniel Stahl, Diana Shamsutdinova

TL;DR
This study investigates how training corpus size and vocabulary properties affect clinical text classification performance. It finds that around 600 documents suffice to reach 95% of maximum performance and that the numbers of strong and noisy predictive words shape the learning curves.
Contribution
It provides empirical evidence on how much training data clinical NLP classification tasks need and how vocabulary properties affect that requirement, using pre-trained BERT embeddings followed by Random Forest classifiers.
Findings
600 documents achieve 95% of maximum performance
More strong predictors improve accuracy; more noisy words decrease it
Vocabulary properties significantly influence learning curves
Abstract
Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. For that, 200-500 documents are typically annotated. The number is constrained by time and costs and lacks justification of the sample size requirements and their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset containing hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words embeddings. Results: Learning curves varied significantly across the 10 classification tasks…
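The "95% of maximum performance" finding can be read off a learning curve with a few lines of code. The sketch below is illustrative only: the function name `smallest_size_reaching` and the example scores are hypothetical, not taken from the paper, and it assumes non-negative scores where higher is better.

```python
from typing import Optional, Sequence


def smallest_size_reaching(
    sizes: Sequence[int],
    scores: Sequence[float],
    fraction: float = 0.95,
) -> Optional[int]:
    """Return the smallest training size whose score reaches
    `fraction` of the best score on the curve, or None if none does."""
    if len(sizes) != len(scores) or len(sizes) == 0:
        raise ValueError("sizes and scores must be non-empty and equal in length")
    target = fraction * max(scores)
    # Walk the curve from the smallest corpus upward.
    for size, score in sorted(zip(sizes, scores)):
        if score >= target:
            return size
    return None


# Made-up learning curve for illustration (NOT results from the paper).
example_sizes = [100, 200, 400, 600, 1000, 5000, 10000]
example_scores = [0.60, 0.70, 0.78, 0.82, 0.84, 0.855, 0.86]

print(smallest_size_reaching(example_sizes, example_scores))  # 600
```

With these made-up numbers the target is 0.95 × 0.86 = 0.817, and 600 documents is the first size at or above it. In practice the scores would come from repeated train/test runs at each corpus size, so the threshold carries sampling noise.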
Taxonomy
Topics: Machine Learning in Healthcare · Medical Coding and Health Information · Artificial Intelligence in Healthcare and Education
