Determinants of Training Corpus Size for Clinical Text Classification

Jaya Chaturvedi; Saniya Deshpande; Chenkai Ma; Robert Cobb; Angus Roberts; Robert Stewart; Daniel Stahl; Diana Shamsutdinova

arXiv:2601.15846 · cs.CL · January 23, 2026

TL;DR

This study investigates how training corpus size and vocabulary properties affect clinical text classification performance, finding that around 600 documents are enough to reach 95% of maximum performance and that the balance of strong and noisy predictive words shapes the learning curves.

Contribution

It provides empirical evidence on optimal training data size and vocabulary factors for clinical NLP classification tasks using BERT and Random Forests.

Findings

01. 600 documents achieve 95% of maximum performance

02. More strong predictors improve accuracy; noisy words decrease it

03. Vocabulary properties significantly influence learning curves
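
To make the "95% of maximum performance" criterion in the first finding concrete, the sketch below shows one way such a threshold could be read off a learning curve. The function name `smallest_size_reaching` and the values in `toy_curve` are hypothetical and chosen only for illustration; they are not results or code from the paper, and the paper may define the threshold differently.

```python
from typing import Optional, Sequence, Tuple


def smallest_size_reaching(
    curve: Sequence[Tuple[int, float]], fraction: float = 0.95
) -> Optional[int]:
    """Return the smallest training size whose score is at least
    `fraction` times the best score observed on the curve.

    `curve` is a sequence of (training_size, score) pairs with
    non-negative scores. Returns None for an empty curve.
    """
    if not curve:
        return None
    best = max(score for _, score in curve)
    target = fraction * best
    # Walk the curve from the smallest training size upwards.
    for size, score in sorted(curve):
        if score >= target:
            return size
    return None


# Hypothetical learning curve: illustrative values only, NOT paper results.
toy_curve = [
    (100, 0.60),
    (200, 0.68),
    (400, 0.74),
    (600, 0.77),
    (1000, 0.79),
    (10000, 0.80),
]

threshold_size = smallest_size_reaching(toy_curve)
print(threshold_size)
```

Defining the threshold relative to the best observed score, rather than an absolute score, keeps the criterion comparable across tasks whose maximum performance differs.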

Abstract

Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. In practice, 200-500 documents are typically annotated. That number is constrained by time and cost, and it is rarely justified in terms of sample size requirements or their relationship to the vocabulary properties of the text. Methods: Using the publicly available MIMIC-III dataset, which contains hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying the training corpus size from 100 to 10,000 documents. We analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words embeddings. Results: Learning curves varied significantly across the 10 classification tasks…
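
The training-size sweep described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract gives only the endpoints (100 and 10,000 documents), so the intermediate values in `SIZES` are hypothetical, and drawing nested subsets (each smaller set contained in the next larger one) is a design choice assumed here rather than something the abstract states. The helper name `nested_training_subsets` is likewise invented for this example.

```python
import random
from typing import Dict, Iterable, List


def nested_training_subsets(
    doc_ids: Iterable[int], sizes: Iterable[int], seed: int = 0
) -> Dict[int, List[int]]:
    """Shuffle the document IDs once and take prefixes of increasing
    length, so every smaller subset is contained in each larger one."""
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    subsets: Dict[int, List[int]] = {}
    for size in sorted(sizes):
        if size > len(shuffled):
            raise ValueError(f"requested {size} documents, only {len(shuffled)} available")
        subsets[size] = shuffled[:size]
    return subsets


# Only the endpoints 100 and 10,000 come from the abstract;
# the intermediate sizes are hypothetical.
SIZES = [100, 200, 500, 1000, 2000, 5000, 10000]

subsets = nested_training_subsets(range(10000), SIZES, seed=42)
for size in SIZES:
    print(size, len(subsets[size]))
```

Each subset would then be used to train a classifier that is scored on a fixed held-out set, giving one point on the learning curve per training size.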

Taxonomy

Topics: Machine Learning in Healthcare · Medical Coding and Health Information · Artificial Intelligence in Healthcare and Education