On the diminishing return of labeling clinical reports

Jean-Baptiste Lamare; Tobi Olatunji; Li Yao

arXiv:2010.14587·cs.CL·October 29, 2020


TL;DR

This paper demonstrates that in medical NLP, larger datasets do not always lead to better models, and that high-performing models can be obtained with relatively small labeled datasets, most likely because of the domain specificity of the task.

Contribution

It reveals the counter-intuitive finding that performant medical NLP models can be obtained from small labeled datasets, challenging the belief, carried over from non-medical NLP, that more labeled data steadily yields better models.

Findings

01. Models trained on small datasets can outperform those trained on larger ones in medical NLP.

02. Models trained on limited data outperform rule-based systems.

03. Performance plateaus or diminishes with increasing data size.

Abstract

Ample evidence suggests that, on natural language processing (NLP) problems from non-medical domains, better machine learning models may be steadily obtained by training on increasingly larger datasets. Whether the same holds true for medical NLP has so far not been thoroughly investigated. This work shows that this is not always the case. We reveal the somewhat counter-intuitive observation that performant medical NLP models may be obtained with a small amount of labeled data, contrary to common belief, most likely due to the domain specificity of the problem. We show quantitatively the effect of training data size on a fixed test set composed of two of the largest public chest x-ray radiology report datasets on the task of abnormality classification. The trained models not only make use of the training data efficiently, but also outperform the current state-of-the-art…
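
The experimental design described in the abstract (vary the training set size, evaluate every model on one fixed test set) can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the reports are synthetic, the vocabulary is invented, and a small naive Bayes classifier stands in for whatever models the paper trains. The names `make_report`, `train_naive_bayes`, and `learning_curve` are hypothetical.

```python
import math
import random
from collections import Counter

# Toy vocabulary, invented for illustration; not taken from the paper's datasets.
ABNORMAL_WORDS = ["opacity", "effusion", "consolidation", "cardiomegaly"]
NORMAL_WORDS = ["clear", "unremarkable", "normal", "intact"]
SHARED_WORDS = ["lungs", "heart", "chest", "view", "the", "is"]
VOCAB = ABNORMAL_WORDS + NORMAL_WORDS + SHARED_WORDS


def make_report(label, rng, length=6):
    """Sample a synthetic report: a bag of tokens loosely tied to its label."""
    signal = ABNORMAL_WORDS if label == 1 else NORMAL_WORDS
    return [
        rng.choice(signal) if rng.random() < 0.4 else rng.choice(SHARED_WORDS)
        for _ in range(length)
    ]


def make_dataset(n, rng):
    """Build n (tokens, label) pairs with label 1 = abnormal, 0 = normal."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((make_report(label, rng), label))
    return data


def train_naive_bayes(examples):
    """Fit a multinomial naive Bayes model with add-one smoothing."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    model = {}
    for label in (0, 1):
        total = sum(word_counts[label].values())
        log_prior = math.log((class_counts[label] + 1) / (len(examples) + 2))
        log_likelihood = {
            w: math.log((word_counts[label][w] + 1) / (total + len(VOCAB)))
            for w in VOCAB
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, tokens):
    """Return the label with the highest log-posterior score."""
    scores = {
        label: log_prior + sum(log_likelihood[t] for t in tokens)
        for label, (log_prior, log_likelihood) in model.items()
    }
    return max(scores, key=scores.get)


def learning_curve(train_sizes, test_size=500, seed=0):
    """Train on growing prefixes of one pool; score each on the same test set."""
    rng = random.Random(seed)
    test_set = make_dataset(test_size, rng)  # held fixed across all runs
    train_pool = make_dataset(max(train_sizes), rng)
    curve = []
    for n in train_sizes:
        model = train_naive_bayes(train_pool[:n])
        correct = sum(predict(model, tokens) == label for tokens, label in test_set)
        curve.append((n, correct / len(test_set)))
    return curve


if __name__ == "__main__":
    for n, acc in learning_curve([10, 50, 200, 1000]):
        print(f"train size {n:5d}  test accuracy {acc:.3f}")
```

Keeping the test set fixed while only the training size changes means any difference between rows of the output reflects the amount of labeled data rather than a change in the evaluation set. The toy numbers say nothing about the paper's actual results.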
