On the diminishing return of labeling clinical reports
Jean-Baptiste Lamare, Tobi Olatunji, Li Yao

TL;DR
This paper shows that in medical NLP, larger labeled datasets do not always yield better models: high-performing models can be obtained from relatively small labeled datasets, most likely because of the domain specificity of the problem.
Contribution
The paper reports the counter-intuitive finding that small labeled datasets can yield performant medical NLP models, challenging the belief, carried over from non-medical NLP, that training on ever-larger datasets steadily produces better models.
Findings
Models trained on small labeled datasets can outperform models trained on larger ones in medical NLP.
Models trained on limited data outperform rule-based systems.
Performance plateaus or declines as the training set grows.
Abstract
Ample evidence suggests that better machine learning models may be steadily obtained by training on increasingly larger datasets on natural language processing (NLP) problems from non-medical domains. Whether the same holds true for medical NLP has so far not been thoroughly investigated. This work shows that this is indeed not always the case. We reveal the somewhat counter-intuitive observation that performant medical NLP models may be obtained with a small amount of labeled data, contrary to common belief, most likely due to the domain specificity of the problem. We show quantitatively the effect of training data size on a fixed test set composed of two of the largest public chest X-ray radiology report datasets on the task of abnormality classification. The trained models not only make use of the training data efficiently, but also outperform the current state-of-the-art…
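The experimental design described in the abstract, measuring the effect of training data size against one fixed test set, is essentially a learning curve. The following is a minimal sketch of that procedure, not the authors' method: the toy vocabulary, the synthetic report generator, the binary normal/abnormal labels, and the naive Bayes classifier are all assumptions introduced here for illustration, and the numbers it prints say nothing about the paper's results.

```python
import math
import random
from collections import Counter

# Hypothetical toy vocabulary; NOT taken from the paper's datasets.
ABNORMAL = ["opacity", "effusion", "consolidation", "pneumothorax"]
NORMAL = ["clear", "normal", "unremarkable", "stable"]
COMMON = ["lungs", "heart", "chest", "mediastinum"]


def make_report(label, rng, length=8):
    """Generate one synthetic 'report' as a token list (illustrative only)."""
    signal = ABNORMAL if label == 1 else NORMAL
    return [rng.choice(signal) if rng.random() < 0.6 else rng.choice(COMMON)
            for _ in range(length)]


def make_dataset(n, rng):
    """Build n (tokens, label) pairs with random binary labels."""
    labels = [rng.randint(0, 1) for _ in range(n)]
    return [(make_report(y, rng), y) for y in labels]


def train_nb(examples):
    """Collect the counts needed for multinomial naive Bayes."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for tokens, y in examples:
        class_counts[y] += 1
        word_counts[y].update(tokens)
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the higher add-one-smoothed log score."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    scores = {}
    for y in (0, 1):
        # Smoothed prior, so a class absent from a tiny subset is still scored.
        score = math.log((class_counts[y] + 1) / (n + 2))
        total = sum(word_counts[y].values())
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[y][t] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)


def learning_curve(train, test, sizes):
    """Train on nested prefixes of `train`; score every model on the same `test`."""
    curve = []
    for size in sizes:
        model = train_nb(train[:size])
        correct = sum(predict_nb(model, tokens) == y for tokens, y in test)
        curve.append((size, correct / len(test)))
    return curve


rng = random.Random(0)
train_set = make_dataset(400, rng)
test_set = make_dataset(200, rng)  # held fixed across all training sizes
curve = learning_curve(train_set, test_set, [10, 50, 100, 200, 400])
for size, acc in curve:
    print(f"{size:4d} training reports -> accuracy {acc:.3f}")
```

Keeping the test set fixed while only the training subset grows is what makes the accuracies comparable across sizes; a plateau in such a curve is the pattern the summary above refers to as diminishing returns.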
