Revisiting Sample Size Determination in Natural Language Understanding

Ernie Chang; Muhammad Hassan Rashid; Pin-Jie Lin; Changsheng Zhao; Vera Demberg; Yangyang Shi; Vikas Chandra

arXiv:2307.00374 · cs.CL · July 4, 2023 · 1 citation

TL;DR

This paper investigates methods for estimating the optimal sample size in NLP tasks, proposing a simple approach to predict maximum model performance early in data annotation, thereby aiding resource-efficient model development.

Contribution

It introduces a novel, effective method to predict the upper bound of model performance from limited data, enhancing data annotation strategies in NLP.

Findings

01

Predicts maximum model performance to within 0.9% mean absolute error (MAE) using only 10% of the data.

02

Demonstrates effectiveness across four language understanding tasks.

03

Provides a practical tool for data quality and sample size estimation.

Abstract

Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing the overall budgets for annotation. It pertains to both active learning and traditional data annotation, and is particularly beneficial for low-resource scenarios. Nevertheless, it remains a largely under-explored area of research in NLP. We therefore explored various techniques for estimating the training sample size necessary to achieve a targeted performance value. We derived a simple yet effective approach to predict the maximum achievable model performance based on a small amount of training samples, which serves as an early indicator during data annotation for data quality and sample size determination. We performed ablation studies on four language understanding tasks, and showed that the proposed approach allows us to forecast model…
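To make the idea in the abstract concrete, here is a minimal sketch of learning-curve extrapolation: fit a saturating curve to scores measured on small training subsets, read off its asymptote as the predicted maximum performance, and invert it to estimate the sample size needed for a target score. The inverse power law form `score = a - b * n**(-c)` is an assumption chosen for illustration; the abstract does not say which function the authors use, and the names `fit_inverse_power_law` and `samples_needed` are hypothetical, not taken from the paper or its repository.

```python
import math


def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score ~ a - b * n**(-c).

    Grid-searches the exponent c; for each c, a and b follow from
    ordinary least squares on x = n**(-c). Returns (a, b, c), where
    a is the asymptote, i.e. the predicted maximum performance.
    """
    if c_grid is None:
        c_grid = [i / 100 for i in range(5, 201)]  # c in [0.05, 2.00]
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for c in c_grid:
        x = [n ** (-c) for n in sizes]
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        if sxx == 0:
            continue  # all sizes identical; nothing to fit
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, scores))
        slope = sxy / sxx              # slope of score vs. x, equals -b
        a = y_mean - slope * x_mean    # intercept, the asymptote
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    if best is None:
        raise ValueError("need at least two distinct training sizes")
    _, a, b, c = best
    return a, b, c


def samples_needed(a, b, c, target):
    """Smallest n with a - b * n**(-c) >= target.

    Returns None when the target is at or above the asymptote a,
    or when the fitted curve is not increasing (b <= 0).
    """
    if target >= a or b <= 0:
        return None
    return math.ceil((b / (a - target)) ** (1 / c))


# Synthetic, illustrative numbers only (not results from the paper):
# accuracy measured after training on small subsets of a labeled pool.
subset_sizes = [50, 100, 200, 400, 800]
subset_scores = [0.62, 0.70, 0.76, 0.80, 0.83]

a, b, c = fit_inverse_power_law(subset_sizes, subset_scores)
print(f"predicted maximum performance: {a:.3f}")
print(f"samples needed for 0.85: {samples_needed(a, b, c, 0.85)}")
```

A grid search over the exponent keeps the sketch dependency-free; with SciPy available, a nonlinear least-squares fit would be the more usual choice. Extrapolations from very few points are sensitive to noise, so the asymptote is best treated as a rough early indicator.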

Code & Models

Repositories

pjlintw/sample-size (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms