Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification
Kush Dubey

TL;DR
This paper investigates whether pretraining on unlabeled test data biases few-shot NLP benchmarks and finds no evidence of overoptimism, emphasizing the importance of multiple training folds for reliable evaluation.
Contribution
It empirically measures the bias that pretraining on unlabeled test data could introduce into few-shot NLP benchmarks, and recommends that such benchmarks include multiple training folds.
Findings
No evidence of overoptimism from test data pretraining
Recommends multiple training folds for robust evaluation
Highlights importance of repeated subsampling in experiments
Abstract
Few-shot learning benchmarks are critical for evaluating modern NLP techniques. It is possible, however, that benchmarks favor methods which easily make use of unlabeled text, because researchers can use unlabeled text from the test set to pretrain their models. Given the dearth of research on this potential problem, we run experiments to quantify the bias caused by pretraining on unlabeled test set text instead of on unlabeled, independently drawn text. Controlled few-shot and zero-shot experiments on 25 classification tasks and 3 language models -- BERT, GPT-2, and Mistral 7B -- do not find evidence of overoptimism. Furthermore, we demonstrate the importance of repeated subsampling when studying few-shot text classification, and recommend that few-shot learning benchmarks include multiple training folds. Code and data are available at https://github.com/kddubey/pretrain-on-test/.
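The repeated subsampling described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code (which is in the linked repository). The function names, the "extra" label for the independently drawn unlabeled text, and the split sizes are all assumptions made for this example. Each repeat draws disjoint train, extra, and test index sets. A model would then be pretrained once on the extra text and once on the test text, and the paired accuracy difference summarized across repeats.

```python
import random
from statistics import mean, stdev


def draw_subsamples(n_pool, n_train, n_extra, n_test, n_repeats, seed=0):
    """Draw disjoint train / extra / test index sets, repeated n_repeats times.

    Hypothetical helper: "extra" stands for unlabeled text drawn
    independently of the test set.
    """
    total = n_train + n_extra + n_test
    if total > n_pool:
        raise ValueError("pool is too small for the requested split sizes")
    rng = random.Random(seed)
    subsamples = []
    for _ in range(n_repeats):
        # Sampling without replacement keeps the three sets disjoint.
        idx = rng.sample(range(n_pool), total)
        subsamples.append({
            "train": idx[:n_train],
            "extra": idx[n_train:n_train + n_extra],
            "test": idx[n_train + n_extra:],
        })
    return subsamples


def summarize_differences(acc_extra, acc_test):
    """Mean and standard deviation of paired differences (test minus extra)."""
    diffs = [t - e for e, t in zip(acc_extra, acc_test)]
    spread = stdev(diffs) if len(diffs) > 1 else 0.0
    return mean(diffs), spread
```

A mean difference near zero across many repeats would be consistent with the reported finding of no overoptimism. The spread across repeats indicates how much a conclusion drawn from a single training fold could vary.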
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Sparse Evolutionary Training · WordPiece · Linear Warmup With Linear Decay · Linear Layer · Residual Connection · Cosine Annealing · Byte Pair Encoding · BERT
