Task Contamination: Language Models May Not Be Few-Shot Anymore
Changmao Li, Jeffrey Flanigan

TL;DR
This paper argues that the impressive zero-shot and few-shot performance of large language models may be inflated by task contamination, especially on datasets released before the models' training data creation date, raising concerns about true generalization.
Contribution
The study provides empirical evidence of task contamination affecting LLM evaluation and introduces methods to detect and analyze this issue across multiple models and datasets.
Findings
LLMs perform better on datasets released before their training data creation date.
Task contamination is a significant factor in zero-shot and few-shot evaluation results.
On classification tasks with no possibility of task contamination, LLMs rarely show statistically significant improvements over simple majority baselines.
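The comparison behind these findings can be illustrated with a minimal sketch. This is not the authors' code. It assumes a hypothetical `Result` record per dataset (release date, model accuracy, majority-baseline accuracy), a hypothetical training data cutoff, and made-up numbers. It only counts how often the model beats the baseline on each side of the cutoff, and does not perform the statistical testing a real analysis would need.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    # Hypothetical record: one evaluated dataset for one model.
    dataset: str
    released: date        # dataset release date
    model_acc: float      # model accuracy (zero-shot or few-shot)
    majority_acc: float   # accuracy of always predicting the majority class


def split_by_cutoff(results, cutoff):
    """Partition results into datasets released before / on-or-after the cutoff."""
    before = [r for r in results if r.released < cutoff]
    after = [r for r in results if r.released >= cutoff]
    return before, after


def beat_rate(results):
    """Fraction of datasets on which the model beats the majority baseline."""
    if not results:
        return float("nan")
    wins = sum(r.model_acc > r.majority_acc for r in results)
    return wins / len(results)


# Made-up numbers, for illustration only.
cutoff = date(2021, 9, 1)  # assumed training data creation date
results = [
    Result("toy_a", date(2019, 5, 1), 0.82, 0.55),
    Result("toy_b", date(2020, 11, 1), 0.74, 0.60),
    Result("toy_c", date(2022, 3, 1), 0.58, 0.57),
    Result("toy_d", date(2023, 1, 1), 0.49, 0.52),
]

before, after = split_by_cutoff(results, cutoff)
print("beat rate before cutoff:", beat_rate(before))
print("beat rate after cutoff:", beat_rate(after))
```

A large gap between the two rates on datasets of comparable difficulty would be consistent with task contamination, though a count like this is only suggestive on its own.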
Abstract
Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
Methods: Attention Is All You Need · Cosine Annealing · Attention Dropout · Linear Layer · Multi-Head Attention · Adam · Dense Connections · Linear Warmup With Cosine Annealing · Weight Decay
