Task Contamination: Language Models May Not Be Few-Shot Anymore

Changmao Li; Jeffrey Flanigan

arXiv:2312.16337 · cs.CL · January 2, 2024 · 6 cites

TL;DR

This paper finds that the strong zero-shot and few-shot performance of large language models may be significantly influenced by task contamination, especially on datasets released before the models' training data was collected, which raises concerns about how well these models truly generalize.

Contribution

The study provides empirical evidence of task contamination affecting LLM evaluation and introduces methods to detect and analyze this issue across multiple models and datasets.

Findings

01. LLMs perform better on datasets released before their training data creation date than on datasets released after it.

02. Task contamination is a significant factor in zero-shot and few-shot evaluation results.

03. LLMs show minimal improvement over simple baselines on uncontaminated classification tasks.
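
The comparison against "simple baselines" in the last finding can be made concrete with a small sketch. The summary above does not say which baseline is meant, so this example assumes a majority-class baseline (always predict the most frequent training label). The function names and the `margin` parameter are hypothetical; a careful evaluation would use a statistical significance test rather than a fixed margin.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label.

    Assumption: "simple baseline" means a majority-class predictor.
    """
    if not train_labels or not test_labels:
        raise ValueError("both label lists must be non-empty")
    # most_common(1) returns [(label, count)] for the top label.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


def beats_baseline(model_accuracy, baseline_accuracy, margin=0.0):
    """True if the model exceeds the baseline by more than `margin`.

    `margin` is a hypothetical stand-in for a proper significance test.
    """
    return model_accuracy > baseline_accuracy + margin
```

On a heavily imbalanced task the majority baseline can already be high, which is why a raw accuracy figure alone says little about whether a model has learned the task.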

Abstract

Large language models (LLMs) offer impressive performance in various zero-shot and few-shot tasks. However, their success in zero-shot and few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how zero-shot and few-shot performance of LLMs has changed chronologically over time. Utilizing GPT-3 series models and several other recent open-sourced LLMs, and controlling for dataset difficulty, we find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exists task contamination on zero-shot and few-shot evaluation for datasets released prior to the LLMs' training data creation date. Additionally, we utilize training data inspection, task example extraction, and a…
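
The before/after comparison described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' procedure: the `DatasetResult` record, the function names, and the idea of scoring each dataset by whether the model beats a baseline on it (one plausible way to account for dataset difficulty) are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetResult:
    """Hypothetical record of one model's result on one dataset."""
    name: str
    release_date: date
    model_accuracy: float
    baseline_accuracy: float


def split_by_cutoff(results, training_cutoff):
    """Partition results into datasets released before vs. on/after the cutoff."""
    before = [r for r in results if r.release_date < training_cutoff]
    after = [r for r in results if r.release_date >= training_cutoff]
    return before, after


def fraction_beating_baseline(results):
    """Share of datasets where the model outscores the baseline.

    Returns None for an empty group so an absent group is not read as 0%.
    """
    if not results:
        return None
    wins = sum(1 for r in results if r.model_accuracy > r.baseline_accuracy)
    return wins / len(results)
```

A noticeably higher fraction for the "before" group than for the "after" group would be consistent with the pattern the abstract reports, although it would not by itself establish contamination.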

Datasets

reglab/legal_rag_hallucinations
dataset· 147 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education

Methods: Attention Is All You Need · Cosine Annealing · Attention Dropout · Linear Layer · Multi-Head Attention · Adam · Dense Connections · Linear Warmup With Cosine Annealing · Weight Decay