RAFT: A Real-World Few-Shot Text Classification Benchmark

Neel Alex; Eli Lifland; Lewis Tunstall; Abhishek Thakur; Pegah Maham; C. Jess Riedel; Emmie Hine; Carolyn Ashurst; Paul Sedille; Alexis Carlier; Michael Noetel; Andreas Stuhlmüller

arXiv:2109.14076 · cs.CL · January 20, 2022 · 5 citations

TL;DR

The paper introduces RAFT, a benchmark for real-world few-shot text classification tasks, highlighting current model limitations and the gap between AI performance and human expertise in practical scenarios.

Contribution

RAFT provides a new benchmark with naturally occurring tasks and deployment-like evaluation, enabling better measurement of AI progress in real-world applications.

Findings

1. Current models struggle with reasoning over long texts.
2. Models have difficulty with tasks involving many classes.
3. Non-expert human baselines exceed GPT-3's F1 scores by an average of 0.11.

Abstract

Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The…

Code & Models

Repositories

oughtinc/raft-baselines (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Attention Dropout · Weight Decay