Few-shot Mining of Naturally Occurring Inputs and Outputs

Mandar Joshi; Terra Blevins; Mike Lewis; Daniel S. Weld; Luke Zettlemoyer

arXiv:2205.04050·cs.CL·May 10, 2022

TL;DR

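This paper mines naturally occurring input-output pairs from large corpora with a two-stage supervised mining function trained on a seed set of only 100 examples. Augmenting the seed set with the mined pairs improves few-shot performance on reading comprehension and abstractive summarization, reducing the amount of manual labeling needed.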

Contribution

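It presents a two-stage mining technique for extracting naturally occurring input-output pairs as training data: a biencoder-based, recall-oriented dense search pairs inputs with candidate outputs, and a crossencoder-based filter re-ranks those candidates for better precision.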

Findings

01

Improves SQuAD-style reading comprehension by 13 F1 over a BART-large baseline fine-tuned only on the seed set

02

Improves XSum abstractive summarization by 1.46 ROUGE-L

03

Mines naturally occurring pairs, rather than model-generated ones, that mimic the style of the seed set across multiple tasks
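
For readers unfamiliar with the reading comprehension metric, the sketch below shows the token-level F1 conventionally reported for SQuAD-style tasks: answers are lowercased, stripped of punctuation and English articles, and compared as bags of tokens. It is an illustration of the standard metric, on the assumption that the paper reports it in the usual way; the function names `normalize` and `token_f1` are chosen here for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "in Paris, France" against the reference "Paris" has precision 1/3 and recall 1, giving an F1 of 0.5. Corpus-level scores are usually the average of this value over examples, scaled to 0-100, so a gain of 13 F1 is 13 points on that scale.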

Abstract

Creating labeled natural language training data is expensive and requires significant human effort. We mine input-output examples from large corpora using a supervised mining function trained on a small seed set of only 100 examples. The mining consists of two stages -- (1) a biencoder-based recall-oriented dense search which pairs inputs with potential outputs, and (2) a crossencoder-based filter which re-ranks the output of the biencoder stage for better precision. Unlike model-generated data augmentation, our method mines naturally occurring high-quality input-output pairs to mimic the style of the seed set for multiple tasks. On SQuAD-style reading comprehension, augmenting the seed set with the mined data results in an improvement of 13 F1 over a BART-large baseline fine-tuned only on the seed set. Likewise, we see improvements of 1.46 ROUGE-L on XSum abstractive summarization.
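
The two-stage pipeline described in the abstract can be sketched in a few lines. This is a toy illustration of the control flow only, not the authors' implementation: the learned biencoder is replaced by a normalized bag-of-words vector, the learned crossencoder by a Jaccard overlap score, and every name (`embed`, `recall_stage`, `joint_score`, `filter_stage`, `mine`) as well as the parameters `k` and `threshold` are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for one biencoder tower: L2-normalized bag of words."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def dot(u, v):
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def recall_stage(inputs, outputs, k):
    """Stage 1: embed each side independently, keep the top-k outputs per input."""
    out_vecs = [embed(o) for o in outputs]
    candidates = []
    for inp in inputs:
        q = embed(inp)
        ranked = sorted(((dot(q, v), j) for j, v in enumerate(out_vecs)), reverse=True)
        candidates.extend((inp, outputs[j]) for _, j in ranked[:k])
    return candidates


def joint_score(inp, out):
    """Toy stand-in for a crossencoder: scores the pair jointly (Jaccard overlap)."""
    a, b = set(tokenize(inp)), set(tokenize(out))
    return len(a & b) / len(a | b) if a | b else 0.0


def filter_stage(candidates, threshold):
    """Stage 2: re-rank candidate pairs by joint score, keep those above threshold."""
    scored = sorted(((joint_score(i, o), i, o) for i, o in candidates),
                    key=lambda t: t[0], reverse=True)
    return [(i, o) for s, i, o in scored if s >= threshold]


def mine(inputs, outputs, k=2, threshold=0.5):
    return filter_stage(recall_stage(inputs, outputs, k), threshold)
```

The split reflects the usual trade-off between the two encoder types: independent embeddings make it cheap to search a large pool of candidates, while the joint scorer is more expensive and is therefore applied only to the short list that survives the first stage. In the paper's setting both scorers would be trained on the seed set, and the surviving pairs would be added to the seed set as extra training data.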

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification