Few-shot Mining of Naturally Occurring Inputs and Outputs
Mandar Joshi, Terra Blevins, Mike Lewis, Daniel S. Weld, and Luke Zettlemoyer

TL;DR
This paper introduces a method for mining high-quality input-output pairs from large corpora using a two-stage supervised approach, significantly reducing the need for manual labeling and improving task performance.
Contribution
It presents a novel two-stage mining technique combining dense recall and re-ranking to extract natural input-output pairs for training, enhancing data efficiency.
Findings
Improves SQuAD-style reading comprehension by 13 F1 over a BART-large baseline fine-tuned only on the seed set
Achieves a 1.46 ROUGE-L improvement on XSum abstractive summarization
Demonstrates effective natural data augmentation for multiple NLP tasks
Abstract
Creating labeled natural language training data is expensive and requires significant human effort. We mine input-output examples from large corpora using a supervised mining function trained using a small seed set of only 100 examples. The mining consists of two stages: (1) a biencoder-based recall-oriented dense search which pairs inputs with potential outputs, and (2) a crossencoder-based filter which re-ranks the output of the biencoder stage for better precision. Unlike model-generated data augmentation, our method mines naturally occurring high-quality input-output pairs to mimic the style of the seed set for multiple tasks. On SQuAD-style reading comprehension, augmenting the seed set with the mined data results in an improvement of 13 F1 over a BART-large baseline fine-tuned only on the seed set. Likewise, we see improvements of 1.46 ROUGE-L on XSum abstractive summarization.
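The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`embed`, `recall_stage`, `cross_score`, `filter_stage`, `mine`), the parameters `k` and `threshold`, and the example sentences are all hypothetical, and the trained neural biencoder and crossencoder are replaced here by toy bag-of-words and token-overlap scorers so the example runs with the standard library alone. What it preserves is the structure: stage 1 encodes inputs and outputs independently and keeps the top-k candidates per input (recall), while stage 2 scores each candidate pair jointly and keeps only high-scoring pairs (precision).

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for one biencoder tower: an L2-normalised bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def recall_stage(inputs, outputs, k=2):
    """Stage 1 (recall): encode each side independently, keep top-k outputs per input."""
    out_vecs = [embed(o) for o in outputs]
    candidates = []
    for inp in inputs:
        q = embed(inp)
        ranked = sorted(range(len(outputs)),
                        key=lambda j: dot(q, out_vecs[j]), reverse=True)
        candidates.extend((inp, outputs[j]) for j in ranked[:k])
    return candidates


def cross_score(inp, out):
    """Toy stand-in for a crossencoder: scores the pair jointly (Jaccard token overlap)."""
    a, b = set(inp.lower().split()), set(out.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_stage(candidates, threshold=0.4):
    """Stage 2 (precision): re-score candidates, drop weak pairs, sort best first."""
    scored = [(cross_score(i, o), i, o) for i, o in candidates]
    kept = [s for s in scored if s[0] >= threshold]
    return sorted(kept, key=lambda s: s[0], reverse=True)


def mine(inputs, outputs, k=2, threshold=0.4):
    """Run both stages and return (score, input, output) triples."""
    return filter_stage(recall_stage(inputs, outputs, k), threshold)


if __name__ == "__main__":
    # Hypothetical example data, not taken from the paper.
    questions = ["who wrote the origin of species",
                 "what is the capital of france"]
    sentences = ["charles darwin wrote the origin of species",
                 "paris is the capital of france",
                 "the weather is mild today"]
    for score, q, s in mine(questions, sentences):
        print(f"{score:.2f}  {q!r} -> {s!r}")
```

The split mirrors the usual trade-off between the two model types: independent encoding lets candidate outputs be embedded once and searched cheaply, while joint scoring is more expensive per pair and is therefore applied only to the short list that survives the first stage.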
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
