Generate, Annotate, and Learn: NLP with Synthetic Text

Xuanli He; Islam Nassar; Jamie Kiros; Gholamreza Haffari; Mohammad Norouzi

arXiv:2106.06168 · cs.LG · June 1, 2022 · 1 cites

TL;DR

This paper introduces the GAL framework, which uses synthetic unlabeled text generated by language models to improve performance on NLP tasks through knowledge distillation, self-training, and few-shot learning, achieving state-of-the-art results on GLUE among 6-layer transformers.

Contribution

The paper proposes a unified framework for using synthetic text in various learning paradigms and provides theoretical and empirical analysis of generation strategies.

Findings

01

GAL yields significant gains across knowledge distillation, self-training, and few-shot learning applications.

02

Generating synthetic unlabeled text and annotating it afterwards is more effective than generating synthetic labeled text directly.

03

State-of-the-art results on GLUE with 6-layer transformers.

Abstract

This paper studies the use of language models as a source of synthetic unlabeled text for NLP. We formulate a general framework called "generate, annotate, and learn (GAL)" to take advantage of synthetic text within knowledge distillation, self-training, and few-shot learning applications. To generate high-quality task-specific text, we either fine-tune LMs on inputs from the task of interest, or prompt large LMs with few examples. We use the best available classifier to annotate synthetic text with soft pseudo labels for knowledge distillation and self-training, and use LMs to obtain hard labels for few-shot learning. We train new supervised models on the combination of labeled and pseudo-labeled data, which results in significant gains across several applications. We investigate key components of GAL and present theoretical and empirical arguments against the use of…
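The three steps named in the abstract (generate, annotate, learn) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_generate` and `toy_teacher` are hypothetical stand-ins for the fine-tuned LM and the best available classifier, and `build_gal_dataset` and `soft_cross_entropy` are names invented here for illustration. The sketch only shows how gold labels and soft pseudo labels might be merged into one training set and scored with a soft-target cross-entropy.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A training example pairs a text with a distribution over classes.
Example = Tuple[str, List[float]]


def one_hot(label: int, num_classes: int) -> List[float]:
    """Turn a hard (gold) label into a distribution over classes."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def soft_cross_entropy(target: Sequence[float], predicted: Sequence[float]) -> float:
    """Cross-entropy between a soft target and a predicted distribution."""
    eps = 1e-12  # avoids log(0)
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def build_gal_dataset(
    labeled: List[Tuple[str, int]],
    generate: Callable[[int], List[str]],
    teacher: Callable[[str], List[float]],
    num_synthetic: int,
    num_classes: int,
) -> List[Example]:
    """Generate synthetic text, annotate it, and merge it with labeled data."""
    synthetic_texts = generate(num_synthetic)                 # generate
    pseudo = [(text, teacher(text)) for text in synthetic_texts]  # annotate
    gold = [(text, one_hot(y, num_classes)) for text, y in labeled]
    return gold + pseudo                                      # data to learn from


# Hypothetical stand-in for a task-specific language model (assumption).
def toy_generate(n: int) -> List[str]:
    return [f"synthetic sentence {i}" for i in range(n)]


# Hypothetical stand-in for the teacher classifier; returns soft labels (assumption).
def toy_teacher(text: str) -> List[float]:
    return [0.3, 0.7]


labeled_data = [("great movie", 1), ("dull plot", 0)]
train_set = build_gal_dataset(
    labeled_data, toy_generate, toy_teacher, num_synthetic=3, num_classes=2
)

# Loss of a uniform student prediction on every example in the combined set.
uniform = [0.5, 0.5]
losses = [soft_cross_entropy(target, uniform) for _, target in train_set]
```

In a real setting the student would be trained by minimizing this loss over `train_set`. The soft labels from the teacher carry its confidence, whereas the gold examples reduce to ordinary cross-entropy.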

Code & Models

Repositories

xlhex/gal_syntex
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Knowledge Distillation · Cosine Annealing · Residual Connection · Linear Warmup With Cosine Annealing · Attention Dropout · Dense Connections