Improving Large-scale Paraphrase Acquisition and Generation

Yao Dou; Chao Jiang; Wei Xu

arXiv:2210.03235·cs.CL·November 9, 2022

TL;DR

This paper introduces MultiPIT, a multi-topic Twitter paraphrase corpus with high-quality annotations for paraphrase identification and generation. Models fine-tuned on it reach state-of-the-art identification results and generate more diverse paraphrases.

Contribution

The paper presents a new multi-topic Twitter paraphrase dataset with improved annotation quality and task-specific paraphrase definitions, which improves both paraphrase identification and paraphrase generation models.

Findings

01. State-of-the-art 84.2 F1 in paraphrase identification

02. Models trained on MultiPIT produce more diverse, high-quality paraphrases

03. New dataset improves both identification and generation tasks

Abstract

This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130k sentence pairs with crowdsourcing (MultiPIT_crowd) and expert (MultiPIT_expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and task-specific paraphrase definition, the best pre-trained language model fine-tuned on our dataset achieves the state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also…
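The F1 figure quoted in the abstract is the harmonic mean of precision and recall. As a minimal sketch of how such a score is typically computed for binary paraphrase identification, assuming label 1 means "paraphrase" and 0 means "non-paraphrase": the function name and the toy labels below are illustrative assumptions, not the paper's evaluation code or data.

```python
def f1_score(gold, pred):
    """Binary F1 for the positive (paraphrase) class.

    gold, pred: equal-length sequences of 0/1 labels (hypothetical encoding:
    1 = paraphrase, 0 = non-paraphrase).
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up labels, for illustration only).
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
score = f1_score(gold, pred)
print(f"F1 = {score:.3f}")
```

A score reported as 84.2 corresponds to 0.842 on this 0 to 1 scale.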

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
