Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory

Hannah Chen; Yangfeng Ji; David Evans

arXiv:2011.01856 · cs.CL · November 4, 2020 · 1 citation

TL;DR

This paper introduces a graph-based method to automatically augment paraphrase datasets by inferring labels through transitivity and correcting mislabels with structural balance theory, leading to improved NLP model accuracy.

Contribution

It presents a novel approach combining graph theory and structural balance to enhance dataset quality and size for better paraphrase detection models.

Findings

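01. Paraphrase models trained on the automatically enhanced training sets are more accurate.

02. Transitivity-based label inference over the paraphrase graph enlarges the dataset without additional manual labeling.

03. Structural balance theory identifies likely mislabeled pairs, whose labels are then flipped.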
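
The structural-balance finding can be illustrated with a minimal sketch. It assumes a signed graph in which +1 marks a paraphrase pair and -1 a non-paraphrase pair, and it treats a triangle with exactly one negative edge as unbalanced (if A and B are paraphrases and B and C are paraphrases, A and C should be too). The function names and the "flip the edge that appears in the most unbalanced triangles" heuristic are hypothetical simplifications for illustration, not necessarily the selection rule used in the paper.

```python
from collections import defaultdict
from itertools import combinations


def unbalanced_triangles(signs):
    """Return triangles that contain exactly one negative edge.

    signs maps frozenset({u, v}) to +1 (paraphrase) or -1 (non-paraphrase).
    Assumption: only the "two positive, one negative" pattern is treated
    as inconsistent; three mutually non-paraphrase sentences are fine.
    """
    neighbors = defaultdict(set)
    for edge in signs:
        u, v = tuple(edge)
        neighbors[u].add(v)
        neighbors[v].add(u)

    found = set()
    for u in list(neighbors):
        for v, w in combinations(sorted(neighbors[u]), 2):
            if frozenset((v, w)) not in signs:
                continue  # u, v, w do not form a closed triangle
            edges = [frozenset((u, v)), frozenset((u, w)), frozenset((v, w))]
            if sum(signs[e] == -1 for e in edges) == 1:
                found.add(frozenset((u, v, w)))
    return found


def flip_most_suspicious(signs):
    """Flip the edge that participates in the most unbalanced triangles.

    Hypothetical heuristic for illustration only. Returns the updated
    sign dictionary and the flipped edge (or None if nothing was flipped).
    """
    counts = defaultdict(int)
    for tri in unbalanced_triangles(signs):
        for pair in combinations(sorted(tri), 2):
            counts[frozenset(pair)] += 1
    if not counts:
        return dict(signs), None

    # Break ties deterministically by the sorted node names.
    suspect = max(counts, key=lambda e: (counts[e], sorted(e)))
    fixed = dict(signs)
    fixed[suspect] = -fixed[suspect]
    return fixed, suspect
```

Calling `flip_most_suspicious` repeatedly until it returns `None` as the flipped edge would give a simple greedy repair loop. A real pipeline would likely want a stopping criterion and some check that each flip reduces the number of unbalanced triangles.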

Abstract

Most NLP datasets are manually labeled, so suffer from inconsistent labeling or limited size. We propose methods for automatically improving datasets by viewing them as graphs with expected semantic properties. We construct a paraphrase graph from the provided sentence pair labels, and create an augmented dataset by directly inferring labels from the original sentence pairs using a transitivity property. We use structural balance theory to identify likely mislabelings in the graph, and flip their labels. We evaluate our methods on paraphrase models trained using these datasets starting from a pretrained BERT model, and find that the automatically-enhanced training sets result in more accurate models.
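
The transitivity-based augmentation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: labels are 1 for paraphrase and 0 for non-paraphrase, sentences linked by paraphrase edges are grouped into clusters with a union-find structure, every unlabeled pair inside a cluster is inferred to be a paraphrase, and a non-paraphrase edge between two clusters is propagated to all unlabeled cross-cluster pairs. The function name `infer_labels` and the input format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def infer_labels(labeled_pairs):
    """Infer labels for unlabeled sentence pairs via transitivity.

    labeled_pairs: iterable of (sentence_a, sentence_b, label) tuples,
    with label 1 for paraphrase and 0 for non-paraphrase (assumed format).
    Returns a dict mapping frozenset({a, b}) to the inferred label.
    """
    labeled_pairs = list(labeled_pairs)
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge sentences connected by paraphrase edges.
    for a, b, label in labeled_pairs:
        root_a, root_b = find(a), find(b)
        if label == 1:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for sentence in list(parent):
        clusters[find(sentence)].add(sentence)

    known = {frozenset((a, b)) for a, b, _ in labeled_pairs}
    inferred = {}

    # Positive inference: A ~ B and B ~ C imply A ~ C.
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            key = frozenset((a, b))
            if key not in known:
                inferred[key] = 1

    # Negative inference: A ~ B and B !~ C imply A !~ C.
    for a, b, label in labeled_pairs:
        if label != 0:
            continue
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # labels conflict; leave this to mislabel detection
        for x, y in product(clusters[root_a], clusters[root_b]):
            key = frozenset((x, y))
            if key not in known:
                inferred[key] = 0
    return inferred
```

The inferred pairs can then be appended to the original training set. Pairs whose original labels contradict each other (a non-paraphrase edge inside a paraphrase cluster) are skipped here, since resolving them is the job of the mislabel-correction step.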

Code & Models

Repositories

hannahxchen/automatic-paraphrase-dataset-augmentation
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: FLIP · Linear Layer · Softmax · Dense Connections · WordPiece · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Adam · Residual Connection