Extracting and filtering paraphrases by bridging natural language inference and paraphrasing
Matej Klemen, Marko Robnik-Šikonja

TL;DR
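This paper introduces a method that uses natural language inference (NLI) to extract paraphrasing datasets from NLI datasets and to clean existing paraphrasing datasets. Evaluation with pretrained transformer models shows that the extracted datasets are of high quality and reveals substantial noise in two existing paraphrasing datasets.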
Contribution
It proposes a bidirectional entailment approach to extract paraphrases from NLI datasets and to clean existing paraphrasing datasets, demonstrating high-quality results.
Findings
High-quality paraphrasing datasets extracted
Significant noise detected in existing datasets
Transformer models effective in evaluation
Abstract
Paraphrasing is a useful natural language processing task that can contribute to more diverse generated or translated texts. Natural language inference (NLI) and paraphrasing share some similarities and can benefit from a joint approach. We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets. Our approach is based on bidirectional entailment; namely, if two sentences can be mutually entailed, they are paraphrases. We evaluate our approach using several large pretrained transformer language models in the monolingual and cross-lingual setting. The results show high quality of extracted paraphrasing datasets and surprisingly high noise levels in two existing paraphrasing datasets.
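The bidirectional entailment criterion described in the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the `EntailmentScorer` interface, the `threshold` parameter, and the function names are hypothetical, and `toy_scorer` is a token-overlap stand-in rather than a real NLI model. In practice the scorer would presumably wrap a pretrained transformer NLI classifier.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]
# Hypothetical interface: returns an entailment score in [0, 1]
# for the direction premise -> hypothesis.
EntailmentScorer = Callable[[str, str], float]


def is_paraphrase(s1: str, s2: str, score: EntailmentScorer,
                  threshold: float = 0.5) -> bool:
    """Bidirectional entailment: s1 must entail s2 AND s2 must entail s1."""
    return score(s1, s2) >= threshold and score(s2, s1) >= threshold


def filter_paraphrases(pairs: Iterable[Pair], score: EntailmentScorer,
                       threshold: float = 0.5) -> List[Pair]:
    """Keep only the pairs that entail each other in both directions."""
    return [(a, b) for a, b in pairs if is_paraphrase(a, b, score, threshold)]


def toy_scorer(premise: str, hypothesis: str) -> float:
    """Toy stand-in, NOT a real NLI model: the fraction of hypothesis
    tokens that also appear in the premise."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(h & p) / len(h) if h else 0.0


pairs = [
    # Same tokens in both sentences: the toy scorer gives 1.0 both ways.
    ("a man plays a guitar", "a guitar a man plays"),
    # The first sentence adds "outside": full score in one direction only.
    ("a man plays a guitar outside", "a man plays a guitar"),
]
kept = filter_paraphrases(pairs, toy_scorer, threshold=1.0)
print(kept)
```

The second pair shows why both directions are checked: the longer sentence scores as entailing the shorter one, but not the reverse, so the pair is rejected as a paraphrase. The same filter could be applied either to the sentence pairs of an NLI dataset (extraction) or to the pairs of an existing paraphrasing dataset (cleaning).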
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
