Paraphrase Detection on Noisy Subtitles in Six Languages

Eetu Sjöblom; Mathias Creutz; Mikko Aulamo

arXiv:1809.07978·cs.CL·September 24, 2018·1 cites

TL;DR

This paper studies automatic paraphrase detection on noisy subtitle data in six European languages, comparing two supervised sentence embedding models and their robustness to noisy training data.

Contribution

It trains and evaluates two supervised sentence embedding models, a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model, and shows that GRAN performs better and is more robust on noisy multilingual subtitle data.

Findings

01 GRAN outperforms WA and is more robust to noisy training data

02 More and noisier training data gives better results than less and cleaner data

03 Domain mismatch between training and test data lowers performance on other datasets

Abstract

We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on other datasets, without reaching the same level of performance, because of domain mismatch between training and test data.
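
To make the word-averaging idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the toy vectors in `EMBEDDINGS`, the cosine-similarity decision rule, the `threshold` value, and all function names are assumptions made for illustration. In the paper the embeddings are learned with supervision, whereas this sketch uses fixed, hand-picked values.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for illustration only.
# A trained WA model would learn these values from paraphrase pairs.
EMBEDDINGS = {
    "thank": [0.9, 0.1, 0.0],
    "thanks": [0.8, 0.2, 0.0],
    "you": [0.1, 0.9, 0.0],
    "a": [0.0, 0.1, 0.1],
    "lot": [0.7, 0.0, 0.1],
    "close": [0.0, 0.1, 0.9],
    "the": [0.0, 0.2, 0.1],
    "door": [0.1, 0.0, 0.8],
}


def embed(sentence):
    """Sentence embedding = element-wise mean of the known word vectors."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def is_paraphrase(s1, s2, threshold=0.7):
    """Assumed decision rule: paraphrase if similarity reaches the threshold."""
    return cosine(embed(s1), embed(s2)) >= threshold


if __name__ == "__main__":
    print(is_paraphrase("thank you", "thanks a lot"))
    print(is_paraphrase("thank you", "close the door"))
```

Word averaging ignores word order entirely. A GRAN model instead combines the word vectors with recurrent hidden states before averaging, so the sketch above corresponds only to the simpler of the two models.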

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques