Paraphrase Detection on Noisy Subtitles in Six Languages
Eetu Sjöblom, Mathias Creutz, Mikko Aulamo

TL;DR
This paper explores automatic paraphrase detection on noisy subtitle data across six European languages, comparing models and analyzing robustness to noise and data quality.
Contribution
It trains and evaluates two supervised sentence embedding models, a word-averaging (WA) model and a gated recurrent averaging network (GRAN), and shows that GRAN performs better and is more robust on noisy multilingual subtitle data.
Findings
GRAN outperforms WA in noisy conditions
Larger, noisier training sets give better results than smaller, cleaner ones
Domain mismatch affects performance on other datasets
Abstract
We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on other datasets, without reaching the same level of performance, because of domain mismatch between training and test data.
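As a rough illustration of the word-averaging idea named in the abstract, the sketch below averages per-word vectors into a sentence embedding and compares two sentences by cosine similarity. It is not the authors' implementation: the vectors in `TOY_VECTORS` are hand-made rather than trained, the function names are hypothetical, and the cosine-plus-threshold decision rule (with `threshold=0.8`) is an assumption, since the abstract does not state how embeddings are compared.

```python
import math

def average_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue
        for i, v in enumerate(vec):
            total[i] += v
        count += 1
    if count == 0:
        return total  # all-zero vector when no token is known
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_paraphrase(s1, s2, vectors, dim, threshold=0.8):
    """Assumed decision rule: cosine similarity at or above a threshold."""
    e1 = average_embedding(s1.lower().split(), vectors, dim)
    e2 = average_embedding(s2.lower().split(), vectors, dim)
    return cosine(e1, e2) >= threshold

# Toy, hand-made 2-dimensional vectors for illustration only (not trained).
TOY_VECTORS = {
    "hi": [1.0, 0.0],
    "hello": [0.9, 0.1],
    "there": [0.5, 0.5],
    "goodbye": [0.0, 1.0],
}
```

With these toy vectors, `is_paraphrase("hi there", "hello there", TOY_VECTORS, 2)` is true and `is_paraphrase("hi", "goodbye", TOY_VECTORS, 2)` is false. In a supervised setting the word vectors would be learned from labeled paraphrase pairs instead of being fixed by hand, and a GRAN-style model would replace the plain average with a recurrent, gated combination of the word vectors.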
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
