Corpus-Based Paraphrase Detection Experiments and Review
Tedo Vrbanec, Ana Mestrovic

TL;DR
This paper reviews and compares the performance of various corpus-based models, especially deep learning approaches, for paraphrase detection across multiple datasets, highlighting their competitiveness and potential for further development.
Contribution
It provides a comprehensive performance overview of eight models on three datasets, identifying effective preprocessing and model configurations for paraphrase detection.
Findings
Deep learning models are highly competitive with traditional methods.
Optimal preprocessing and hyper-parameter choices improve model performance.
DL models show significant potential for future research in paraphrase detection.
Abstract
Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMO, and USE) evaluated on three publicly available corpora: the Microsoft Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis Crowd Paraphrase Corpus 2011. Through a great number of experiments, we decided on the most appropriate choices for text pre-processing, hyper-parameters, sub-model selection where sub-models exist (e.g., Skipgram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings and those…
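The abstract describes a general recipe: represent each text with a corpus-based model, compare the two representations with a distance measure, and call the pair a paraphrase when the similarity crosses a threshold. The sketch below illustrates that recipe with TF-IDF and cosine similarity in plain Python. It is not the authors' implementation. The function names, the tokenizer, the smoothed IDF formula, the choice to compute IDF over the two texts only, and the default threshold of 0.5 are all assumptions made for illustration.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Assumed minimal pre-processing: lowercase, split on whitespace,
    # strip surrounding punctuation, drop empty tokens.
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def tfidf_vectors(docs):
    # Build sparse TF-IDF vectors (dict: term -> weight) for a list of texts.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))
    # Smoothed IDF (an assumption) so terms shared by all texts keep weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: cnt * idf[t] for t, cnt in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def is_paraphrase(text_a, text_b, threshold=0.5):
    # Hypothetical decision rule: similarity at or above the threshold.
    # The 0.5 default is a placeholder, not a value reported in the paper.
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    return cosine(vec_a, vec_b) >= threshold
```

In practice the IDF weights would be estimated from a larger corpus and the threshold tuned on labeled pairs, which is the kind of choice the abstract says was settled experimentally.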
Taxonomy
Methods: GloVe Embeddings
