A German Corpus for Text Similarity Detection Tasks
Juan-Manuel Torres-Moreno, Gerardo Sierra, Peter Peinl

TL;DR
This paper introduces a new German corpus designed for text similarity detection, enabling evaluation of various similarity measures at both document and sentence levels.
Contribution
The paper provides a novel German corpus specifically created for assessing text similarity algorithms, filling a gap in resources for German language processing.
Findings
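Several simple similarity measures were calculated on the corpus using a library of similarity functions
The corpus is intended for evaluating the effectiveness of different similarity measures
It supports automatic similarity assessment between pairs of texts, for whole documents or for individual sentences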
Abstract
Text similarity detection aims at measuring the degree of similarity between a pair of texts. Corpora available for text similarity detection are designed to evaluate algorithms that assess the level of paraphrase among documents. In this paper we present a German text corpus for similarity detection. The purpose of this corpus is to automatically assess the similarity between a pair of texts and to evaluate different similarity measures, either for whole documents or for individual sentences. To this end, we have calculated several simple measures on our corpus based on a library of similarity functions.
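As an illustration of what "simple measures" at document and sentence level might look like, here is a minimal Python sketch. It is not the authors' library: the measures chosen (Jaccard overlap and bag-of-words cosine), the function names, the regex tokenizer, and the naive sentence splitter are all assumptions made for this example, and the abstract does not say which measures the paper actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (Unicode-aware, so umlauts and ß are kept)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Jaccard similarity: shared word types divided by all word types."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity between raw term-frequency vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sentences(document):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def sentence_similarities(doc_a, doc_b, measure=jaccard):
    """Score every sentence pair across two documents with the given measure."""
    return [
        (sa, sb, measure(sa, sb))
        for sa in split_sentences(doc_a)
        for sb in split_sentences(doc_b)
    ]
```

The same functions can be applied to whole documents (pass the full text to `jaccard` or `cosine`) or to individual sentences (via `sentence_similarities`). A real evaluation would likely need a proper German tokenizer and sentence segmenter, since the splitter above breaks on abbreviations such as "z. B.".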
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
