A German Corpus for Text Similarity Detection Tasks

Juan-Manuel Torres-Moreno; Gerardo Sierra; Peter Peinl

arXiv:1703.03923 · cs.IR · March 14, 2017 · 2 cites

Open Access

TL;DR

This paper introduces a new German corpus designed for text similarity detection, enabling evaluation of various similarity measures at both document and sentence levels.

Contribution

The paper provides a novel German corpus specifically created for assessing text similarity algorithms, filling a gap in resources for German language processing.

Findings

01

Calculated multiple similarity measures on the corpus

02

Evaluated effectiveness of different similarity functions

03

Facilitated automatic similarity assessment for texts

Abstract

Text similarity detection aims to measure the degree of similarity between a pair of texts. Existing corpora for this task are designed to evaluate algorithms that assess the level of paraphrase between documents. In this paper we present a German text corpus for similarity detection. Its purpose is to support automatic assessment of the similarity between a pair of texts and the evaluation of different similarity measures, for whole documents as well as for individual sentences. To this end, we have calculated several simple measures on our corpus using a library of similarity functions.
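The abstract does not name the similarity functions or the library used. As an illustration only, the sketch below shows two common simple measures, Jaccard overlap and bag-of-words cosine similarity, applied to a pair of short German sentences. The function names, the tokenizer, and the example sentences are assumptions made for this sketch and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and extract word tokens; \w is Unicode-aware for str in
    # Python 3, so German umlauts and ß stay inside tokens.
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    # Share of distinct tokens common to both texts (assumed measure,
    # not necessarily one of those used in the paper).
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    # Cosine of the angle between the two term-frequency vectors.
    count_a, count_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count_a[t] * count_b[t] for t in count_a.keys() & count_b.keys())
    sq_a = sum(v * v for v in count_a.values())
    sq_b = sum(v * v for v in count_b.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    s1 = "Der Hund schläft"
    s2 = "Der Hund bellt"
    print(jaccard(s1, s2))
    print(cosine(s1, s2))
```

The same functions work at document level or sentence level, since they only depend on the token lists passed in; a sentence-level evaluation would first split each document into sentences and compare them pairwise.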

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques