Finnish Paraphrase Corpus

Jenna Kanerva; Filip Ginter; Li-Hsin Chang; Iiro Rastas; Valtteri Skantsi; Jemina Kilpeläinen; Hanna-Mari Kupari; Jenna Saarni; Maija Sevón; Otto Tarkka

arXiv:2103.13103 · cs.CL · March 25, 2021 · 5 cites

TL;DR

This paper presents the first fully manually annotated Finnish paraphrase corpus, with 53,572 pairs drawn from alternative subtitles and news headings, and shows that manual candidate selection is a feasible way to collect high-quality paraphrases.

Contribution

It introduces a Finnish paraphrase corpus and a manual candidate selection method, and demonstrates that the method is feasible in terms of both cost and quality.

Findings

01

98% of pairs are manually classified as paraphrases at least in their given context

02

Manual candidate selection is feasible and effective

03

Corpus enables future Finnish NLP research

Abstract

In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all paraphrase pairs in our corpus, 98% are manually classified to be paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility for high-quality paraphrase selection in terms of both cost and quality.
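
To make the abstract's figures concrete, the following is a minimal sketch of the arithmetic behind the 98% claim. The pair count and percentage come from the abstract; the function names and the assumption that 98% was rounded to the nearest whole percent are illustrative only and are not taken from the paper.

```python
import math

# Figures quoted in the abstract.
TOTAL_PAIRS = 53_572
PARAPHRASE_SHARE = 0.98  # reported as a whole percentage


def estimated_paraphrase_pairs(total: int, share: float) -> int:
    """Approximate number of pairs classified as paraphrases."""
    return round(total * share)


def rounding_bounds(total: int, share: float, half_step: float = 0.005) -> tuple[int, int]:
    """Range of counts consistent with the share, ASSUMING it was
    rounded to the nearest whole percent (hypothetical assumption)."""
    low = math.ceil(total * (share - half_step))
    high = math.floor(total * (share + half_step))
    return low, high


if __name__ == "__main__":
    estimate = estimated_paraphrase_pairs(TOTAL_PAIRS, PARAPHRASE_SHARE)
    low, high = rounding_bounds(TOTAL_PAIRS, PARAPHRASE_SHARE)
    print(f"about {estimate} pairs (between {low} and {high} under the rounding assumption)")
```

Under these assumptions, roughly 52,500 of the 53,572 pairs are paraphrases at least in context, which leaves on the order of a thousand pairs that were not classified as such.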

Code & Models

Repositories

TurkuNLP/Turku-paraphrase-corpus (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems