Finnish Paraphrase Corpus
Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Jenna Saarni, Maija Sevón, Otto Tarkka

TL;DR
This paper presents the first fully manually annotated Finnish paraphrase corpus with over 53,000 pairs, demonstrating a manual candidate selection method for high-quality paraphrase identification.
Contribution
The paper introduces the first fully manually annotated Finnish paraphrase corpus, together with a manual candidate selection method that the authors show to be feasible in terms of both cost and quality.
Findings
98% of pairs are manually classified as paraphrases at least in their given context
Manual candidate selection is feasible and effective
Corpus enables future Finnish NLP research
Abstract
In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Of all paraphrase pairs in our corpus, 98% are manually classified as paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility for high-quality paraphrase selection in terms of both cost and quality.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
