Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts

Kairit Sirts; Kairit Peekman

arXiv:2011.07868·cs.CL·November 17, 2020

TL;DR

This paper evaluates the performance of three sentence segmentation and word tokenization systems on noisy Estonian web texts, highlighting challenges and differences from well-formed texts.

Contribution

It provides a manual annotation of Estonian web texts and compares the performance of EstNLTK, Stanza, and UDPipe on this challenging dataset.

Findings

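1. EstNLTK obtains the highest sentence segmentation performance of the three systems on this dataset.

2. All systems perform worse on web texts than on well-formed texts.

3. The sentence segmentation performance of Stanza and UDPipe on web texts remains well below their results on the more well-formed Estonian UD test set.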

Abstract

Texts obtained from the web are noisy and do not necessarily follow the orthographic sentence and word boundary rules. Thus, sentence segmentation and word tokenization systems that have been developed on well-formed texts might not perform as well on unedited web texts. In this paper, we first describe the manual annotation of sentence boundaries of an Estonian web dataset and then present the evaluation results of three existing sentence segmentation and word tokenization systems on this corpus: EstNLTK, Stanza and UDPipe. While EstNLTK obtains the highest sentence segmentation performance of the three systems on this dataset, the sentence segmentation performance of Stanza and UDPipe remains well below the results obtained on the more well-formed Estonian UD test set.
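To make concrete what comparing systems on sentence segmentation can involve, here is a minimal sketch of boundary-level precision, recall and F1. It is an illustration under stated assumptions, not the paper's evaluation script: the function names `boundary_offsets` and `boundary_prf` are hypothetical, and the choice to align gold and predicted sentences by counting non-whitespace characters is an assumption made so that differences in whitespace handling between systems do not affect the score.

```python
def boundary_offsets(sentences):
    """Return the set of sentence-end offsets, counted in non-whitespace characters.

    Counting only non-whitespace characters lets two segmentations of the same
    text be compared even if they normalise spaces differently (an assumption
    of this sketch, not something stated in the abstract).
    """
    offsets = set()
    position = 0
    for sentence in sentences:
        position += sum(1 for ch in sentence if not ch.isspace())
        offsets.add(position)
    return offsets


def boundary_prf(gold_sentences, predicted_sentences):
    """Compute precision, recall and F1 over sentence-end boundaries.

    Note: the text-final boundary is included, so it always counts as a match
    when both segmentations cover the same text.
    """
    gold = boundary_offsets(gold_sentences)
    predicted = boundary_offsets(predicted_sentences)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical web-style example: the second sentence lacks final punctuation,
# and the imagined system fails to split after it.
gold = ["Tere!", "kuidas läheb", "Hästi :)"]
predicted = ["Tere!", "kuidas läheb Hästi :)"]
precision, recall, f1 = boundary_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case the predicted segmentation misses one of three gold boundaries, so precision stays perfect while recall drops, which is the typical failure pattern when a segmenter relies on punctuation that web text omits.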

Code & Models

Repositories

ksirts/EWTB_sentence_seg (official)
