Diacritics Restoration using BERT with Analysis on Czech language

Jakub Náplava; Milan Straka; Jana Straková

arXiv:2105.11408·cs.CL·May 25, 2021

TL;DR

This paper introduces a BERT-based model for diacritics restoration, evaluated on 12 languages, with a detailed error analysis on Czech. Manual annotation shows that roughly 44% of mispredictions are not real errors: they are either plausible variants or system corrections of errors in the data.

Contribution

The paper presents a novel BERT-based architecture for diacritics restoration and provides a comprehensive error analysis on Czech, highlighting the nature of mispredictions.

Findings

01

Roughly 44% of the manually annotated Czech mispredictions are not true errors

02

19% of mispredictions are plausible variants

03

25% of mispredictions are system corrections of erroneous data

Abstract

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%), or the system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
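
To make the task concrete, here is a minimal sketch of how undiacritized input can be derived from diacritized text; a restoration system is then expected to map that input back to the original. This is an illustration only, not the authors' code: the function name `strip_diacritics` and the example sentence are assumptions, and the approach relies on Unicode canonical decomposition, which covers Czech but does not handle letters without a decomposition (for example Polish "ł").

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Hypothetical helper for illustration; not taken from the paper's repository.
    """
    # Split each accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn": nonspacing mark).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


original = "Příliš žluťoučký kůň"          # diacritized target text
model_input = strip_diacritics(original)   # what a restoration system would receive
print(model_input)  # prints "Prilis zlutoucky kun"
```

A restoration model is scored by how closely its output for `model_input` matches `original`. As the abstract notes, a mismatch is not always a real error, since the reference text itself may be wrong or admit more than one valid variant.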

Code & Models

Repositories

ufal/bert-diacritics-restoration
PyTorch · Official

Taxonomy

Methods: Multi-Head Attention · Linear Layer · Linear Warmup With Linear Decay · WordPiece · Layer Normalization · Attention Dropout · Softmax · Dropout · Attention Is All You Need