Punctuation restoration in Swedish through fine-tuned KB-BERT
John Björkman Nilsson

TL;DR
This paper introduces prestoBERT, a fine-tuned Swedish BERT model for automatic punctuation restoration, achieving competitive F1-scores and demonstrating potential benefits for NLP tasks like speech-to-text and automated text processing.
Contribution
The study presents a novel fine-tuning approach for Swedish BERT to restore punctuation, with a detailed evaluation against human performance and international models.
Findings
prestoBERT achieved an F1-score of 78.9
Comparable performance to Hungarian and Chinese models
Human evaluators scored 81.7 but struggled with punctuation consistency
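The scores above are F1-scores. As a minimal sketch of how a per-mark F1 could be computed from word-level labels, the snippet below uses the standard precision/recall definition. The label names (`COMMA`, `PERIOD`, `O`) and the function `f1_per_label` are illustrative assumptions, and this summary does not state how the reported figures were averaged across marks.

```python
def f1_per_label(gold, predicted, labels):
    """Compute the F1-score for each label from two aligned label sequences."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


# Hypothetical labels: one per word, "O" meaning no punctuation after the word.
gold = ["O", "COMMA", "O", "PERIOD"]
predicted = ["O", "COMMA", "COMMA", "PERIOD"]
scores = f1_per_label(gold, predicted, ["COMMA", "PERIOD"])
```

Here one spurious comma lowers the comma precision to 0.5 while recall stays at 1.0, so the comma F1 falls below the period F1.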
Abstract
This paper presents a method for automatic punctuation restoration in Swedish using a BERT model. The method is based on KB-BERT, a publicly available neural network language model pre-trained on a Swedish corpus by the National Library of Sweden. This model was then fine-tuned for the task on a corpus of government texts. Given a lower-case, unpunctuated Swedish text as input, the model is expected to return a grammatically correct, punctuated copy of the text as output. A successful solution to this problem brings benefits for an array of NLP domains, such as speech-to-text and automated text. Only the period, comma and question mark were considered for the project, due to a lack of data for rarer marks such as the semicolon. Additionally, some marks are somewhat interchangeable with more common ones, such as exclamation points with periods. Thus, the…
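The task framing in the abstract (lower-case, unpunctuated text in, punctuated text out) is commonly handled as per-word labelling. The sketch below illustrates that framing only: it builds a training pair from punctuated text and re-inserts marks from labels. The label names and the helpers `make_training_pair` and `restore` are hypothetical and not taken from the paper, and no BERT model is involved. Capitalization is left untouched here.

```python
# Assumed label set: the three marks named in the abstract, plus "O" for none.
PUNCT_TO_LABEL = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_TO_LABEL.items()}
NO_PUNCT = "O"


def make_training_pair(text):
    """Split punctuated text into lower-case words and one label per word."""
    words, labels = [], []
    for token in text.split():
        mark = token[-1] if token[-1] in PUNCT_TO_LABEL else ""
        word = token[:-1] if mark else token
        if not word:  # skip tokens that consist of a mark only
            continue
        words.append(word.lower())
        labels.append(PUNCT_TO_LABEL.get(mark, NO_PUNCT))
    return words, labels


def restore(words, labels):
    """Re-attach the mark named by each label to its word."""
    return " ".join(w + LABEL_TO_PUNCT.get(l, "") for w, l in zip(words, labels))


words, labels = make_training_pair("Hej, hur mår du? Jag mår bra.")
restored = restore(words, labels)
```

In a setup like the one described, a fine-tuned model would predict `labels` from `words`, and a step similar to `restore` would produce the punctuated output.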
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Multi-Head Attention · Dense Connections · Dropout · Attention Dropout · Adam
