FullStop: Punctuation and Segmentation Prediction for Dutch with Transformers
Vincent Vandeghinste, Oliver Guhr

TL;DR
This paper introduces a Transformer-based punctuation and segmentation prediction system for unsegmented Dutch ASR output. It outperforms a machine translation baseline and makes speech recognition results more readable and easier to correct manually.
Contribution
The paper presents a Dutch punctuation and segmentation model based on RobBERT, extends a multilingual model for cases where the language is unknown or code switching occurs, and shows that the approach outperforms a machine translation baseline.
Findings
Model outperforms machine translation baseline
Effective on out-of-domain data with sliding window approach
Publicly available Dutch punctuation prediction system
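The sliding window idea mentioned in the findings can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window size, the overlap, the rule for merging overlapping predictions, and the `predict` callable (a stand-in for any model that returns one label per word) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def sliding_windows(n_words: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of overlapping windows covering n_words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    windows = []
    step = size - overlap
    start = 0
    while start < n_words:
        end = min(start + size, n_words)
        windows.append((start, end))
        if end == n_words:
            break
        start += step
    return windows


def predict_long(
    words: List[str],
    predict: Callable[[List[str]], List[str]],  # hypothetical per-word labeller
    size: int = 200,     # assumed window length in words
    overlap: int = 50,   # assumed number of words shared by adjacent windows
) -> List[str]:
    """Label a long word stream by running `predict` on overlapping windows."""
    labels: List[str] = [""] * len(words)
    for start, end in sliding_windows(len(words), size, overlap):
        window_labels = predict(words[start:end])
        # Assumed merge rule: in every window but the first, skip the first
        # half of the overlap, since the previous window saw those words
        # with more left context.
        skip = 0 if start == 0 else overlap // 2
        for offset in range(skip, end - start):
            labels[start + offset] = window_labels[offset]
    return labels
```

Overlapping the windows means that no word is labelled only from the very edge of a window, where the model has little context on one side.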
Abstract
When applying automated speech recognition (ASR) for Belgian Dutch (Van Dyck et al. 2021), the output consists of an unsegmented stream of words, without any punctuation. A next step is to perform segmentation and insert punctuation, making the ASR output more readable and easier to correct manually. As far as we know, there is no publicly available punctuation insertion system for Dutch that functions at a usable level. The model we present here is an extension of the models of Guhr et al. (2021) for Dutch and is made publicly available. We trained a sequence classification model, based on the Dutch language model RobBERT (Delobelle et al. 2020). For every word in the input sequence, the model predicts a punctuation marker that follows the word. We have also extended a multilingual model, for cases where the language is unknown or where code switching applies. When performing the task of…
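To make the per-word formulation concrete, the sketch below shows how one predicted marker per word could be turned back into punctuated, segmented text. It is an illustration under assumptions: the label `"0"` for "no punctuation" and the choice of `.` and `?` as sentence-final markers are assumed here, not taken from the abstract, and the labels are supplied by hand rather than by a model.

```python
from typing import List

NO_PUNCT = "0"             # assumed label meaning "nothing follows this word"
SENTENCE_END = {".", "?"}  # assumed markers that close a segment


def restore_and_segment(words: List[str], labels: List[str]) -> List[str]:
    """Attach each predicted marker to its word and split at sentence ends."""
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    sentences: List[str] = []
    current: List[str] = []
    for word, label in zip(words, labels):
        current.append(word if label == NO_PUNCT else word + label)
        if label in SENTENCE_END:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing words that have no closing marker
        sentences.append(" ".join(current))
    return sentences


words = ["dit", "is", "een", "test", "werkt", "het", "goed"]
labels = ["0", "0", "0", ".", "0", "0", "?"]
print(restore_and_segment(words, labels))
```

With the hand-written labels above, the unsegmented stream is split into two segments, one ending in a full stop and one in a question mark.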
Code & Models
- 🤗 oliverguhr/fullstop-dutch-punctuation-prediction (model) · 15 dl · ♡ 3
- 🤗 oliverguhr/fullstop-punctuation-multilingual-base (model) · 104k dl · ♡ 7
- 🤗 oliverguhr/fullstop-dutch-sonar-punctuation-prediction (model) · 884 dl · ♡ 6
- 🤗 oliverguhr/fullstop-punctuation-multilingual-sonar-base (model) · 39k dl · ♡ 4
- 🤗 Husain/fullstop-punctuation-multilingual-sonar-base (model) · 3 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Test
