Language Models of Spoken Dutch
Lyan Verwimp, Joris Pelemans, Marieke Lycke, Hugo Van hamme, Patrick Wambacq

TL;DR
This paper develops and evaluates language models trained on Flemish TV subtitles to improve speech recognition for subtitling, demonstrating the benefits of domain-specific models and interpolation with general language models.
Contribution
It introduces several subtitle-based language models for spoken Dutch, tailored to TV show domains, and discusses their integration with larger general models for better speech recognition.
Findings
Models trained on TV subtitles improve speech recognition accuracy.
Interpolating subtitle models with large general models enhances performance.
Subtitle-based models are valuable for low-resource spoken language tasks.
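The interpolation mentioned in the findings can be illustrated with a minimal sketch. It assumes linear interpolation with a single weight, which is a common choice but is not confirmed by this summary. The function names, the weight `lam`, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math

def interpolate(p_domain, p_general, lam):
    """Linearly mix two probability estimates for the same word and history.

    lam is the weight on the domain (subtitle) model; 1 - lam goes to the
    general model. Linear mixing is an assumption of this sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_domain + (1.0 - lam) * p_general

def perplexity(word_probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Made-up per-word probabilities on a tiny held-out text (illustration only).
p_subtitles = [0.20, 0.05, 0.10]
p_general = [0.10, 0.10, 0.02]

def mixed_perplexity(lam):
    """Held-out perplexity of the interpolated model for a given weight."""
    mixed = [interpolate(d, g, lam) for d, g in zip(p_subtitles, p_general)]
    return perplexity(mixed)

# Pick the weight by a coarse grid search over held-out perplexity.
best_lam = min((i / 10 for i in range(11)), key=mixed_perplexity)
```

In practice the weight would be tuned on held-out data from the target TV show domain, and language modeling toolkits typically estimate it with an iterative procedure rather than a grid search.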
Abstract
In Flanders, all TV shows are subtitled. However, the process of subtitling is a very time-consuming one and can be sped up by providing the output of a speech recognizer run on the audio of the TV show, prior to the subtitling. Naturally, this speech recognition will perform much better if the employed language model is adapted to the register and the topic of the program. We present several language models trained on subtitles of television shows provided by the Flemish public-service broadcaster VRT. This data was gathered in the context of the project STON which has as purpose to facilitate the process of subtitling TV shows. One model is trained on all available data (46M word tokens), but we also trained models on a specific type of TV show or domain/topic. Language models of spoken language are quite rare due to the lack of training data. The size of this corpus is relatively…
Taxonomy
Topics: Natural Language Processing Techniques · Subtitles and Audiovisual Media · Interpreting and Communication in Healthcare
