ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks
Valentin Pelloin, Franck Dary, Nicolas Herve, Benoit Favre, Nathalie Camelin, Antoine Laurent, Laurent Besacier

TL;DR
This paper demonstrates that large-scale ASR-generated text from diverse speech data can be effectively used to pre-train spoken language models, improving performance on various speech-related tasks.
Contribution
It introduces FlauBERT-Oral, a set of spoken language models trained on 19GB of ASR-transcribed speech, and shows that they can be beneficial compared to the original FlauBERT despite the noisy training data.
Findings
FlauBERT-Oral can outperform the original FlauBERT on downstream tasks
ASR-generated text is viable for spoken language modeling
Large-scale noisy data can enhance speech task performance
Abstract
We aim to improve spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR to 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. The new models (FlauBERT-Oral) are shared with the community and evaluated on 3 downstream tasks: spoken language understanding, classification of TV shows, and speech syntactic parsing. Results show that FlauBERT-Oral can be beneficial compared to its initial FlauBERT version, demonstrating that, despite its inherently noisy nature, ASR-generated text can be used to build spoken language models.
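As an illustration of how raw ASR text could be turned into pre-training examples, here is a minimal sketch that assumes a BERT-style masked language modeling objective, which the abstract itself does not spell out. The `mask_tokens` helper, the `<mask>` symbol, and the sample sentence are hypothetical and not taken from the paper. The sketch is also simplified: it always substitutes the mask symbol, whereas BERT-style training usually replaces some selected tokens with random tokens or leaves them unchanged.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; real tokenizers define their own


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed on masked positions only.
    """
    rng = random.Random(seed)  # local generator keeps the example reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# ASR output is typically lowercase and unpunctuated, as in this made-up line.
asr_line = "bonsoir et bienvenue dans le journal de vingt heures"
tokens = asr_line.split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Under this assumption, the same example construction would apply to both training routes mentioned in the abstract. What would differ is whether the model weights start from FlauBERT or from a random initialization.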
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
