End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining
Saksham Bassi, Giulio Duregon, Siddhartha Jalagam, David Roth

TL;DR
This paper demonstrates that large-scale audio language model pretraining enables end-to-end speech recognition systems to match or surpass two-stage models in disfluency removal, highlighting the importance of pretraining objectives.
Contribution
It shows that audio-based language models pretrained with weak self-supervised objectives can effectively perform disfluency removal, challenging the dominance of two-stage models.
Findings
Pretrained audio language models match or outperform two-stage models at disfluency removal.
Pretraining objectives significantly impact disfluency removal performance.
End-to-end models benefit from recent large-scale audio pretraining advances.
Abstract
The SOTA in transcription of disfluent and conversational speech has in recent years favored two-stage models, with separate transcription and cleaning stages. We believe that previous attempts at end-to-end disfluency removal have fallen short because of the representational advantage that large-scale language model pretraining has given to lexical models. Until recently, the high dimensionality and limited availability of large audio datasets inhibited the development of large-scale self-supervised pretraining objectives for learning effective audio representations, giving a relative advantage to the two-stage approach, which utilises pretrained representations for lexical tokens. In light of recent successes in large-scale audio pretraining, we revisit the performance comparison between two-stage and end-to-end models and find that audio-based language models pretrained using weak…
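To make the two-stage setup described in the abstract concrete, the following is a minimal toy sketch, not the authors' system. The transcription stage is a hard-coded stand-in, and the cleaning stage is a simple rule (drop filler words and immediate word repetitions) used in place of the pretrained lexical models that real two-stage systems rely on. All function names, the filler list, and the sample utterance are assumptions made for illustration.

```python
# Toy illustration of a two-stage pipeline: transcribe verbatim, then clean.
# Everything here is hypothetical; real systems use learned models for both stages.

# Assumed filler inventory, for illustration only (not taken from the paper).
FILLERS = {"uh", "um", "er", "hmm"}


def transcribe_verbatim(audio_path: str) -> str:
    """Stage 1 stand-in: a real system would run an ASR model on the audio.

    Returns a fixed disfluent transcript so the example is self-contained.
    """
    return "i um i want a a flight to uh boston"


def remove_disfluencies(transcript: str) -> str:
    """Stage 2 stand-in: rule-based cleaning of a verbatim transcript.

    Drops filler words and collapses immediate repetitions of the same word.
    """
    kept = []
    for token in transcript.lower().split():
        if token in FILLERS:
            continue  # filled pause
        if kept and kept[-1] == token:
            continue  # immediate repetition of the previous kept word
        kept.append(token)
    return " ".join(kept)


def two_stage(audio_path: str) -> str:
    """Chain the two stages: audio -> verbatim text -> fluent text."""
    return remove_disfluencies(transcribe_verbatim(audio_path))


if __name__ == "__main__":
    print(two_stage("example.wav"))
```

An end-to-end model, by contrast, would map the audio directly to the fluent transcript in a single model, with no intermediate verbatim text to clean.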
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
