End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining

Saksham Bassi; Giulio Duregon; Siddhartha Jalagam; David Roth

arXiv:2309.04516·eess.AS·September 12, 2023·1 cites

TL;DR

This paper demonstrates that large-scale audio language model pretraining enables end-to-end speech recognition systems to match or surpass two-stage models in disfluency removal, highlighting the importance of pretraining objectives.

Contribution

It shows that audio-based language models pretrained with weak self-supervised objectives can effectively perform disfluency removal, challenging the dominance of two-stage models.

Findings

1. Pretrained audio language models match or outperform two-stage models.
2. Pretraining objectives significantly impact disfluency removal performance.
3. End-to-end models benefit from recent large-scale audio pretraining advances.

Abstract

The SOTA in transcription of disfluent and conversational speech has in recent years favored two-stage models, with separate transcription and cleaning stages. We believe that previous attempts at end-to-end disfluency removal have fallen short because of the representational advantage that large-scale language model pretraining has given to lexical models. Until recently, the high dimensionality and limited availability of large audio datasets inhibited the development of large-scale self-supervised pretraining objectives for learning effective audio representations, giving a relative advantage to the two-stage approach, which utilises pretrained representations for lexical tokens. In light of recent successes in large-scale audio pretraining, we revisit the performance comparison between two-stage and end-to-end models and find that audio-based language models pretrained using weak…
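
To make the task concrete, here is a minimal sketch of what a cleaning stage in a two-stage pipeline does: it takes a verbatim transcript and returns a fluent one. This is a toy rule-based stand-in, not the authors' method or any model from the paper; the filler list `FILLERS` and the function name `clean_disfluencies` are assumptions made for illustration. An end-to-end system would instead map audio directly to the fluent transcript, with no intermediate verbatim text.

```python
import re

# Hypothetical filler inventory, for illustration only.
# Real disfluency annotation also covers repairs, restarts, and edit terms.
FILLERS = {"uh", "um", "er", "ah"}


def clean_disfluencies(transcript: str) -> str:
    """Toy cleaning stage: drop filler words and immediate word repetitions."""
    # Lowercase and keep only alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", transcript.lower())
    cleaned = []
    for word in words:
        if word in FILLERS:
            continue  # filled pause such as "uh" or "um"
        if cleaned and cleaned[-1] == word:
            continue  # immediate repetition such as "a a"
        cleaned.append(word)
    return " ".join(cleaned)


if __name__ == "__main__":
    verbatim = "I uh I want um a a flight to Boston"
    print(clean_disfluencies(verbatim))
```

Rules like these only handle the simplest disfluencies, which is one reason the cleaning stage in practice is a learned model over lexical tokens, as the abstract describes.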

Code & Models

Repositories

davidsroth/hubert-disfl (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing