TL;DR
This paper investigates training an end-to-end speech recognition model that directly produces fluent transcripts from disfluent speech, comparing its performance to traditional pipeline methods and proposing new evaluation metrics.
Contribution
It demonstrates the feasibility of end-to-end models for disfluency removal and introduces two novel metrics for evaluating such integrated systems.
Findings
End-to-end models can generate fluent transcripts directly from disfluent speech.
End-to-end models perform slightly worse than a baseline pipeline consisting of an ASR system followed by a separate disfluency detection model.
Two new metrics are proposed for evaluating integrated speech recognition and disfluency removal.
Abstract
Disfluency detection is usually an intermediate step between an automatic speech recognition (ASR) system and a downstream task. By contrast, this paper aims to investigate the task of end-to-end speech recognition and disfluency removal. We specifically explore whether it is possible to train an ASR model to directly map disfluent speech into fluent transcripts, without relying on a separate disfluency detection model. We show that end-to-end models do learn to directly generate fluent transcripts; however, their performance is slightly worse than a baseline pipeline approach consisting of an ASR system and a disfluency detection model. We also propose two new metrics that can be used for evaluating integrated ASR and disfluency models. The findings of this paper can serve as a benchmark for further research on the task of end-to-end speech recognition and disfluency removal in the…
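To make the comparison in the abstract concrete, the sketch below scores system output against a fluent reference transcript using ordinary word error rate (WER). This is the standard metric, not either of the two metrics the paper proposes, which this summary does not describe. The function name `word_error_rate` and the example utterances are hypothetical and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterance: the speaker said "i want a flight to uh to denver".
# The fluent reference has the filler and the repeated word removed.
fluent_reference = "i want a flight to denver"

# Assumed outputs for illustration only, not results from the paper.
clean_output = "i want a flight to denver"        # all disfluencies removed
leaky_output = "i want a flight to to denver"     # one repetition left in

print(word_error_rate(fluent_reference, clean_output))
print(word_error_rate(fluent_reference, leaky_output))
```

Scoring against the fluent reference means a leftover disfluency counts as an insertion error, so a single number reflects both recognition mistakes and missed disfluencies. It cannot tell the two apart, which is one plausible reason a dedicated metric for integrated systems would be useful.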
