Streaming Joint Speech Recognition and Disfluency Detection
Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

TL;DR
This paper introduces Transformer-based streaming models that jointly perform speech recognition and disfluency detection, improving accuracy and latency over traditional pipeline methods by leveraging acoustic information and multi-task learning.
Contribution
The study presents novel joint Transformer models for streaming speech recognition and disfluency detection, addressing latency and adaptation issues of previous methods.
Findings
Joint models outperform pipeline approaches in accuracy.
The multi-task model reduces latency and improves robustness.
Models were tested on Switchboard and Japanese spontaneous speech datasets.
Abstract
Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors and provide non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags indicating the starting and ending points of the disfluent part. However, it has problems with latency and standard language model adaptation, which arise from the additional…
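To make the transcript-enriched target format concrete, here is a minimal sketch of how a transcript with per-token disfluency labels could be turned into text with start and end tags around each disfluent span, and how a tagged hypothesis could be mapped back to a fluent transcript. The tag strings `<dis>` and `</dis>`, the function names, and the example sentence are assumptions made for illustration; the abstract does not specify the actual tag tokens or data format.

```python
from typing import List

# Hypothetical tag tokens: the abstract only says "special tags" mark the
# start and end of a disfluent part; the real symbols are not given here.
DISFL_START = "<dis>"
DISFL_END = "</dis>"


def enrich_transcript(tokens: List[str], is_disfluent: List[bool]) -> List[str]:
    """Wrap each contiguous run of disfluent tokens in start/end tags."""
    if len(tokens) != len(is_disfluent):
        raise ValueError("tokens and labels must have the same length")
    enriched: List[str] = []
    inside = False
    for token, flag in zip(tokens, is_disfluent):
        if flag and not inside:
            enriched.append(DISFL_START)  # a disfluent span opens here
            inside = True
        elif not flag and inside:
            enriched.append(DISFL_END)  # the span closed before this token
            inside = False
        enriched.append(token)
    if inside:
        enriched.append(DISFL_END)  # close a span that runs to the end
    return enriched


def strip_disfluencies(enriched: List[str]) -> List[str]:
    """Drop the tags and every token that falls between them."""
    fluent: List[str] = []
    inside = False
    for token in enriched:
        if token == DISFL_START:
            inside = True
        elif token == DISFL_END:
            inside = False
        elif not inside:
            fluent.append(token)
    return fluent


# Illustrative example sentence (not taken from the paper's data).
tokens = "i want a flight to boston uh i mean to denver".split()
labels = [False] * 4 + [True] * 5 + [False] * 2
tagged = enrich_transcript(tokens, labels)
print(" ".join(tagged))
print(" ".join(strip_disfluencies(tagged)))
```

In this sketch the tagged sequence is two tokens longer than the plain transcript for every disfluent span, since each span contributes one start tag and one end tag to the output.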
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Residual Connection · Softmax · Adam
