Streaming Joint Speech Recognition and Disfluency Detection

Hayato Futami; Emiru Tsunoo; Kentaro Shibata; Yosuke Kashiwagi; Takao Okuda; Siddhant Arora; Shinji Watanabe

arXiv:2211.08726·cs.CL·May 12, 2023

TL;DR

This paper introduces Transformer-based streaming models that jointly perform speech recognition and disfluency detection. Compared with traditional pipeline methods, they improve accuracy and reduce latency by leveraging acoustic information and multi-task learning.

Contribution

The study presents novel joint Transformer models for streaming speech recognition and disfluency detection, addressing latency and adaptation issues of previous methods.

Findings

01 Joint models outperform pipeline approaches in accuracy.

02 Multi-task model reduces latency and improves robustness.

03 Models tested on Switchboard and Japanese spontaneous speech datasets.

Abstract

Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection and work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information, which makes disfluency detection robust to recognition errors and provides non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags indicating the starting and ending points of the disfluent part. However, it has problems with latency and standard language model adaptation, which arise from the additional…
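
The transcript-enriched training target described in the abstract can be illustrated with a small sketch. It assumes per-token disfluency labels are available. The abstract only says "special tags", so the tag strings `<dysfl>` and `</dysfl>`, the function name `enrich_transcript`, and the example sentence are illustrative assumptions rather than the authors' implementation.

```python
from typing import List

# Hypothetical tag strings; the abstract only mentions "special tags".
START_TAG = "<dysfl>"
END_TAG = "</dysfl>"


def enrich_transcript(tokens: List[str], is_disfluent: List[bool]) -> List[str]:
    """Wrap each maximal run of disfluent tokens in start/end tags."""
    if len(tokens) != len(is_disfluent):
        raise ValueError("tokens and labels must have the same length")
    enriched: List[str] = []
    inside = False  # True while we are inside a disfluent run
    for token, flag in zip(tokens, is_disfluent):
        if flag and not inside:
            enriched.append(START_TAG)  # a disfluent run starts here
            inside = True
        elif not flag and inside:
            enriched.append(END_TAG)  # the run ended before this token
            inside = False
        enriched.append(token)
    if inside:
        enriched.append(END_TAG)  # close a run that reaches the end
    return enriched


# Made-up example: the first "i" is a repetition and is marked disfluent.
tokens = ["i", "i", "want", "a", "flight"]
labels = [True, False, False, False, False]
print(" ".join(enrich_transcript(tokens, labels)))
# prints: <dysfl> i </dysfl> i want a flight
```

In this example the tagged target has 7 tokens where the plain transcript has 5, since each disfluent run contributes two extra output tokens.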

Code & Models

Repositories

hayato-futami-s/joint-asr-dysfl (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Residual Connection · Softmax · Adam