Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection
Angelica Chen, Vicky Zayats, Daniel D. Walker, Dirk Padfield

TL;DR
This paper introduces a streaming BERT-based model for real-time disfluency detection in speech. The model balances accuracy and latency by dynamically adjusting its lookahead window, reaching state-of-the-art latency and stability while staying comparably accurate to non-incremental models.
Contribution
A novel training objective and model architecture enable real-time disfluency detection with dynamic lookahead, improving latency and stability over existing methods.
Findings
Achieves comparable accuracy to non-incremental models
Reduces detection latency and flicker in predictions
Attains state-of-the-art latency and stability scores
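To make "flicker" concrete, here is a minimal sketch of one plausible way to count it: the number of times a label that was already output for a token changes in a later partial output. The function name `count_flicker` and the label strings are hypothetical, and this is not necessarily the stability metric the paper reports.

```python
from typing import List


def count_flicker(partial_outputs: List[List[str]]) -> int:
    """Count label changes on tokens that already had a label.

    partial_outputs[k] holds the labels produced after the k-th update,
    one label per token seen so far (an assumed representation).
    """
    flips = 0
    for prev, curr in zip(partial_outputs, partial_outputs[1:]):
        # Compare only positions present in both outputs; newly added
        # tokens cannot flicker because they had no earlier label.
        flips += sum(1 for p, c in zip(prev, curr) if p != c)
    return flips


# Toy example: the first token is relabeled once, from "F" to "D".
outputs = [["F"], ["F", "F"], ["D", "F", "F"], ["D", "F", "F", "F"]]
flicker = count_flicker(outputs)
```

A model that waits before committing a label would tend to produce fewer such revisions, at the cost of extra latency.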
Abstract
In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. This post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation). However, most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays. We propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real-time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context. Essentially, the model learns to dynamically size its lookahead window. Our results demonstrate that our model produces comparably accurate…
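The wait-or-emit idea described in the abstract can be sketched as a simple streaming loop. This is only an illustration under stated assumptions: the `WAIT` label, the `stream_tag` function, and the rule-based `toy_tagger` are all hypothetical stand-ins, and the paper's actual model is a BERT-based tagger trained with its own objective, not the repetition rule used here.

```python
from typing import Callable, List, Tuple

WAIT = "WAIT"  # hypothetical label meaning "defer this token's decision"


def stream_tag(
    tokens: List[str],
    tag_prefix: Callable[[List[str]], List[str]],
) -> Tuple[List[str], List[int]]:
    """Feed tokens one at a time and commit labels in order.

    A token's label is committed as soon as the tagger stops answering
    WAIT for it. Returns the committed labels and, for each token, how
    many tokens had been seen when its label was committed.
    """
    committed: List[str] = []
    emitted_at: List[int] = []
    for t in range(1, len(tokens) + 1):
        labels = tag_prefix(tokens[:t])  # one label per token in the prefix
        while len(committed) < t:
            i = len(committed)
            end_of_stream = t == len(tokens)
            if labels[i] == WAIT and not end_of_stream:
                break  # keep waiting for more context
            # Assumption: at end of stream, an unresolved WAIT becomes FLUENT.
            committed.append("FLUENT" if labels[i] == WAIT else labels[i])
            emitted_at.append(t)
    return committed, emitted_at


def toy_tagger(prefix: List[str]) -> List[str]:
    """Stand-in tagger: flags a word as disfluent if it is immediately repeated."""
    labels = []
    for i, tok in enumerate(prefix):
        if i + 1 < len(prefix):
            labels.append("DISFLUENT" if prefix[i + 1] == tok else "FLUENT")
        else:
            # Newest token: wait only for words this toy rule treats as
            # likely to be repeated.
            labels.append(WAIT if tok in {"i", "the"} else "FLUENT")
    return labels


tokens = ["i", "i", "want", "the", "the", "book"]
labels, emitted_at = stream_tag(tokens, toy_tagger)
# Per-token latency: how many extra tokens were consumed before committing.
latency = [seen - (i + 1) for i, seen in enumerate(emitted_at)]
```

In this toy run the tagger waits one token only where extra context could change the answer, which is the accuracy/latency trade-off the abstract describes: a fixed lookahead would delay every token, while a dynamic one delays only the uncertain ones.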
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Layer Normalization · Softmax · Dense Connections · Dropout
