Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection

Angelica Chen; Vicky Zayats; Daniel D. Walker; Dirk Padfield

arXiv:2205.00620·cs.CL·May 3, 2022

TL;DR

This paper introduces a streaming BERT-based sequence tagging model for real-time disfluency detection in transcribed speech. The model balances accuracy and latency by learning to dynamically size its lookahead window, and it attains state-of-the-art latency and stability scores.

Contribution

A novel training objective and model architecture enable real-time disfluency detection with dynamic lookahead, improving latency and stability over existing methods.

Findings

01 Achieves comparable accuracy to non-incremental models

02 Reduces detection latency and flicker in predictions

03 Attains state-of-the-art latency and stability scores

Abstract

In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. This post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation). However, most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays. We propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real-time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context. Essentially, the model learns to dynamically size its lookahead window. Our results demonstrate that our model produces comparably accurate…
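
The wait-or-predict behaviour described in the abstract can be illustrated with a minimal sketch. This is not the authors' model or training objective: `toy_tagger` is a hypothetical rule-based stand-in for the BERT tagger, and the `WAIT` label, the label names `D`/`O`, and the end-of-stream fallback to `O` are all assumptions made for illustration. The sketch only shows how a per-token "wait" decision yields a lookahead window whose size varies from token to token.

```python
from typing import Callable, List, Optional, Tuple

WAIT = "WAIT"          # hypothetical label: defer the decision for this token
FILLERS = {"uh", "um"}  # toy rule: fillers can be tagged with no lookahead


def toy_tagger(prefix: List[str]) -> List[str]:
    """Rule-based stand-in for the tagger (not BERT): returns one label per prefix token."""
    labels = []
    for i, tok in enumerate(prefix):
        if tok in FILLERS:
            labels.append("D")                      # disfluent, decided immediately
        elif i + 1 < len(prefix):
            # One token of right context is available: flag exact repetitions.
            labels.append("D" if prefix[i + 1] == tok else "O")
        else:
            labels.append(WAIT)                     # newest token: ask for more context
    return labels


def stream_decode(
    tokens: List[str], tagger: Callable[[List[str]], List[str]]
) -> Tuple[List[str], List[int]]:
    """Feed tokens one at a time and commit each label the first time it is not WAIT.

    Returns the committed labels and, per token, how many extra tokens were
    consumed before its label was emitted (its effective lookahead).
    """
    committed: List[Optional[str]] = []
    emit_step: List[Optional[int]] = []
    for t in range(1, len(tokens) + 1):
        committed.append(None)
        emit_step.append(None)
        for i, label in enumerate(tagger(tokens[:t])):
            if committed[i] is None and label != WAIT:
                committed[i] = label
                emit_step[i] = t
    # Assumption: at end of stream, anything still waiting is emitted as fluent.
    for i in range(len(tokens)):
        if committed[i] is None:
            committed[i] = "O"
            emit_step[i] = len(tokens)
    delays = [emit_step[i] - (i + 1) for i in range(len(tokens))]
    return committed, delays


labels, delays = stream_decode(["i", "i", "want", "uh", "a", "flight"], toy_tagger)
print(labels)  # per-token labels, D = disfluent, O = fluent
print(delays)  # per-token lookahead actually used
```

In this toy run the filler is labelled with zero lookahead while most other tokens wait for one more token, so the delay differs by token rather than being a fixed window. Because a label is committed only once and never revised, the output cannot flicker; in the paper this waiting behaviour is learned by the model rather than hard-coded as it is here.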

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Layer Normalization · Softmax · Dense Connections · Dropout