Streaming Punctuation for Long-form Dictation with Transformers

Piyush Behre; Sharman Tan; Padma Varadharajan; Shuangyu Chang

arXiv:2210.05756 · cs.CL · December 7, 2022 · 1 citation

Open Access

TL;DR

This paper introduces a streaming punctuation method for long-form speech dictation using transformers, improving segmentation and punctuation accuracy in real-time ASR systems, and enhancing downstream machine translation quality.

Contribution

It presents a novel streaming approach with dynamic decoding windows for punctuation in ASR, addressing real-time constraints and improving segmentation and translation performance.

Findings

1. Segmentation F0.5-score improved by 13.9%.

2. BLEU score for machine translation (MT) increased by 0.66 on average.

3. Effective long-context punctuation in real-time ASR.
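
The F0.5-score in the first finding is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. That makes it sensitive to over-segmentation, since spurious sentence boundaries lower precision. The sketch below illustrates the arithmetic only. Representing segmentation as sets of boundary positions, and the names `f_beta` and `segmentation_f_beta`, are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Set


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def segmentation_f_beta(predicted: Set[int], reference: Set[int],
                        beta: float = 0.5) -> float:
    """Score predicted sentence boundaries against reference boundaries.

    Boundaries are word indices after which a sentence ends
    (a hypothetical representation chosen for this example).
    """
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return f_beta(precision, recall, beta)


# Toy comparison: the reference has sentence ends after words 4 and 9.
reference = {4, 9}
over_segmented = {2, 4, 6, 9}   # two spurious boundaries
under_segmented = {4}           # one missed boundary

print(segmentation_f_beta(over_segmented, reference))
print(segmentation_f_beta(under_segmented, reference))
```

In this toy case the over-segmented output scores lower than the under-segmented one, even though each differs from the reference by a comparable amount. That asymmetry is the behavior a precision-weighted metric is chosen for.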

Abstract

While speech recognition Word Error Rate (WER) has reached human parity for English, long-form dictation scenarios still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence tagging models are effective at capturing long bi-directional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average…
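
The abstract does not spell out the decoding procedure, so the following is only a rough sketch of one way a dynamic decoding window could work: words accumulate in a buffer, a tagger labels the whole buffer, and text is emitted up to a predicted sentence end only once enough right context has arrived. Everything here is a stated assumption. `stream_punctuate`, `toy_tagger`, and `min_lookahead` are hypothetical names, and the rule-based tagger is a stand-in for a transformer sequence tagger.

```python
from typing import Callable, Iterable, Iterator, List

# A tagger returns one label per word: "" (no punctuation), ",", ".", or "?".
Tagger = Callable[[List[str]], List[str]]
SENTENCE_END = {".", "?"}


def render(words: List[str], labels: List[str]) -> str:
    """Attach each punctuation label to its word and join with spaces."""
    return " ".join(word + label for word, label in zip(words, labels))


def stream_punctuate(chunks: Iterable[List[str]], tag_fn: Tagger,
                     min_lookahead: int = 2) -> Iterator[str]:
    """Emit punctuated text as word chunks arrive (illustrative only).

    The buffer acts as a decoding window whose length varies: it grows
    with every chunk and shrinks whenever a sentence end is committed.
    """
    buffer: List[str] = []
    for chunk in chunks:
        buffer.extend(chunk)
        labels = tag_fn(buffer)
        cut = None
        for i, label in enumerate(labels):
            words_after = len(buffer) - 1 - i
            # Commit a boundary only if enough right context has been seen.
            if label in SENTENCE_END and words_after >= min_lookahead:
                cut = i
        if cut is not None:
            yield render(buffer[:cut + 1], labels[:cut + 1])
            buffer = buffer[cut + 1:]  # the remainder waits for more context
    if buffer:
        # End of stream: flush whatever is left with its current labels.
        yield render(buffer, tag_fn(buffer))


def toy_tagger(words: List[str]) -> List[str]:
    """Stand-in tagger: ends a sentence before the word 'then'."""
    return ["." if i + 1 < len(words) and words[i + 1] == "then" else ""
            for i in range(len(words))]


chunks = [["send", "the", "report"],
          ["today", "then", "call"],
          ["the", "client", "now"]]
for segment in stream_punctuate(chunks, toy_tagger):
    print(segment)
```

Holding back the boundary until `min_lookahead` words have followed it trades a little latency for right context, which is the tension between real-time constraints and bi-directional context that the abstract describes.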

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam