Streaming Punctuation for Long-form Dictation with Transformers
Piyush Behre, Sharman Tan, Padma Varadharajan, Shuangyu Chang

TL;DR
This paper introduces a streaming punctuation method for long-form speech dictation using transformers, improving segmentation and punctuation accuracy in real-time ASR systems, and enhancing downstream machine translation quality.
Contribution
It presents a novel streaming approach with dynamic decoding windows for punctuation in ASR, addressing real-time constraints and improving segmentation and translation performance.
Findings
Segmentation F0.5-score improved by 13.9%.
BLEU score for downstream machine translation (MT) increased by 0.66 on average.
Effective long-context punctuation in real-time ASR.
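For readers unfamiliar with the metric, the F0.5-score is the general F-beta score with beta = 0.5, which weights precision more heavily than recall. That suits segmentation, where a spurious sentence break (over-segmentation) is usually more harmful than a missed one. The following is a minimal sketch of the standard formula; the function name and the precision/recall values in the usage note are illustrative assumptions, not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With made-up inputs, `f_beta(0.8, 0.5)` scores higher than `f_beta(0.5, 0.8)`, showing that under F0.5 a precise but conservative segmenter is rewarded over an aggressive one.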
Abstract
While speech recognition Word Error Rate (WER) has reached human parity for English, long-form dictation scenarios still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence tagging models are effective at capturing long bi-directional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average…
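The general idea of a dynamic decoding window can be sketched as follows: words are buffered as they arrive, a tagger re-punctuates the whole buffer, and text is finalized only up to the last sentence boundary that has some right context behind it; the remainder is carried into the next window. This is an illustrative sketch only and not the authors' implementation. `toy_tagger`, `StreamingPunctuator`, `STARTERS`, and `min_right_context` are hypothetical names, and the rule-based tagger merely stands in for a transformer sequence tagger.

```python
from typing import Callable, List

SENTENCE_END = {".", "?"}
STARTERS = {"then", "next"}  # assumption: toy cue words that begin a sentence


def toy_tagger(words: List[str]) -> List[str]:
    """Hypothetical stand-in for a tagger: one label per word.

    Emits "." after a word only if the *following* word is a cue word,
    so a boundary cannot be predicted without right context.
    """
    return [
        "." if i + 1 < len(words) and words[i + 1] in STARTERS else ""
        for i in range(len(words))
    ]


class StreamingPunctuator:
    def __init__(self, tagger: Callable[[List[str]], List[str]],
                 min_right_context: int = 1) -> None:
        self.tagger = tagger
        self.min_right_context = min_right_context
        self.buffer: List[str] = []  # words not yet finalized

    def push(self, words: List[str]) -> str:
        """Add newly recognized words; return text that is safe to finalize."""
        self.buffer.extend(words)
        labels = self.tagger(self.buffer)
        # Find the last boundary that still has enough words to its right.
        cut = -1
        for i in range(len(self.buffer) - self.min_right_context):
            if labels[i] in SENTENCE_END:
                cut = i
        if cut < 0:
            return ""  # no trustworthy boundary yet; keep buffering
        done = " ".join(w + l for w, l in zip(self.buffer[:cut + 1], labels))
        self.buffer = self.buffer[cut + 1:]  # next window starts after the boundary
        return done

    def flush(self) -> str:
        """End of stream: emit whatever remains in the buffer."""
        labels = self.tagger(self.buffer)
        text = " ".join(w + l for w, l in zip(self.buffer, labels))
        self.buffer = []
        return text
```

In this sketch the window grows until a boundary is found and shrinks when text is finalized, so a slow speaker's pause does not by itself force a sentence break. Capitalization and maximum-latency limits are omitted for brevity.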
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam
