Multimodal Punctuation Prediction with Contextual Dropout
Andrew Silva, Barry-John Theobald, Nicholas Apostoloff

TL;DR
This paper introduces a transformer-based multimodal approach for punctuation prediction in speech recognition, utilizing audio and text data, and proposes a contextual dropout technique to handle varying future context during testing.
Contribution
It presents a novel multimodal model combining text and audio for improved punctuation prediction and introduces a contextual dropout method for flexible context handling.
Findings
The transformer-based text model achieves an 8% improvement on the IWSLT 2012 TED Task over the previous state of the art.
The multimodal model outperforms the text-only approach by 8% on an internal dataset with paired audio and transcriptions.
Contextual dropout lets the model handle variable amounts of future context at test time.
Abstract
Automatic speech recognition (ASR) is widely used in consumer electronics. ASR greatly improves the utility and accessibility of technology, but its output is usually only a word sequence without punctuation, which can make user intent ambiguous. We first present a transformer-based approach for punctuation prediction that achieves an 8% improvement on the IWSLT 2012 TED Task, beating the previous state of the art [1]. We next describe our multimodal model that learns from both text and audio, which achieves an 8% improvement over the text-only algorithm on an internal dataset for which we have both audio and transcriptions. Finally, we present an approach to learning a model using contextual dropout that allows us to handle variable amounts of future context at test time.
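The abstract does not spell out how contextual dropout works, so the following is only a minimal sketch of one plausible reading: during training, the number of future tokens visible to the model is sampled at random for each example, so that any lookahead budget can be chosen at test time. The function names, the 0/1 mask representation, and the uniform sampling are all assumptions made for illustration, not the authors' implementation.

```python
import random


def fixed_future_mask(seq_len, position, n_future):
    """Visibility mask for predicting punctuation after `position`.

    Tokens up to `position + n_future` are visible (1); later tokens are
    hidden (0). Hypothetical helper, not taken from the paper.
    """
    last_visible = min(seq_len - 1, position + n_future)
    return [1 if i <= last_visible else 0 for i in range(seq_len)]


def sample_future_mask(seq_len, position, max_future, rng):
    """Training-time mask with a randomly sampled amount of future context.

    Assumption: lookahead is drawn uniformly from 0..max_future (inclusive).
    The paper may use a different sampling scheme.
    """
    n_future = rng.randint(0, max_future)
    return fixed_future_mask(seq_len, position, n_future)


if __name__ == "__main__":
    rng = random.Random(0)
    # Training: each call may expose a different number of future tokens.
    for _ in range(3):
        print(sample_future_mask(10, 3, 4, rng))
    # Test time: pick the lookahead that the latency budget allows.
    print(fixed_future_mask(10, 3, 1))
```

In this sketch the mask would be applied to the model's attention (or to the input tokens) so that hidden positions contribute nothing; a streaming deployment could then call `fixed_future_mask` with a small `n_future`, while an offline one could use a larger value with the same trained model.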
Taxonomy
Methods: Dropout
