Multimodal Punctuation Prediction with Contextual Dropout

Andrew Silva; Barry-John Theobald; Nicholas Apostoloff

arXiv:2102.11012 · cs.CL · February 23, 2021

TL;DR

This paper introduces a transformer-based approach to punctuation prediction for speech recognition output, extends it to a multimodal model that learns from both audio and text, and proposes a contextual dropout technique so the model can handle varying amounts of future context at test time.

Contribution

The paper presents a transformer-based punctuation model, a multimodal variant that combines text and audio for improved punctuation prediction, and a contextual dropout training method that lets a single model work with variable amounts of future context.

Findings

1. The transformer-based model achieves an 8% improvement on the IWSLT 2012 TED Task over the previous state of the art.

2. The multimodal model outperforms the text-only model by 8% on an internal dataset with paired audio and transcriptions.

3. Training with contextual dropout lets the model handle variable amounts of future context at test time.

Abstract

Automatic speech recognition (ASR) is widely used in consumer electronics. ASR greatly improves the utility and accessibility of technology, but its output is usually only a word sequence without punctuation. This can result in ambiguity when inferring user intent. We first present a transformer-based approach for punctuation prediction that achieves 8% improvement on the IWSLT 2012 TED Task, beating the previous state of the art [1]. We next describe our multimodal model that learns from both text and audio, which achieves 8% improvement over the text-only algorithm on an internal dataset for which we have both the audio and transcriptions. Finally, we present an approach to learning a model using contextual dropout that allows us to handle variable amounts of future context at test time.
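The abstract does not spell out how contextual dropout works, so the following is only a minimal sketch of one plausible reading: during training, the number of future tokens visible after the prediction position is sampled at random, and everything beyond that horizon is hidden. The function name `contextual_dropout`, the `MASK` token, and the `position` and `max_future` parameters are all hypothetical and are not taken from the paper.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a hidden future token


def contextual_dropout(tokens, position, max_future, rng=random):
    """Hide future context beyond a randomly sampled horizon.

    Assumption (not from the paper): the model predicts the punctuation
    that follows tokens[position] and may see up to max_future later tokens.
    """
    keep = rng.randint(0, max_future)  # number of future tokens left visible
    cutoff = position + keep           # last index that stays visible
    masked = [tok if i <= cutoff else MASK for i, tok in enumerate(tokens)]
    return masked, keep


if __name__ == "__main__":
    words = ["how", "are", "you", "today", "my", "friend"]
    masked, keep = contextual_dropout(
        words, position=2, max_future=3, rng=random.Random(0)
    )
    print(keep, masked)
```

Under this assumption, a model trained on such inputs sees examples with zero, one, or several future tokens, which is one way it could cope with whatever lookahead is available at test time. The actual method in the paper may differ, for example by masking attention rather than replacing tokens.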

Taxonomy

Methods: Dropout