Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

Monica Sunkara; Srikanth Ronanki; Dhanush Bekal; Sravan Bodapati; Katrin Kirchhoff

arXiv:2008.00702·eess.AS·August 4, 2020

TL;DR

This paper presents a multimodal semi-supervised learning framework for punctuation prediction in conversational speech, leveraging unlabelled audio and text data with attention-based fusion to improve accuracy and robustness.

Contribution

It introduces an attention-based multimodal fusion method combined with semi-supervised learning, outperforming traditional forced alignment approaches in punctuation prediction tasks.

Findings

01

Achieved 6-9% absolute F1 improvement on reference transcripts and 3-4% on ASR outputs over the baseline BLSTM model.

02

Data augmentation with N-best lists improved performance on ASR outputs by up to an additional 2-6%.

03

Training on just 1 hour of data yielded 9-18% absolute improvement over baseline.
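The findings above are stated as absolute F1 improvements, meaning a difference in F1 points rather than a relative change. A minimal sketch of that arithmetic follows, assuming a per-class F1 over word-level punctuation tags. The tag names (`O`, `COMMA`, `PERIOD`), the toy sequences, and the helper `punctuation_f1` are hypothetical; the paper's actual label set and averaging scheme are not given in this summary.

```python
def punctuation_f1(reference, predicted, label):
    """F1 for one punctuation label over aligned word-level tag sequences."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == label and p == label)
    fp = sum(1 for r, p in pairs if r != label and p == label)
    fn = sum(1 for r, p in pairs if r == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags: the punctuation (if any) that follows each word.
reference = ["O", "COMMA", "O", "PERIOD", "O", "COMMA", "O", "PERIOD"]
baseline = ["O", "O", "O", "PERIOD", "O", "COMMA", "COMMA", "O"]
improved = ["O", "COMMA", "O", "PERIOD", "O", "COMMA", "COMMA", "PERIOD"]

f1_base = punctuation_f1(reference, baseline, "COMMA")
f1_improved = punctuation_f1(reference, improved, "COMMA")

# Absolute improvement is the plain difference between the two F1 scores.
absolute_gain = f1_improved - f1_base
print(f"baseline={f1_base:.2f} improved={f1_improved:.2f} gain={absolute_gain:.2f}")
```

The toy numbers here only illustrate the calculation and are unrelated to the reported results.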

Abstract

In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features to word-level features and perform multimodal fusion of the resulting acoustic and lexical representations. As an alternative, we explore attention-based multimodal fusion and compare its performance with forced-alignment-based fusion. Experiments conducted on the Fisher corpus show that our proposed approach achieves ~6-9% and ~3-4% absolute improvement (F1 score) over the baseline BLSTM model on reference transcripts and ASR outputs respectively. We further improve the model's robustness to ASR errors by performing data augmentation with N-best lists, which achieves up to an additional ~2-6%…
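The two fusion strategies contrasted in the abstract can be sketched side by side. The code below is an illustration under stated assumptions, not the authors' model: it assumes word and frame vectors share one dimension, uses scaled dot-product attention (the abstract does not say which attention variant the paper uses), and stands in for forced alignment with hand-written frame spans. The names `attention_fuse`, `alignment_fuse`, and `spans` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(word_vecs, frame_vecs):
    """Each word attends over all acoustic frames; no alignment is needed."""
    dim = len(word_vecs[0])
    scale = math.sqrt(dim)
    fused = []
    for w in word_vecs:
        # Scaled dot-product score between this word and every frame.
        scores = [sum(wi * fi for wi, fi in zip(w, f)) / scale for f in frame_vecs]
        weights = softmax(scores)
        # Acoustic context: attention-weighted sum of frame vectors.
        context = [sum(a * f[d] for a, f in zip(weights, frame_vecs))
                   for d in range(dim)]
        fused.append(w + context)  # concatenate lexical and acoustic parts
    return fused


def alignment_fuse(word_vecs, frame_vecs, spans):
    """Forced-alignment style: mean-pool the frames in each word's [start, end)."""
    dim = len(frame_vecs[0])
    fused = []
    for w, (start, end) in zip(word_vecs, spans):
        frames = frame_vecs[start:end]
        pooled = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        fused.append(w + pooled)
    return fused


random.seed(0)
words = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]    # 3 words
frames = [[random.gauss(0, 1) for _ in range(4)] for _ in range(10)]  # 10 frames
spans = [(0, 3), (3, 7), (7, 10)]  # hypothetical aligner output, in frame indices

fused_attn = attention_fuse(words, frames)
fused_align = alignment_fuse(words, frames, spans)
print(len(fused_attn), len(fused_attn[0]))    # one fused vector per word
print(len(fused_align), len(fused_align[0]))
```

Both functions return one fused vector per word, so either could feed the same downstream tagger. The practical difference is that the attention variant needs no word-level time stamps, while the alignment variant depends on an external aligner to supply `spans`.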

Datasets

clarin-pl/2021-punctuation-restoration
dataset · 49 dl
