Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech
Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati, Katrin Kirchhoff

TL;DR
This paper presents a multimodal semi-supervised learning framework for punctuation prediction in conversational speech, leveraging unlabelled audio and text data with attention-based fusion to improve accuracy and robustness.
Contribution
It introduces an attention-based multimodal fusion method combined with semi-supervised learning, outperforming traditional forced alignment approaches in punctuation prediction tasks.
Findings
Achieved absolute F1 improvements of roughly 6-9% on reference transcripts and 3-4% on ASR outputs over the baseline BLSTM model.
Data augmentation with N-best lists improved performance on ASR outputs by up to a further 2-6%.
Training on just 1 hour of data yielded 9-18% absolute improvement over baseline.
Abstract
In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features into word-level features and perform multimodal fusion of the resulting acoustic and lexical representations. As an alternative, we explore attention-based multimodal fusion and compare its performance with forced-alignment-based fusion. Experiments conducted on the Fisher corpus show that our proposed approach achieves ~6-9% and ~3-4% absolute improvement (F1 score) over the baseline BLSTM model on reference transcripts and ASR outputs, respectively. We further improve the model's robustness to ASR errors by performing data augmentation with N-best lists, which achieves up to an additional ~2-6%…
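The contrast between the two fusion strategies named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `attention_fuse` and `alignment_fuse`, the toy vectors, and the word-to-frame `spans` are all hypothetical, and the sketch assumes word and frame vectors share the same dimensionality so that a plain scaled dot product can stand in for whatever learned attention the paper actually uses.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(word_vecs, frame_vecs):
    """Each word vector acts as a query over ALL acoustic frames.
    The attended acoustic summary is concatenated to the word vector.
    Assumption: no learned projections; scaled dot-product scores only."""
    dim = len(frame_vecs[0])
    scale = math.sqrt(dim)
    fused = []
    for w in word_vecs:
        scores = [sum(wi * fi for wi, fi in zip(w, f)) / scale
                  for f in frame_vecs]
        weights = softmax(scores)
        context = [sum(wt * f[d] for wt, f in zip(weights, frame_vecs))
                   for d in range(dim)]
        fused.append(list(w) + context)
    return fused

def alignment_fuse(word_vecs, frame_vecs, spans):
    """Forced-alignment-style fusion: average only the frames inside each
    word's [start, end) span, then concatenate to the word vector.
    Assumption: spans come from an external aligner (hypothetical here)."""
    fused = []
    for w, (start, end) in zip(word_vecs, spans):
        segment = frame_vecs[start:end]
        dim = len(segment[0])
        mean = [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
        fused.append(list(w) + mean)
    return fused

# Toy data: 2 words, 4 acoustic frames, 2-dimensional vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
spans = [(0, 2), (2, 4)]  # hypothetical word-to-frame alignment

att = attention_fuse(words, frames)
aligned = alignment_fuse(words, frames, spans)
```

The practical difference is that `alignment_fuse` needs word boundaries from an aligner, while `attention_fuse` lets each word weight the frames itself, so it requires no alignment step.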
