Unified Multimodal Punctuation Restoration Framework for Mixed-Modality Corpus
Yaoming Zhu, Liwei Wu, Shanbo Cheng, Mingxuan Wang

TL;DR
The paper introduces UniPunc, a unified multimodal framework that effectively punctuates mixed-modality transcriptions by jointly representing audio and text, outperforming existing models on real-world datasets.
Contribution
UniPunc is the first model to jointly represent audio and text in a shared space for punctuation restoration on mixed-modality data, so a single model can punctuate both sentences that have audio and sentences that do not.
Findings
Outperforms strong baselines (e.g. BERT, MuSe) by at least 0.8 overall F1 points
Achieves state-of-the-art results on real-world datasets
UniPunc's design enables existing models to punctuate a mixed-modality corpus
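For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute an overall F1 for punctuation restoration: micro-averaging over all punctuation labels and ignoring the "no punctuation" class. This convention, the label names (`O`, `COMMA`, `PERIOD`), and the function `overall_f1` are assumptions for illustration; this page does not state how the paper computes its score.

```python
def overall_f1(gold, pred, no_punct="O"):
    """Micro-averaged F1 over punctuation labels (assumed convention).

    gold, pred: equal-length lists with one label per token.
    Tokens labelled `no_punct` in both lists are ignored.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != no_punct and p == g:
            tp += 1  # correct punctuation mark
        else:
            if p != no_punct:
                fp += 1  # predicted a mark that is wrong or spurious
            if g != no_punct:
                fn += 1  # missed or mislabelled a gold mark
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical labels.
gold = ["O", "COMMA", "O", "PERIOD"]
pred = ["O", "COMMA", "COMMA", "O"]
print(overall_f1(gold, pred))  # 0.5
```

Under this convention, a gain of 0.8 corresponds to 0.008 on the 0 to 1 scale returned by the function.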
Abstract
The punctuation restoration task aims to correctly punctuate the output transcriptions of automatic speech recognition systems. Previous punctuation models, which either use text only or require the corresponding audio, tend to be limited in real-world scenarios, where unpunctuated sentences are a mixture of those with and without audio. This paper proposes a unified multimodal punctuation restoration framework, named UniPunc, to punctuate the mixed sentences with a single model. UniPunc jointly represents audio and non-audio samples in a shared latent space, based on which the model learns a hybrid representation and punctuates both kinds of samples. We validate the effectiveness of UniPunc on real-world datasets, where it outperforms various strong baselines (e.g. BERT, MuSe) by at least 0.8 overall F1 points, setting a new state of the art. Extensive experiments show that UniPunc's design…
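The idea of placing audio and non-audio samples in one latent space can be illustrated with a toy sketch. It is not the authors' implementation: the class `HybridEncoder`, the random projection matrices, the single placeholder vector used when audio is missing, and the single-head attention followed by a residual sum are all assumptions chosen to make the abstract's description concrete.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class HybridEncoder:
    """Toy encoder mapping text and optional audio into one shared space."""

    def __init__(self, text_dim, audio_dim, shared_dim, seed=0):
        rng = random.Random(seed)
        self.shared_dim = shared_dim
        # Untrained random projections stand in for learned parameters.
        self.W_text = random_matrix(shared_dim, text_dim, rng)
        self.W_audio = random_matrix(shared_dim, audio_dim, rng)
        # Assumption: one placeholder vector stands in for missing audio.
        self.no_audio = [rng.uniform(-0.5, 0.5) for _ in range(shared_dim)]

    def encode(self, text_tokens, audio_frames=None):
        tokens = [matvec(self.W_text, t) for t in text_tokens]
        if audio_frames:
            acoustic = [matvec(self.W_audio, a) for a in audio_frames]
        else:
            acoustic = [self.no_audio]  # text-only sample
        scale = math.sqrt(self.shared_dim)
        hybrid = []
        for q in tokens:
            # Scaled dot-product attention from one token over acoustic vectors.
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in acoustic]
            weights = softmax(scores)
            context = [
                sum(w * k[d] for w, k in zip(weights, acoustic))
                for d in range(self.shared_dim)
            ]
            # Residual sum yields the per-token hybrid representation.
            hybrid.append([qi + ci for qi, ci in zip(q, context)])
        return hybrid


enc = HybridEncoder(text_dim=4, audio_dim=3, shared_dim=5)
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
audio = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1], [0.0, 0.5, 0.5]]
with_audio = enc.encode(text, audio)  # sample that has audio
text_only = enc.encode(text)          # sample without audio
```

Both calls return one vector of the same size per token, so a single downstream punctuation classifier could consume either kind of sample.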
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dropout · Adam · Layer Normalization · Attention Dropout · Weight Decay
