Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction
Daria Soboleva, Ondrej Skopek, Márius Šajgalík, Victor Cărbune, Felix Weissenberger, Julia Proskurnia, Bogdan Prisacari, Daniel Valcarce, Justin Lu, Rohit Prabhavalkar, Balint Miklos

TL;DR
This paper introduces a multi-modal system for unspoken punctuation prediction that uses synthetic speech data to outperform models trained on real audio, enabling efficient on-device deployment.
Contribution
The study demonstrates that synthetic data from a prosody-aware TTS system can replace costly human audio for punctuation prediction, with a compact model suitable for on-device use.
Findings
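A model trained only on synthetic audio outperforms one trained on human recordings
The model keeps size small and latency low, making it suitable for on-device deployment
Hash-based embeddings of ASR text are combined with acoustic features as input to a quasi-recurrent neural network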
Abstract
We present a novel multi-modal unspoken punctuation prediction system for the English language that combines acoustic and text features. We demonstrate, for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low.
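To make the input representation concrete, here is a minimal sketch of how hash-based text embeddings might be joined with per-token acoustic features before they reach a recurrent model. It is an illustration under stated assumptions, not the authors' implementation: the bucket count, embedding size, MD5 hash, random embedding table, and the two acoustic values per token (a pause length and a pitch value) are all hypothetical choices, and the quasi-recurrent network itself is omitted.

```python
import hashlib
import random

NUM_BUCKETS = 1024  # hypothetical table size; fixed regardless of vocabulary
EMBED_DIM = 8       # hypothetical embedding width


def bucket_of(token: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a token to a table row with a stable hash (no vocabulary file needed)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def make_table(num_buckets: int = NUM_BUCKETS, dim: int = EMBED_DIM, seed: int = 0):
    """Stand-in for a learned embedding table: one small random vector per bucket."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_buckets)]


def fuse(tokens, acoustic, table):
    """Concatenate each token's hashed embedding with its acoustic feature vector."""
    if len(tokens) != len(acoustic):
        raise ValueError("need one acoustic feature vector per token")
    return [table[bucket_of(tok, len(table))] + list(feats)
            for tok, feats in zip(tokens, acoustic)]


# Hypothetical ASR output with made-up (pause seconds, pitch Hz) per token.
tokens = ["how", "are", "you"]
acoustic = [[0.05, 180.0], [0.02, 175.0], [0.40, 210.0]]

table = make_table()
fused = fuse(tokens, acoustic, table)  # one vector of EMBED_DIM + 2 values per token
```

Because the table size is fixed up front, memory does not grow with the vocabulary, which is one reason hashing suits small on-device models. The trade-off is that unrelated tokens can collide in the same bucket.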
