Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction

Daria Soboleva; Ondrej Skopek; Márius Šajgalík; Victor Cărbune; Felix Weissenberger; Julia Proskurnia; Bogdan Prisacari; Daniel Valcarce; Justin Lu; Rohit Prabhavalkar; Balint Miklos

arXiv:2010.10203 · cs.LG · February 12, 2021

TL;DR

This paper introduces a multi-modal system for unspoken punctuation prediction that uses synthetic speech data to outperform models trained on real audio, enabling efficient on-device deployment.

Contribution

The study demonstrates that synthetic audio generated by a prosody-aware TTS system can replace costly human recordings as training data for unspoken punctuation prediction, and that the resulting model is compact enough for on-device use.

Findings

01  A model trained only on synthetic audio outperforms one trained on human recordings

02  The model stays small and low-latency, which suits on-device deployment

03  Hash-based embeddings of ASR text output are combined with acoustic features as input to a quasi-recurrent neural network

Abstract

We present a novel multi-modal unspoken punctuation prediction system for the English language which combines acoustic and text features. We demonstrate for the first time that, by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use. This is achieved by leveraging hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low.
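
To make the input side of the abstract concrete, the sketch below shows one way hash-based token embeddings could be joined with per-token acoustic features before being fed to a sequence model. It is an illustration under stated assumptions, not the authors' implementation: the bucket count, embedding size, choice of MD5 as the hash, the randomly initialized table, and the two acoustic features (pause length and mean pitch) are all hypothetical, and the quasi-recurrent network itself is omitted.

```python
import hashlib
import random

NUM_BUCKETS = 1024  # hypothetical: size of the hashed vocabulary
EMBED_DIM = 8       # hypothetical: embedding width


def bucket_of(token, num_buckets=NUM_BUCKETS):
    """Map a token to a bucket index with a stable hash (no vocabulary file needed)."""
    digest = hashlib.md5(token.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def make_table(num_buckets=NUM_BUCKETS, dim=EMBED_DIM, seed=0):
    """Build a small embedding table; random values stand in for trained weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_buckets)]


def featurize(tokens, acoustic, table):
    """Concatenate each token's hashed embedding with its acoustic feature vector."""
    if len(tokens) != len(acoustic):
        raise ValueError("need exactly one acoustic feature vector per token")
    return [table[bucket_of(tok, len(table))] + list(feat)
            for tok, feat in zip(tokens, acoustic)]


# Hypothetical ASR output with (pause after token in seconds, mean pitch in Hz).
tokens = ["how", "are", "you"]
acoustic = [(0.05, 180.0), (0.02, 175.0), (0.40, 210.0)]
table = make_table()
features = featurize(tokens, acoustic, table)
```

Because the table size is fixed by the number of buckets rather than by the vocabulary, the memory footprint stays bounded regardless of how many distinct words the recognizer emits, at the cost of occasional collisions between unrelated tokens.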
