Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens

Mana Ihori; Naoki Makishima; Tomohiro Tanaka; Akihiko Takashima; Shota Orihashi; Ryo Masumura

arXiv:2106.12131·cs.CL·June 24, 2021

TL;DR

This paper introduces a zero-shot joint modeling approach for multiple spoken-text-style conversion tasks, using switching tokens to improve readability of speech transcriptions without needing matched datasets.

Contribution

The paper presents a method that uses switching tokens for zero-shot joint modeling of multiple conversion tasks, removing the need for matched datasets and avoiding the error propagation of cascaded modules.

Findings

1. Effective joint modeling of disfluency deletion and punctuation restoration.

2. Improved readability of speech transcriptions in experiments.

3. Reduced computational cost compared to cascading methods.

Abstract

In this paper, we propose a novel spoken-text-style conversion method that can simultaneously execute multiple style conversion modules such as punctuation restoration and disfluency deletion without preparing matched datasets. In practice, transcriptions generated by automatic speech recognition systems are not highly readable because they often include many disfluencies and do not include punctuation marks. To improve their readability, multiple spoken-text-style conversion modules that individually model a single conversion task are cascaded because matched datasets that simultaneously handle multiple conversion tasks are often unavailable. However, the cascading is unstable against the order of tasks because of the chain of conversion errors. Besides, the computation cost of the cascading must be higher than the single conversion. To execute multiple conversion tasks simultaneously…
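
As a rough illustration of the switching-token idea, the sketch below assumes that one on/off token per task is prepended to the input of a single conversion model, so that a dataset annotated for only one task can still be used for training. The token strings, the task names, and the `switching_tokens` and `build_input` helpers are hypothetical and are not taken from the paper; where the tokens are actually fed to the model may differ.

```python
from itertools import product

# Hypothetical task names; the abstract names these two conversion tasks.
TASKS = ("disfluency_deletion", "punctuation_restoration")


def switching_tokens(active):
    """Return one on/off token per task, always in the same task order."""
    unknown = set(active) - set(TASKS)
    if unknown:
        raise ValueError(f"unknown tasks: {sorted(unknown)}")
    return [f"<{task}:{'on' if task in active else 'off'}>" for task in TASKS]


def build_input(words, active):
    """Prefix the spoken-style word sequence with the switching tokens."""
    return switching_tokens(active) + list(words)


if __name__ == "__main__":
    spoken = "um i think it works".split()
    # Enumerate every on/off combination of the two tasks.
    for flags in product([False, True], repeat=len(TASKS)):
        active = {task for task, on in zip(TASKS, flags) if on}
        print(build_input(spoken, active))
```

Under this assumption, single-task datasets would supply only the combinations in which one token is on, and the combination with both tokens on would be requested at inference time without ever having been seen in training, which is what makes the joint conversion zero-shot.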

Taxonomy

Topics: Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications