Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens
Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

TL;DR
This paper introduces a zero-shot joint modeling approach for multiple spoken-text-style conversion tasks, using switching tokens to improve readability of speech transcriptions without needing matched datasets.
Contribution
The paper presents a novel method that uses switching tokens for zero-shot joint modeling of multiple conversion tasks, removing the need for matched datasets and avoiding the error propagation of cascaded modules.
Findings
Effective joint modeling of disfluency deletion and punctuation restoration.
Improved readability of speech transcriptions in experiments.
Reduced computational cost compared to cascading methods.
Abstract
In this paper, we propose a novel spoken-text-style conversion method that can simultaneously execute multiple style conversion modules such as punctuation restoration and disfluency deletion without preparing matched datasets. In practice, transcriptions generated by automatic speech recognition systems are not highly readable because they often include many disfluencies and do not include punctuation marks. To improve their readability, multiple spoken-text-style conversion modules that individually model a single conversion task are cascaded because matched datasets that simultaneously handle multiple conversion tasks are often unavailable. However, the cascading is unstable against the order of tasks because of the chain of conversion errors. Besides, the computation cost of the cascading must be higher than the single conversion. To execute multiple conversion tasks simultaneously…
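The switching-token idea can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation: the token spellings, the task names, the helper functions `switching_tokens` and `tag_examples`, and the toy sentence pairs are all assumptions made for illustration, and the summary above does not state where the tokens are fed into the model. The sketch only shows how single-task datasets could be tagged with per-task on/off tokens, and that the "all tasks on" combination requested at inference never appears in training, which is what makes the joint conversion zero-shot.

```python
# Hypothetical task names; the paper's actual token vocabulary is not given here.
TASKS = ("disfluency_deletion", "punctuation_restoration")


def switching_tokens(active_tasks):
    """Return one on/off token per task, in a fixed task order."""
    unknown = set(active_tasks) - set(TASKS)
    if unknown:
        raise ValueError(f"unknown tasks: {sorted(unknown)}")
    return tuple(
        f"[{task}:{'on' if task in active_tasks else 'off'}]" for task in TASKS
    )


def tag_examples(pairs, active_tasks):
    """Attach the switching tokens for `active_tasks` to each (source, target) pair."""
    tokens = switching_tokens(active_tasks)
    return [{"switch": tokens, "source": src, "target": tgt} for src, tgt in pairs]


# Toy single-task datasets (made-up examples, not from the paper).
disfluency_pairs = [("uh i think um we should go", "i think we should go")]
punctuation_pairs = [("i think we should go", "i think we should go.")]

# Each dataset only ever switches its own task on.
train = (
    tag_examples(disfluency_pairs, {"disfluency_deletion"})
    + tag_examples(punctuation_pairs, {"punctuation_restoration"})
)

# At inference, both tasks are switched on at once.
seen = {example["switch"] for example in train}
joint = switching_tokens(set(TASKS))

print("joint tokens:", joint)
print("joint combination seen in training:", joint in seen)
```

Because every training example carries a full set of on/off tokens, one model can be trained on the union of the single-task datasets and then asked for the combined conversion in a single pass, rather than running one module per task in sequence.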
Taxonomy
Topics: Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
