Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion

Ding Ma; Lester Phillip Violeta; Kazuhiro Kobayashi; Tomoki Toda

arXiv:2210.10314 · cs.SD · October 20, 2022 · 1 citation

TL;DR

This paper proposes a two-stage training approach for seq2seq voice conversion to enhance Japanese electrolaryngeal speech, effectively improving performance with limited parallel data by combining synthetic and real datasets.

Contribution

The study introduces a novel two-stage training method that leverages synthetic data and fine-tuning on real data to improve EL speech conversion with limited data.

Findings

01. Performance improved with synthetic data integration
02. Two-stage training outperforms single-stage methods
03. Effective with small parallel datasets

Abstract

Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential than conventional VC models for converting electrolaryngeal (EL) speech to normal speech (EL2SP). However, seq2seq-based EL2SP requires a sufficiently large amount of parallel data for model training, and its performance degrades significantly when the training data is insufficient. To address this issue, we propose a novel two-stage strategy to optimize the performance of seq2seq-based EL2SP when only a small parallel dataset is available. In contrast to previous studies, which rely on high-quality data augmentation, we first combine a large amount of imperfect synthetic parallel data of EL and normal speech with the original dataset for VC training. Then, a second training stage is conducted with the original parallel dataset only. The results show…
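
The two-stage schedule described in the abstract can be sketched as a small training driver. This is only an illustration of the data schedule (synthetic plus original pairs first, original pairs only afterwards), not the authors' implementation: the pair representation, the `train_step` callback, the epoch counts, and all utterance IDs are hypothetical, and the actual seq2seq VC model, optimizer, and hyperparameters are not given in this excerpt.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation of one parallel pair:
# (EL utterance id, normal-speech utterance id).
Pair = Tuple[str, str]


def two_stage_train(
    real_pairs: Sequence[Pair],
    synthetic_pairs: Sequence[Pair],
    train_step: Callable[[Pair, int], None],
    stage1_epochs: int = 2,   # assumed value, for illustration only
    stage2_epochs: int = 1,   # assumed value, for illustration only
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Stage 1 trains on synthetic + real pairs; stage 2 on real pairs only.

    Returns one (stage, pairs_in_epoch) entry per epoch.
    """
    rng = random.Random(seed)
    stages = [
        (1, list(synthetic_pairs) + list(real_pairs), stage1_epochs),
        (2, list(real_pairs), stage2_epochs),
    ]
    log: List[Tuple[int, int]] = []
    for stage, data, epochs in stages:
        for _ in range(epochs):
            rng.shuffle(data)           # reshuffle the copied list each epoch
            for pair in data:
                train_step(pair, stage)  # the model update would happen here
            log.append((stage, len(data)))
    return log


# Toy usage with placeholder IDs; the callback only records what it was given.
real = [("el_001", "sp_001"), ("el_002", "sp_002")]
synthetic = [("synth_el_%03d" % i, "synth_sp_%03d" % i) for i in range(5)]
seen: List[Tuple[int, Pair]] = []
log = two_stage_train(real, synthetic, lambda pair, stage: seen.append((stage, pair)))
```

In a real system, `train_step` would wrap one optimizer update of the VC model, and stage 2 would typically continue from the stage 1 weights, which this driver does implicitly by reusing the same callback across both stages.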

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders