AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion

Wen-Chin Huang; Kazuhiro Kobayashi; Tomoki Toda

arXiv:2309.07598·cs.SD·September 18, 2023

TL;DR

This paper introduces AAS-VC, a non-autoregressive voice conversion model that uses automatic alignment search to remove the dependency on durations extracted from an external autoregressive model. The authors report that it generalizes better to small training sets, including one of only 5 minutes.

Contribution

The paper proposes AAS-VC, a novel non-AR seq2seq voice conversion model using automatic alignment search to enhance generalization on limited data.

Findings

01. AAS-VC generalizes better to small training datasets than non-AR seq2seq VC models that depend on external durations.

02. Removing the dependency on externally extracted durations improves generalization.

03. AAS-VC remains effective with a training dataset of only 5 minutes.

Abstract

Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) are attractive for their ability to effectively model the temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits their generalization ability to smaller training datasets. In this paper, we first demonstrate the above-mentioned problem by varying the training data size. Then, we present AAS-VC, a non-AR seq2seq VC model based on automatic alignment search (AAS), which removes the dependency on external durations and serves as a proper inductive bias to provide the required generalization ability for small datasets. Experimental results show that AAS-VC can generalize better to a training dataset of only 5 minutes. We…
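The abstract does not spell out how the alignment search works, so the following is only an illustrative sketch of one common way to obtain durations without an external model: a Viterbi-style monotonic alignment search over a score matrix between source steps and target frames. The function name `monotonic_alignment_search`, the `log_p` score matrix, and the constraint that every source step covers at least one target frame are assumptions made for this example; they are not taken from the paper and may differ from what AAS-VC actually does.

```python
import numpy as np


def monotonic_alignment_search(log_p: np.ndarray) -> np.ndarray:
    """Hypothetical helper: find the best monotonic, non-skipping alignment.

    log_p[i, j] is an assumed score for pairing source step i with target
    frame j. Returns one integer duration per source step (each >= 1),
    summing to the number of target frames.
    """
    n_src, n_tgt = log_p.shape
    if n_tgt < n_src:
        raise ValueError("need at least one target frame per source step")

    # q[i, j]: best cumulative score of a path ending at (i, j).
    q = np.full((n_src, n_tgt), -np.inf)
    q[0, 0] = log_p[0, 0]
    for j in range(1, n_tgt):
        for i in range(min(j + 1, n_src)):
            stay = q[i, j - 1]                            # same source step
            move = q[i - 1, j - 1] if i > 0 else -np.inf  # next source step
            q[i, j] = max(stay, move) + log_p[i, j]

    # Backtrack from the last source step and last target frame,
    # counting how many target frames each source step covers.
    durations = np.zeros(n_src, dtype=int)
    i = n_src - 1
    for j in range(n_tgt - 1, -1, -1):
        durations[i] += 1
        if j > 0 and i > 0 and (i == j or q[i - 1, j - 1] > q[i, j - 1]):
            i -= 1
    return durations
```

In a setup like this, the returned durations could stand in for the externally extracted ones when training a duration predictor, which is the kind of dependency the abstract says AAS-VC removes.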

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing