AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion
Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

TL;DR
This paper introduces AAS-VC, a non-autoregressive voice conversion model that improves generalization on small datasets by eliminating reliance on external duration models, demonstrating superior performance with only 5 minutes of training data.
Contribution
The paper proposes AAS-VC, a novel non-AR seq2seq voice conversion model using automatic alignment search to enhance generalization on limited data.
Findings
AAS-VC generalizes better on small training datasets than non-AR seq2seq VC models that depend on externally extracted durations.
Eliminating external duration dependency improves generalization.
AAS-VC achieves effective voice conversion with only 5 minutes of training data.
Abstract
Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) are attractive for their ability to effectively model temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits their ability to generalize to smaller training datasets. In this paper, we first demonstrate this problem by varying the training data size. Then, we present AAS-VC, a non-AR seq2seq VC model based on automatic alignment search (AAS), which removes the dependency on external durations and serves as a proper inductive bias to provide the required generalization ability for small datasets. Experimental results show that AAS-VC can generalize better to a training dataset of only 5 minutes. We…
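The abstract does not spell out how the alignment search works, so the following is only a rough illustration of the general idea, not the authors' method. It assumes a monotonic, Viterbi-style dynamic program of the kind commonly used for alignment learning in non-AR speech synthesis: given a score matrix between target frames and source positions, it finds the best monotonic path and reads durations off that path, with no external duration model involved. The function name `monotonic_alignment_search` and the `score` matrix are hypothetical and do not come from the paper.

```python
import numpy as np


def monotonic_alignment_search(score):
    """Illustrative sketch (assumption, not the paper's implementation).

    score: array of shape (T, S); score[t, s] is how well target frame t
           matches source position s (for example, a log-likelihood).
    Returns (path, durations): path[t] is the source position assigned to
    target frame t, and durations[s] counts the frames assigned to s.
    """
    score = np.asarray(score, dtype=float)
    T, S = score.shape
    if T < S:
        raise ValueError("need at least as many target frames as source positions")

    # Q[t, s] = best cumulative score of a monotonic path ending at (t, s).
    Q = np.full((T, S), -np.inf)
    Q[0, 0] = score[0, 0]
    for t in range(1, T):
        for s in range(min(t + 1, S)):
            stay = Q[t - 1, s]                                # remain on s
            advance = Q[t - 1, s - 1] if s > 0 else -np.inf   # move from s-1 to s
            Q[t, s] = score[t, s] + max(stay, advance)

    # Backtrack from the last frame / last source position.
    path = np.zeros(T, dtype=int)
    s = S - 1
    for t in range(T - 1, -1, -1):
        path[t] = s
        if t > 0 and s > 0 and Q[t - 1, s - 1] >= Q[t - 1, s]:
            s -= 1

    durations = np.bincount(path, minlength=S)
    return path, durations


if __name__ == "__main__":
    # Toy scores: the first two frames favor position 0, the last two favor position 1.
    toy = np.array([[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]])
    path, durations = monotonic_alignment_search(toy)
    print(path, durations)
```

Under these assumptions, the durations always sum to the number of target frames, so they could stand in for the externally extracted ground truth durations that the abstract identifies as the bottleneck.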
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
