Disentangleing Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion

Xintao Zhao; Feng Liu; Changhe Song; Zhiyong Wu; Shiyin Kang; Deyi Tuo; Helen Meng

arXiv:2203.12813 · cs.SD · March 25, 2022 · 1 citation

Open Access

TL;DR

This paper combines bottleneck features from two differently trained ASR models to disentangle content and prosody information for voice conversion, yielding converted speech that sounds more natural and is closer in timbre to the target speaker.

Contribution

The paper proposes a hybrid bottleneck feature extraction method that draws on ASR models trained with Connectionist Temporal Classification (CTC) loss and with cross-entropy (CE) loss to improve voice conversion performance.

Findings

1. Higher similarity and naturalness in converted speech.
2. Effective disentanglement of content and prosody information.
3. Insights into the information contained in different BNFs.

Abstract

Voice conversion (VC) with non-parallel data has achieved considerable breakthroughs recently through the introduction of bottleneck features (BNFs) extracted by an automatic speech recognition (ASR) model. However, the selection of BNFs has a significant impact on VC results. For example, when BNFs are extracted from an ASR model trained with cross-entropy loss (CE-BNFs) and fed into a neural network to train a VC system, the timbre similarity of the converted speech is significantly degraded. If BNFs are extracted from an ASR model trained with Connectionist Temporal Classification loss (CTC-BNFs), the naturalness of the converted speech may decrease. This phenomenon is caused by the difference in the information contained in the two kinds of BNFs. In this paper, we propose an any-to-one VC method using hybrid bottleneck features extracted from CTC-BNFs and CE-BNFs so that they complement each other's advantages. Gradient reversal layer and instance…
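
The abstract is cut off just as it names instance normalization, so how the paper applies it is not visible here. As a generic illustration only, and not the authors' implementation, the sketch below shows what instance normalization does to a feature matrix: each channel of a single utterance is shifted to zero mean and scaled to unit variance over time. The function name `instance_norm`, the `[channels][frames]` layout, the `eps` value, and the toy numbers are all assumptions made for this example.

```python
import math


def instance_norm(features, eps=1e-5):
    """Normalize each channel of one utterance over its time axis.

    features: list of channels, each a list of per-frame floats,
              i.e. a [channels][frames] layout (assumed for this sketch).
    eps:      small constant that keeps the division stable.
    """
    normalized = []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([(x - mean) * scale for x in channel])
    return normalized


# Toy stand-in for a BNF matrix: 2 channels, 4 frames (made-up values).
bnf = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 12.0, 12.0]]
out = instance_norm(bnf)
```

After the call, every channel in `out` has approximately zero mean and unit variance, so utterance-level offsets and scales are removed while the frame-to-frame pattern within each channel is kept.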

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Instance Normalization · HiFi-GAN