SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

Yuki Saito; Takuto Igarashi; Kentaro Seki; Shinnosuke Takamichi; Ryuichi Yamamoto; Kentaro Tachibana; Hiroshi Saruwatari

arXiv:2406.07254 · cs.SD · June 12, 2024

TL;DR

This paper introduces SRC4VC, a smartphone-recorded speech corpus for voice conversion benchmarking, highlighting the impact of recording quality mismatch on VC performance and the effectiveness of speech enhancement.

Contribution

The paper creates SRC4VC, a corpus of low-quality speech recorded on smartphones and annotated with speaker-wise recording-quality scores and utterance-wise perceived emotion labels, and uses it to benchmark VC performance under real-world recording conditions.

Findings

1. Recording quality mismatch degrades VC performance

2. Speech enhancement improves VC results with low-quality input

3. SRC4VC enables realistic VC benchmarking

Abstract

We present SRC4VC, a new corpus containing 11 hours of speech recorded on smartphones by 100 Japanese speakers. Although high-quality multi-speaker corpora can advance voice conversion (VC) technologies, they are not always suitable for testing VC when a low-quality speech recording is given as the input. To this end, we first asked 100 crowdworkers to record their voice samples using smartphones. Then, we annotated the recorded samples with speaker-wise recording-quality scores and utterance-wise perceived emotion labels. We also benchmark SRC4VC on any-to-any VC, in which we trained a multi-speaker VC model on high-quality speech and used the SRC4VC speakers' voice samples as the source in VC. The results show that the recording quality mismatch between the training and evaluation data significantly degrades the VC performance, which can be improved by applying speech enhancement to the…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing