Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices

Abner Hernandez, Paula Andrea Pérez-Toro, Juan Camilo Vásquez-Correa, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

arXiv:2204.01677 · cs.CL · April 5, 2022 · 1 citation

TL;DR

This study explores voice conversion built on self-supervised speech representations as a way to anonymize speech data while preserving both speech recognition accuracy and the speech features relevant to health diagnostics.

Contribution

It demonstrates that voice conversion models trained on self-supervised speech representations can effectively anonymize voices with minimal impact on speech recognition, while retaining key speech characteristics for clinical analysis.

Findings

01

Converted voices keep word error rates within 1% of those of the original voices

02

Speaker verification degrades sharply after anonymization: equal error rate rises from 1.52% to 46.24% on LibriSpeech and from 3.75% to 45.84% on VCTK

03

Speech features relevant to health diagnostics (articulation, prosody, phonation and phonology) can still be extracted from anonymized dysarthric speech
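
Finding 01 compares the word error rate (WER) a speech recognizer achieves on original versus converted speech. As a minimal sketch, assuming the standard definition (word-level Levenshtein distance divided by the number of reference words), WER can be computed as below. The recognizer, text normalization and transcripts used in the paper are not described on this page, so the example strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match if cost == 0)
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For instance, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` counts one substitution over six reference words, roughly 0.167. Running the same recognizer on original and converted audio and comparing the two WERs gives the kind of gap Finding 01 refers to.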

Abstract

Collecting speech data is an important step in training speech recognition systems and other speech-based machine learning models. However, privacy protection is a growing concern that must be addressed. The current study investigates voice conversion as a method for anonymizing voices. In particular, we train several voice conversion models using self-supervised speech representations, including wav2vec 2.0, HuBERT and UniSpeech. Converted voices retain a low word error rate, within 1% of that of the original voice. Equal error rate increases from 1.52% to 46.24% on the LibriSpeech test set and from 3.75% to 45.84% on speakers from the VCTK corpus, which signifies degraded speaker verification performance. Lastly, we conduct experiments on dysarthric speech data to show that speech features relevant to articulation, prosody, phonation and phonology can be extracted…
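
The equal error rate (EER) is the operating point at which a speaker verifier's false acceptance rate equals its false rejection rate. The sketch below assumes similarity scores where higher means "same speaker" (label 1 for target trials, 0 for non-target trials), sweeps the observed scores as decision thresholds, and returns the mean of the two rates where they are closest. This is a simple discrete approximation; many toolkits interpolate the ROC curve instead, and the verification system used in the paper is not described on this page. A verifier that cannot tell speakers apart sits near 50% EER, so post-conversion values of 46.24% and 45.84% are close to that chance level.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from verification scores (higher = more likely same speaker)."""
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    best_gap = float("inf")
    eer = 0.0
    # Each observed score is a candidate threshold; +inf adds the "reject all" point.
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in nontargets) / len(nontargets)  # impostors accepted
        frr = sum(s < t for s in targets) / len(targets)         # genuine speakers rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With perfectly separated scores, `equal_error_rate([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])` returns 0.0. If every trial receives the same score, as with a verifier that has lost all speaker information, `equal_error_rate([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])` returns 0.5.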

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders