End-to-end speech recognition modeling from de-identified data
Martin Flechl, Shou-Chun Yin, Junho Park, Peter Skala

TL;DR
This paper presents a two-step method that partially recovers the speech recognition performance lost to data de-identification: PII is replaced with artificial audio/label pairs, so that models can still learn to recognize names, dates, and locations without seeing the original private content.
Contribution
The authors introduce a novel approach combining artificial audio generation and data augmentation to mitigate performance loss from de-identification in speech recognition models.
Findings
Recovered up to 90% of performance degradation for PII recognition.
Maintained strong diarization performance despite data modifications.
Effective across different PII categories in medical speech data.
Abstract
De-identification of data used for automatic speech recognition modeling is a critical component in protecting privacy, especially in the medical domain. However, simply removing all personally identifiable information (PII) from end-to-end model training data leads to a significant performance degradation, in particular for the recognition of names, dates, locations, and words from similar categories. We propose and evaluate a two-step method for partially recovering this loss. First, PII is identified, and each occurrence is replaced with a random word sequence of the same category. Then, corresponding audio is produced via text-to-speech or by splicing together matching audio fragments extracted from the corpus. These artificial audio/label pairs, together with speaker turns from the original data without PII, are used to train models. We evaluate the performance of this method on…
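The first step described in the abstract (replacing each identified PII occurrence with a random word sequence of the same category) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SURROGATES` table, the `replace_pii` function, and the `(start, end, category)` span format are all hypothetical and chosen only for illustration. PII identification is assumed to have already happened upstream, and the second step (producing matching audio via text-to-speech or splicing) is not covered.

```python
import random

# Hypothetical surrogate inventories, one list per PII category.
# The paper's actual category set and word lists are not given here.
SURROGATES = {
    "NAME": ["alice jones", "omar haddad", "mei lin"],
    "DATE": ["march third", "the twelfth of june"],
    "LOCATION": ["springfield", "lake forest"],
}


def replace_pii(tokens, spans, rng):
    """Replace each tagged PII span with a random same-category word sequence.

    tokens: list of words in one speaker turn
    spans:  non-overlapping (start, end, category) tuples, end exclusive
    rng:    a random.Random instance, passed in for reproducibility

    Returns the new token list and the spans of the inserted surrogates,
    which a later step could use to locate where artificial audio is needed.
    """
    out, new_spans, cursor = [], [], 0
    for start, end, category in sorted(spans):
        # Copy the non-PII words that precede this span unchanged.
        out.extend(tokens[cursor:start])
        # Draw a surrogate of the same category; its length may differ.
        surrogate = rng.choice(SURROGATES[category]).split()
        new_spans.append((len(out), len(out) + len(surrogate), category))
        out.extend(surrogate)
        cursor = end
    # Copy whatever follows the last PII span.
    out.extend(tokens[cursor:])
    return out, new_spans
```

For example, calling `replace_pii("patient john smith was seen on may fifth".split(), [(1, 3, "NAME"), (6, 8, "DATE")], random.Random(0))` returns a transcript in which the name and date are swapped for random entries from the corresponding lists, while the surrounding words are left untouched. Returning the new span positions is a design choice in this sketch: the surrogate may have a different number of words than the original, so downstream audio generation needs the updated offsets.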
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling
