End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network

Abhishek Niranjan; Mukesh Sharma; Sai Bharath Chandra Gutha; M Ali; Basha Shaik

arXiv:2004.09347·eess.AS·April 6, 2021·6 cites

Open Access

TL;DR

This paper presents a novel end-to-end transformer-based method for converting whispered speech into natural speech, utilizing both parallel and non-parallel data, and introduces a new formant divergence metric for spectral shape comparison.

Contribution

The paper introduces an enhanced transformer architecture with auxiliary decoders for whisper-to-speech conversion, proposes a new method for measuring spectral shape, and demonstrates effectiveness on the open-source wTIMIT and CHAINS datasets.

Findings

01. Reduced word error rate in ASR evaluations

02. Achieved high BLEU scores for speech quality

03. Formant distribution similarity to ground truth

Abstract

Machine recognition of atypical speech, such as whispered speech, is a challenging task. We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach by proposing an enhanced transformer architecture that uses both parallel and non-parallel data. We investigate different features, such as Mel-frequency cepstral coefficients and smoothed spectral features. The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation. Further, we investigate the effectiveness of an embedded auxiliary decoder used after N encoder sub-layers, trained with a frame-level objective function for identifying source phoneme labels. We show results on the open-source wTIMIT and CHAINS datasets by measuring word error rate using end-to-end ASR, as well as BLEU scores for the generated speech. Alternatively, we also propose a novel method to measure…
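The word error rate mentioned in the abstract is conventionally defined as the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation pipeline; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: transcripts are tokenized by whitespace; a real evaluation
    would usually also normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition a lower value indicates that the ASR transcript of the converted speech is closer to the reference text, and the value can exceed 1.0 when the hypothesis contains many inserted words.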

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax