End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network
Abhishek Niranjan, Mukesh Sharma, Sai Bharath Chandra Gutha, M Ali, Basha Shaik

TL;DR
This paper presents a novel end-to-end transformer-based method for converting whispered speech into natural speech, utilizing both parallel and non-parallel data, and introduces a new formant divergence metric for spectral shape comparison.
Contribution
It introduces an enhanced transformer architecture with auxiliary decoders for whisper-to-speech conversion and a new method for measuring spectral shape, and it demonstrates the effectiveness of both on open datasets.
Findings
Reduced word error rate in ASR evaluations
Achieved high BLEU scores for speech quality
Formant distribution similarity to ground truth
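For readers unfamiliar with the first metric, word error rate is conventionally the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization by whitespace is an assumption,
    and real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value means the ASR system recovered more of the reference words from the converted speech.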
Abstract
Machine recognition of atypical speech, such as whispered speech, is a challenging task. We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach by proposing an enhanced transformer architecture, which uses both parallel and non-parallel data. We investigate different features, such as Mel-frequency cepstral coefficients and smoothed spectral features. The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation. Further, we also investigate the effectiveness of an embedded auxiliary decoder used after N encoder sub-layers, trained with a frame-level objective function for identifying source phoneme labels. We show results on the open-source wTIMIT and CHAINS datasets by measuring word error rate using end-to-end ASR and also BLEU scores for the generated speech. Alternatively, we also propose a novel method to measure…
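The abstract describes two training signals: a feature-to-feature objective for the main network and a frame-level phoneme classification objective for the auxiliary decoder. One common way to combine such signals is a weighted sum, sketched below in plain Python. This is an assumption-laden illustration, not the paper's formulation: the use of mean absolute error for the feature loss, the weighted-sum combination, the `aux_weight` value, and all function names are hypothetical.

```python
import math


def l1_loss(predicted, target):
    """Mean absolute error over all frames and feature dimensions.

    The choice of L1 is an assumption; the abstract does not name the loss.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(predicted, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


def frame_cross_entropy(logits, labels):
    """Mean cross-entropy with one phoneme label per frame."""
    total = 0.0
    for frame_logits, label in zip(logits, labels):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_norm - frame_logits[label]
    return total / len(labels)


def joint_loss(predicted, target, phoneme_logits, phoneme_labels, aux_weight=0.5):
    """Feature loss plus a weighted auxiliary phoneme loss.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    feature_term = l1_loss(predicted, target)
    auxiliary_term = frame_cross_entropy(phoneme_logits, phoneme_labels)
    return feature_term + aux_weight * auxiliary_term


if __name__ == "__main__":
    predicted = [[1.0, 2.0]]          # one frame, two feature dimensions
    target = [[1.0, 4.0]]
    phoneme_logits = [[0.0, 0.0]]     # one frame, two phoneme classes
    phoneme_labels = [0]
    print(joint_loss(predicted, target, phoneme_logits, phoneme_labels))
```

In a setup like this, the auxiliary term would push intermediate encoder representations to carry phonetic information, while the main term drives the whisper-to-speech feature mapping.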
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
