ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy
Ya-Tse Wu, Chi-Chun Lee

TL;DR
This paper explores how emotion and speech synthesis strategies influence ASR accuracy, proposing targeted data augmentation methods that improve recognition of emotional speech without harming performance on neutral speech.
Contribution
It introduces two novel generative strategies for fine-tuning ASR models using emotion-aware synthetic speech, leading to improved performance on emotional datasets.
Findings
Consistent WER improvements on emotional speech datasets.
No degradation on clean LibriSpeech utterances.
Combined strategies yield the strongest gains for expressive speech.
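The findings above are reported as word error rate (WER). As a minimal sketch of how WER and its error types can be counted, the function below aligns a reference and a hypothesis by word-level edit distance and tallies substitutions, deletions, and insertions. The function name `wer_breakdown` and the lowercase, whitespace-based tokenization are assumptions made for illustration; the paper's own scoring setup is not described on this page.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word-level edit distance with per-type error counts (illustrative sketch)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (total_errors, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # match, no cost
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Keep the cheapest path; ties prefer the diagonal move.
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])

    total, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "wer": total / len(ref),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }
```

Summing the per-type counts over a test set would show which error type dominates for a given model.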
Abstract
This work investigates how emotional speech and generative strategies affect ASR performance. We analyze speech synthesized from three emotional TTS models and find that substitution errors dominate, with emotional expressiveness varying across models. Based on these insights, we introduce two generative strategies for constructing fine-tuning subsets: one based on transcription correctness and another based on emotional salience. Results show consistent WER improvements on real emotional datasets without noticeable degradation on clean LibriSpeech utterances. The combined strategy achieves the strongest gains, particularly for expressive speech. These findings highlight the importance of targeted augmentation for building emotion-aware ASR systems.
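The subset construction described in the abstract can be sketched as a filtering step over a pool of synthetic utterances. This is only an illustration under stated assumptions, not the authors' method: the abstract does not give thresholds, does not say whether correctly or incorrectly transcribed utterances are kept, and does not say how the two strategies are combined. Here each utterance is assumed to carry a precomputed `wer` (a baseline ASR transcript scored against the TTS input text) and a hypothetical `salience` score in [0, 1], and the combined strategy is assumed to be the intersection of the two filters. All names and default values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthUtterance:
    utt_id: str
    wer: float       # assumed: baseline ASR WER against the TTS input text
    salience: float  # assumed: emotion-classifier confidence in [0, 1]


def select_by_correctness(utts, min_wer=0.0, max_wer=1.0):
    """Keep utterances whose WER lies in [min_wer, max_wer].

    The range is left configurable because the source does not state
    which side of the correctness criterion is retained.
    """
    return [u for u in utts if min_wer <= u.wer <= max_wer]


def select_by_salience(utts, min_salience=0.5):
    """Keep utterances whose salience score reaches the threshold."""
    return [u for u in utts if u.salience >= min_salience]


def select_combined(utts, min_wer=0.0, max_wer=1.0, min_salience=0.5):
    """Assumed combination: an utterance must pass both filters."""
    return select_by_salience(
        select_by_correctness(utts, min_wer, max_wer), min_salience
    )
```

For example, `select_combined(pool, max_wer=0.1, min_salience=0.7)` would keep only utterances that the baseline recognizes almost perfectly and that also score as clearly emotional; a different choice of range or a union of the two filters would be an equally plausible reading of the abstract.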
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Topic Modeling
