Importance of Different Temporal Modulations of Speech: A Tale of Two Perspectives

Samik Sadhu; Hynek Hermansky

arXiv:2204.00065·eess.AS·March 24, 2023

TL;DR

This paper investigates how important different temporal modulations of speech are for speech recognition, by analyzing both their phonetic information content and the preferences of an ASR system. It finds that slow modulations around 3-6 Hz are the most critical, and that incorporating this knowledge into ASR reduces the dependence on training data.

Contribution

The paper introduces a dual-perspective analysis of speech modulations, combining an information-theoretic approach with a data-driven one to identify the modulation frequencies that matter most for ASR performance.

Findings

01

Phonetic information in speech is concentrated in slow modulations, with mutual information peaking around 3-6 Hz.

02

Weights learned over the modulation spectrum for an end-to-end ASR task favor the same slow modulation range.

03

Incorporating this knowledge into an ASR system reduces its dependence on training data.
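
As a rough illustration of what "slow modulations around 3-6 Hz" means in practice, the sketch below measures how much of the modulation energy of a single temporal trajectory (for example, the log-energy of one frequency band over time) falls inside that range. The function name, the 100 Hz frame rate, and the plain FFT-based estimate are assumptions made for this example, not the procedure used in the paper.

```python
import numpy as np

def modulation_band_fraction(trajectory, frame_rate_hz=100.0, band_hz=(3.0, 6.0)):
    """Fraction of modulation energy of a temporal trajectory inside band_hz.

    trajectory    : 1-D sequence of per-frame values (hypothetical input,
                    e.g. log-energy of one spectral band).
    frame_rate_hz : frames per second; 100.0 assumes a 10 ms hop.
    band_hz       : (low, high) modulation frequencies in Hz, inclusive.
    """
    x = np.asarray(trajectory, dtype=float)
    x = x - x.mean()  # drop the DC (0 Hz) component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / frame_rate_hz)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    total = power.sum()
    return float(power[in_band].sum() / total) if total > 0 else 0.0

# Two-second synthetic trajectories sampled at 100 frames per second.
t = np.arange(200) / 100.0
slow = np.sin(2 * np.pi * 4.0 * t)    # 4 Hz modulation, inside 3-6 Hz
fast = np.sin(2 * np.pi * 20.0 * t)   # 20 Hz modulation, outside 3-6 Hz
slow_fraction = modulation_band_fraction(slow)
fast_fraction = modulation_band_fraction(fast)
```

Here the 4 Hz trajectory places essentially all of its energy in the band and the 20 Hz trajectory places essentially none, which is the kind of contrast the findings describe for real speech.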

Abstract

How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. First, we quantify the amount of phonetic information in the modulation spectrum of speech by computing the mutual information between temporal modulations and frame-wise phoneme labels. From the second perspective, we ask which speech modulations an Automatic Speech Recognition (ASR) system prefers for its operation: data-driven weights are learned over the modulation spectrum and optimized for an end-to-end ASR task. Both methods agree that speech information is mostly contained in slow modulations. Maximum mutual information occurs around 3-6 Hz, which is also the range of modulations most preferred by the ASR system. In addition, we show that the incorporation of this knowledge into ASRs significantly reduces…
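
The first perspective rests on estimating mutual information between a continuous feature and discrete frame-wise labels. A minimal sketch of one common way to do this, a histogram (plug-in) estimate, is shown below. The estimator, the function name, and the bin count are assumptions made for illustration; the abstract does not say which estimator the authors use.

```python
import numpy as np

def mutual_information_bits(feature, labels, n_bins=16):
    """Histogram estimate of I(feature; labels) in bits.

    feature : 1-D continuous values, one per frame (hypothetical input,
              e.g. the magnitude of one modulation-frequency component).
    labels  : 1-D discrete labels, one per frame (e.g. phoneme IDs).
    n_bins  : number of equal-width bins used to quantize the feature.
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    # Quantize the feature into equal-width bins over its observed range.
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    f_idx = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)
    # Map arbitrary label values to consecutive integers.
    _, l_idx = np.unique(labels, return_inverse=True)
    # Joint histogram, normalized to a joint probability table.
    joint = np.zeros((n_bins, int(l_idx.max()) + 1))
    np.add.at(joint, (f_idx, l_idx), 1.0)
    joint /= joint.sum()
    p_f = joint.sum(axis=1, keepdims=True)  # marginal over feature bins
    p_l = joint.sum(axis=0, keepdims=True)  # marginal over labels
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_f @ p_l)[nz])))

# Toy check: a feature that determines a balanced binary label carries 1 bit;
# a feature that is independent of the label carries 0 bits.
mi_dependent = mutual_information_bits([0.0, 1.0] * 50, [0, 1] * 50)
mi_independent = mutual_information_bits([0.0, 0.0, 1.0, 1.0] * 25, [0, 1, 0, 1] * 25)
```

Repeating such an estimate for each modulation frequency would give a curve of phonetic information against modulation frequency. Plug-in estimates of this kind are biased for small sample sizes, so the bin count and the amount of data both matter.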

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing