Applying Speech Tempo-Derived Features, BoAW and Fisher Vectors to Detect Elderly Emotion and Speech in Surgical Masks
Gábor Gosztolya, László Tóth

TL;DR
This paper explores the use of speech tempo features, BoAW, and Fisher vectors to improve emotion detection in elderly speakers and assess the impact of surgical masks on speech, achieving notable improvements in emotion recognition.
Contribution
It introduces the use of phone-level recognition features related to speech rate and hesitations for emotion detection, and applies these to the elderly and masked speech challenges.
Findings
Improved arousal and valence detection with tempo features in elderly speech.
No significant effect of masks on speech rate features.
Tempo features enhanced emotion recognition performance.
Abstract
The 2020 INTERSPEECH Computational Paralinguistics Challenge (ComParE) consists of three Sub-Challenges, where the tasks are to identify the level of arousal and valence of elderly speakers, determine whether the actual speaker is wearing a surgical mask, and estimate the actual breathing of the speaker. In our contribution to the Challenge, we focus on the Elderly Emotion and the Mask Sub-Challenges. Besides utilizing standard or close-to-standard features such as ComParE functionals, Bag-of-Audio-Words and Fisher vectors, we exploit the fact that emotion is related to the velocity of speech (i.e., speech rate). To utilize this, we perform phone-level recognition using an ASR system, and extract features from the output such as articulation tempo, speech tempo, and various attributes measuring the amount of pauses. We also hypothesize that wearing a surgical mask makes the speaker feel uneasy,…
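As a rough illustration of the tempo-derived features named in the abstract, the following minimal sketch computes them from a phone-level ASR alignment. It is not the authors' implementation. The input format (label, start, end in seconds), the pause labels `sil` and `sp`, the function name `tempo_features`, and the exact definitions are all assumptions made here: speech tempo is taken as phones per second over the whole utterance, and articulation tempo as phones per second with pauses excluded.

```python
from typing import Dict, List, Set, Tuple

# One aligned unit from the recognizer: (label, start_sec, end_sec).
# This tuple layout is an assumption, not a format given in the paper.
Segment = Tuple[str, float, float]

# Hypothetical labels that the ASR system uses for silent pauses.
PAUSE_LABELS: Set[str] = {"sil", "sp"}


def tempo_features(segments: List[Segment],
                   pause_labels: Set[str] = PAUSE_LABELS) -> Dict[str, float]:
    """Compute simple tempo and pause statistics for one utterance."""
    total_dur = sum(end - start for _, start, end in segments)
    pauses = [seg for seg in segments if seg[0] in pause_labels]
    pause_dur = sum(end - start for _, start, end in pauses)
    n_phones = len(segments) - len(pauses)
    speech_dur = total_dur - pause_dur

    return {
        # Phones per second, pauses included in the duration.
        "speech_tempo": n_phones / total_dur if total_dur > 0 else 0.0,
        # Phones per second, pauses excluded from the duration.
        "articulation_tempo": n_phones / speech_dur if speech_dur > 0 else 0.0,
        # Two possible measures of the amount of pausing.
        "pause_count": float(len(pauses)),
        "pause_ratio": pause_dur / total_dur if total_dur > 0 else 0.0,
    }


example = [
    ("sil", 0.0, 0.5), ("h", 0.5, 0.75), ("e", 0.75, 1.0),
    ("sil", 1.0, 1.5), ("l", 1.5, 1.75), ("o", 1.75, 2.0),
]
features = tempo_features(example)
```

In a setup like the one the abstract describes, a per-utterance feature vector of this kind could be concatenated with, or fused alongside, the ComParE functionals, Bag-of-Audio-Words and Fisher vector representations before classification.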
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Phonetics and Phonology Research
