Foundation Model Assisted Automatic Speech Emotion Recognition: Transcribing, Annotating, and Augmenting

Tiantian Feng; Shrikanth Narayanan

arXiv:2309.08108 · cs.SD · September 18, 2023 · 1 citation

Open Access

TL;DR

This paper investigates how foundational models can automate and improve speech emotion recognition by transcribing, annotating, and augmenting datasets, reducing manual effort and enhancing system performance.

Contribution

It introduces a novel approach leveraging foundational models to automate transcription, annotation, and augmentation in speech emotion recognition tasks.

Findings

01 Foundational models improve transcription quality for SER.

02 Combining multiple LLM outputs enhances emotion annotation accuracy.

03 Augmentation of datasets with unlabeled speech is feasible and beneficial.
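As an illustration of the second finding, one simple way to combine emotion labels from several LLMs is a majority vote. This page does not state how the paper combines outputs, so the voting rule, the function name `majority_vote`, the tie handling, and the example labels below are all assumptions, offered only as a minimal sketch.

```python
from collections import Counter


def majority_vote(labels, fallback=None):
    """Return the most frequent label, or `fallback` on a tie or empty input.

    Hypothetical helper: the aggregation rule is an assumption, not the
    method described in the paper.
    """
    if not labels:
        return fallback
    counts = Counter(labels).most_common()
    # If the top two labels have the same count, there is no clear winner.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


# Made-up annotations for one utterance from three different LLMs.
annotations = ["happy", "neutral", "happy"]
print(majority_vote(annotations))
```

Returning a fallback on ties makes it easy to route ambiguous utterances to a human annotator or to drop them, rather than silently picking one label.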

Abstract

Significant advances are being made in speech emotion recognition (SER) using deep learning models. Nonetheless, training SER systems remains challenging, requiring both time and costly resources. Like many other machine learning tasks, acquiring datasets for SER requires substantial data annotation efforts, including transcription and labeling. These annotation processes present challenges when attempting to scale up conventional SER systems. Recent developments in foundational models have had a tremendous impact, giving rise to applications such as ChatGPT. These models have enhanced human-computer interaction and bring unique possibilities for streamlining data collection in fields like SER. In this research, we explore the use of foundational models to assist in automating SER, from transcription and annotation to augmentation. Our study demonstrates that these models can…

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Speech and dialogue systems