Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition

Alexandra Saliba; Yuanchao Li; Ramon Sanabria; Catherine Lai

arXiv:2402.02617 · cs.CL · February 6, 2024 · 1 citation

TL;DR

This paper investigates whether layer-wise acoustic word embeddings (AWEs) derived from self-supervised speech models can improve speech emotion recognition (SER). It analyzes their layer-specific properties and compares them with raw self-supervised representations.

Contribution

The paper proposes a layer-wise similarity measurement between AWEs and word embeddings, evaluates the contribution of AWEs to SER, and compares them with other speech features on two corpora, IEMOCAP and ESD.

Findings

01. AWEs capture significant acoustic context.

02. AWEs achieve competitive SER accuracy.

03. Layer-wise analysis reveals useful hierarchical information.

Abstract

The efficacy of self-supervised speech models has been validated, yet the optimal utilization of their representations remains challenging across diverse tasks. In this study, we delve into Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks. AWEs have previously shown utility in capturing acoustic discriminability. In light of this, we propose measuring layer-wise similarity between AWEs and word embeddings, aiming to further investigate the inherent context within AWEs. Moreover, we evaluate the contribution of AWEs, in comparison to other types of speech features, in the context of Speech Emotion Recognition (SER). Through a comparative experiment and a layer-wise accuracy analysis on two distinct corpora, IEMOCAP and ESD, we explore differences between AWEs and raw self-supervised…
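To make the idea of a layer-wise comparison between AWEs and word embeddings concrete, here is a minimal sketch in plain Python. It rests on assumptions that the excerpt above does not confirm: the AWE for a word is taken to be the mean of the frame vectors inside that word's segment, and similarity is scored by correlating pairwise cosine similarities in the two embedding spaces (a representational-similarity style measure, chosen here because speech and text embeddings generally differ in dimensionality). The function names `mean_pool`, `pairwise_cosines`, and `layerwise_similarity`, and all toy values, are hypothetical and are not taken from the paper.

```python
import math


def mean_pool(frames):
    """Average equal-length frame vectors into one fixed-length vector (assumed AWE)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_cosines(vectors):
    """Cosine similarity for every unordered pair of vectors, in a fixed order."""
    return [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def layerwise_similarity(frames_by_layer, word_embeddings):
    """Return one score per layer.

    frames_by_layer[layer][word] is the list of frame vectors for that word
    segment at that layer; word_embeddings[word] is the text embedding.
    """
    text_sims = pairwise_cosines(word_embeddings)
    scores = []
    for layer_frames in frames_by_layer:
        awes = [mean_pool(word_frames) for word_frames in layer_frames]
        scores.append(pearson(pairwise_cosines(awes), text_sims))
    return scores


# Toy data (made up): three words, two layers, 2-D speech frames, 3-D text vectors.
word_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
frames_by_layer = [
    # Layer 0: words 0 and 1 sound alike, word 2 differs (matches the text space).
    [[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1]], [[0.0, 1.0], [0.2, 0.8]]],
    # Layer 1: words 0 and 2 are close instead (disagrees with the text space).
    [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.1]]],
]
scores = layerwise_similarity(frames_by_layer, word_embeddings)
for layer, score in enumerate(scores):
    print(f"layer {layer}: similarity {score:.3f}")
```

In this toy setup the first layer's word geometry agrees with the text embeddings and the second does not, so the first layer scores higher. With a real model, the frame vectors would come from each layer's hidden states and the word segments from forced alignments; both of those steps are outside this sketch.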

Taxonomy

Topics: Speech Recognition and Synthesis