# MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations

**Authors:** Hongchen Song, Long Zhang, Meixian Gao, Hengyuan Zhang, Thomas Hain, Linlin Shan

PMC · DOI: 10.1038/s41598-025-94727-2 · Scientific Reports · 2025-07-01

## TL;DR

This paper introduces MS-EmoBoost, a method that improves speech emotion recognition (SER) by using emotional information from MFCCs and spectrograms to guide and enhance self-supervised speech features such as wav2vec 2.0 Base.

## Contribution

MS-EmoBoost is a novel strategy that uses deep emotional information from Mel-frequency cepstral coefficients (MFCC) and spectrograms as guidance to enhance the emotional representation capability of self-supervised speech features.

## Key findings

- MS-EmoBoost improves the emotional representation capability of self-supervised features like wav2vec 2.0 Base.
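- With wav2vec 2.0 Base features, the method reaches competitive weighted accuracy (WA) and unweighted accuracy (UA) on three benchmark datasets: IEMOCAP (WA 72.10%, UA 72.91%), EMODB (WA 92.45%, UA 92.62%), and EMOVO (WA 86.88%, UA 87.51%).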
- The approach is effective not only for wav2vec 2.0 Base but also for other self-supervised features.

## Abstract

Extracting richer emotional representations from raw speech is one of the key approaches to improving the accuracy of Speech Emotion Recognition (SER). In recent years, there has been a trend toward using self-supervised learning (SSL) to extract SER features, owing to the exceptional performance of SSL in Automatic Speech Recognition (ASR). However, existing SSL methods are not sufficiently sensitive to emotional information, which makes them less effective for SER tasks. To overcome this issue, this study proposes MS-EmoBoost, a novel strategy for enhancing self-supervised speech emotion representations. Specifically, MS-EmoBoost uses the deep emotional information from the Mel-frequency cepstral coefficient (MFCC) and spectrogram as guidance to enhance the emotional representation capabilities of self-supervised features. To determine the effectiveness of our proposed approach, we conduct comprehensive experiments on three benchmark speech emotion datasets: IEMOCAP, EMODB, and EMOVO. The SER performance is measured by weighted accuracy (WA) and unweighted accuracy (UA). The experimental results show that our method successfully enhances the emotional representation capability of wav2vec 2.0 Base features, achieving competitive performance in SER tasks (IEMOCAP: WA 72.10%, UA 72.91%; EMODB: WA 92.45%, UA 92.62%; EMOVO: WA 86.88%, UA 87.51%), and proves effective for other self-supervised features.
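
The abstract reports WA and UA without defining them. As a minimal sketch, assuming the definitions commonly used in SER work (WA as overall accuracy across all utterances, UA as the mean of per-class recalls), the two metrics could be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly.

    Assumed definition of WA; larger classes contribute more to the score.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally.

    Assumed definition of UA; classes are taken from the labels in y_true.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (hypothetical labels, not data from the paper).
y_true = ["angry"] * 6 + ["sad"] * 2 + ["happy"] * 2
y_pred = ["angry"] * 6 + ["sad", "angry"] + ["angry", "angry"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

On imbalanced data the two metrics diverge: a model that favors the majority class can score well on WA while UA exposes its weak recall on minority emotions, which is one reason SER results are usually reported with both.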

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12219305/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12219305/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/PMC12219305/full.md

---
Source: https://tomesphere.com/paper/PMC12219305