CopyPaste: An Augmentation Method for Speech Emotion Recognition
Raghavendra Pappagari, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

TL;DR
This paper introduces CopyPaste, a data augmentation method for speech emotion recognition that concatenates emotional and neutral utterances, improving model robustness and accuracy across multiple datasets and noise conditions.
Contribution
The study proposes a new augmentation technique, CopyPaste, that leverages concatenation of emotional and neutral speech to enhance SER performance, outperforming traditional noise augmentation methods.
Findings
CopyPaste improves SER accuracy on all tested datasets.
CopyPaste outperforms noise augmentation in experiments.
Combining CopyPaste with noise augmentation yields further improvements.
Abstract
Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictates a speaker's overall perceived emotion in a recording, concatenation of an emotional (emotion E) and a neutral utterance can still be labeled with emotion E. We hypothesize that SER performance can be improved using these concatenated utterances in model training. To verify this, three CopyPaste schemes are tested on two deep learning models: one trained independently and another using transfer learning from an x-vector model, a speaker recognition model. We observed that all three CopyPaste…
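The core idea in the abstract, that an emotional utterance concatenated with a neutral one can keep the emotional label, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `copy_paste`, the `(waveform, label)` tuple representation, the `"neutral"` label string, and the random choice of concatenation order are all assumptions made for illustration. The three CopyPaste schemes mentioned in the abstract are not described in this summary and are not reproduced here.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical representation: an utterance is (waveform samples, emotion label).
Utterance = Tuple[List[float], str]


def copy_paste(
    emotional: Utterance,
    neutral: Utterance,
    rng: Optional[random.Random] = None,
) -> Utterance:
    """Concatenate an emotional and a neutral utterance, keeping the emotional label.

    Assumption: the order of the two segments is chosen at random; the
    summary does not say how the paper orders them.
    """
    rng = rng or random.Random()
    emo_wave, emo_label = emotional
    neu_wave, neu_label = neutral

    if neu_label != "neutral":
        raise ValueError("second utterance must be labeled 'neutral'")
    if emo_label == "neutral":
        raise ValueError("first utterance must carry a non-neutral emotion")

    # Randomly place the neutral segment before or after the emotional one.
    if rng.random() < 0.5:
        wave = emo_wave + neu_wave
    else:
        wave = neu_wave + emo_wave

    # The concatenated utterance inherits emotion E from the emotional segment.
    return wave, emo_label


# Usage: augment one "angry" utterance with one neutral utterance.
augmented_wave, augmented_label = copy_paste(
    ([0.1, 0.2], "angry"),
    ([0.0, 0.0, 0.0], "neutral"),
    random.Random(0),
)
```

In this sketch both utterances are assumed to share a sampling rate, and real pipelines would typically operate on audio arrays or extracted features rather than Python lists.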
