Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers
Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa

TL;DR
This paper explores the use of vision transformer models, specifically ViT and BEiT, for speech emotion recognition in human-robot interaction, demonstrating improved accuracy through fine-tuning and ensemble methods on individual speech data.
Contribution
It introduces a novel application of vision transformers for speech emotion recognition in HRI and shows how fine-tuning and ensembling enhance individual emotion classification accuracy.
Findings
Fine-tuning vision transformers improves emotion recognition accuracy.
Ensembling ViT and BEiT models yields the best results.
Models effectively classify four primary emotions from speech.
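The ensembling finding can be illustrated with a minimal sketch, assuming soft voting (averaging per-class probabilities from the two models). The summary does not say which ensembling rule the authors used, so the averaging scheme, the function names, the placeholder labels, and the example logits below are all hypothetical and are not taken from the paper.

```python
import math

# Placeholder labels: the summary mentions four primary emotions
# but does not name them, so these names are hypothetical.
LABELS = ["emotion_a", "emotion_b", "emotion_c", "emotion_d"]


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(logits_per_model, weights=None):
    """Weighted average of per-model class probabilities.

    Assumption: soft voting with equal weights by default; the
    paper's actual ensembling rule may differ.
    """
    n_models = len(logits_per_model)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    probs = [softmax(logits) for logits in logits_per_model]
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]


def predict(logits_per_model, labels=LABELS):
    """Return the label with the highest averaged probability."""
    avg = soft_vote(logits_per_model)
    best = max(range(len(avg)), key=avg.__getitem__)
    return labels[best], avg


# Made-up logits standing in for the outputs of a fine-tuned ViT
# and a fine-tuned BEiT on the same speech sample.
vit_logits = [2.0, 0.5, 0.1, -1.0]
beit_logits = [0.2, 1.5, 0.0, -0.5]
label, avg = predict([vit_logits, beit_logits])
print(label, [round(p, 3) for p in avg])
```

Averaging probabilities rather than raw logits keeps each model's contribution on the same scale, which is one common reason to prefer soft voting when two models may be calibrated differently.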
Abstract
Emotions are an essential element in verbal communication, so understanding individuals' affect during a human-robot interaction (HRI) becomes imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models for individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT and BEiT-based models and tested these models on unseen speech samples from the participants. In the results, we show that fine-tuning vision transformers on benchmark datasets and then using either these…
Taxonomy
Topics: IoT-based Smart Home Systems · Robotics and Automated Systems · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Dense Connections · Residual Connection · Multi-Head Attention · Vision Transformer · Focus
