A vector quantized masked autoencoder for audiovisual speech emotion recognition

Samir Sadok; Simon Leglaive; Renaud Séguier

arXiv:2305.03568 · cs.SD · May 12, 2025 · 5 cites

Open Access

TL;DR

This paper introduces VQ-MAE-AV, a self-supervised multimodal model that learns audiovisual speech representations from unlabeled data using masked autoencoders and vector quantization, improving emotion recognition.

Contribution

The paper presents a novel self-supervised framework combining vector quantized autoencoders and masked autoencoders for audiovisual speech emotion recognition.

Findings

1. Achieves state-of-the-art results on multiple emotion recognition datasets.

2. Effectively leverages unlabeled audiovisual speech data for representation learning.

3. Improves robustness in both controlled and in-the-wild conditions.

Abstract

An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder-decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech, for the task of reconstructing randomly…
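The two stages named in the abstract, compressing each modality into discrete tokens and then hiding part of the token sequence before reconstruction, can be illustrated with a toy sketch. This is not the authors' implementation: the codebooks, the 2-D "frames", the 50% masking ratio, the `MASK` id, and the helper names `quantize` and `mask_tokens` are all hypothetical, and a plain nearest-neighbour lookup stands in for a trained VQ-VAE encoder.

```python
import random

MASK = -1  # hypothetical id reserved for masked positions


def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked copy and the sorted list of masked positions.
    """
    n_mask = int(round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions


# Toy data (illustrative values only): one codebook per modality.
audio_codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
visual_codebook = [(1.0, 1.0), (-1.0, 1.0), (0.0, 0.0)]
audio_frames = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7), (0.2, 0.1)]
visual_frames = [(1.1, 0.9), (-1.2, 1.1), (0.0, 0.2), (0.8, 0.9)]

audio_tokens = quantize(audio_frames, audio_codebook)
visual_tokens = quantize(visual_frames, visual_codebook)

# Hide half of each token sequence; a masked autoencoder would be
# trained to predict the original ids at the masked positions.
rng = random.Random(0)
masked_audio, audio_pos = mask_tokens(audio_tokens, 0.5, rng)
masked_visual, visual_pos = mask_tokens(visual_tokens, 0.5, rng)

print(audio_tokens, masked_audio, audio_pos)
print(visual_tokens, masked_visual, visual_pos)
```

In a real system the masked token sequences from both modalities would be fed to an attention-based encoder-decoder, with the original token ids at the masked positions serving as reconstruction targets.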

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques

Methods: Masked autoencoder