A multimodal dynamical variational autoencoder for audiovisual speech representation learning

Samir Sadok; Simon Leglaive; Laurent Girin; Xavier Alameda-Pineda; Renaud Séguier

arXiv:2305.03582·cs.SD·February 21, 2024·1 cites

TL;DR

This paper introduces a multimodal dynamical variational autoencoder that learns unsupervised audiovisual speech representations, effectively disentangling static and dynamic factors, and improves emotion recognition accuracy using minimal labeled data.

Contribution

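MDVAE structures its latent space to separate the dynamical factors shared between the audio and visual modalities from those specific to each modality, and adds a static latent variable for information that is constant over a sequence. Training is unsupervised and proceeds in two stages: a VQ-VAE is first learned independently for each modality, then MDVAE is learned on the intermediate VQ-VAE representations taken before quantization.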

Findings

01

MDVAE effectively encodes audiovisual speech in its latent space.

02

Static representations learned by MDVAE improve emotion recognition accuracy.

03

With few labeled data, the model outperforms unimodal baselines and a state-of-the-art supervised model on emotion recognition.

Abstract

In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs…
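The latent structure described in the abstract can be pictured with a minimal sketch. It is not taken from the official repository: the names `LatentSequence`, `sample_placeholder_latents`, and `decoder_inputs` are hypothetical, the dimensions are arbitrary, and the i.i.d. Gaussian sampling is only a stand-in for the temporal model, which the abstract does not detail. The choice of which variables feed each modality's decoder is likewise an assumption inferred from the shared versus modality-specific split.

```python
import random
from dataclasses import dataclass


@dataclass
class LatentSequence:
    """Hypothetical container for the latent variables of one sequence."""
    w: list      # static: one vector for the whole sequence
    z_av: list   # dynamical, shared by both modalities: one vector per frame
    z_a: list    # dynamical, audio-specific: one vector per frame
    z_v: list    # dynamical, visual-specific: one vector per frame


def _gauss_vector(rng, dim):
    # Placeholder sampling only; the real model uses temporal dependencies.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sample_placeholder_latents(num_frames, dims, seed=0):
    """Draw latents with the assumed layout (dims keys: w, av, a, v)."""
    rng = random.Random(seed)
    return LatentSequence(
        w=_gauss_vector(rng, dims["w"]),
        z_av=[_gauss_vector(rng, dims["av"]) for _ in range(num_frames)],
        z_a=[_gauss_vector(rng, dims["a"]) for _ in range(num_frames)],
        z_v=[_gauss_vector(rng, dims["v"]) for _ in range(num_frames)],
    )


def decoder_inputs(latents, t):
    """Assumed conditioning at frame t: each modality sees the static and
    shared variables plus its own modality-specific variable."""
    shared = latents.w + latents.z_av[t]
    audio = shared + latents.z_a[t]
    visual = shared + latents.z_v[t]
    return audio, visual


if __name__ == "__main__":
    lat = sample_placeholder_latents(5, {"w": 4, "av": 3, "a": 2, "v": 2})
    audio, visual = decoder_inputs(lat, t=0)
    print(len(audio), len(visual))
```

In this layout, `w` is the only variable that does not change with the frame index, which is why a static representation of this kind is a natural input for a sequence-level task such as emotion recognition.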

Code & Models

Repositories

samsad35/code-mdvae
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Advanced Adaptive Filtering Techniques