Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Hugo Bohy; Minh Tran; Kevin El Haddad; Thierry Dutoit; Mohammad Soleymani

arXiv:2508.17502·cs.CV·August 26, 2025

TL;DR

Social-MAE is a transformer-based multimodal autoencoder pre-trained on social audiovisual data, achieving state-of-the-art results in emotion recognition and laughter detection, and competitive results in personality estimation.

Contribution

The paper introduces Social-MAE, a self-supervised pre-training framework for audiovisual social data that extends CAV-MAE to accept more input frames and pre-trains it on a larger dataset of social interaction.
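The masked-autoencoder idea behind this kind of pre-training can be pictured with a minimal sketch: hide a random subset of the input tokens and ask the model to reconstruct them from the visible ones. The snippet below shows only the index-selection step. The function name `random_mask`, the 0.75 mask ratio, and the token counts are illustrative assumptions, not values taken from the paper or from the CAV-MAE code.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=0):
    """Split token indices into (visible, masked) lists, MAE-style.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)  # random permutation of token positions
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])    # hidden from the encoder
    visible = sorted(indices[num_masked:])   # fed to the encoder
    return visible, masked


# Assumed sizes: 8 video frames x 4 patches each, and 16 audio spectrogram patches.
video_tokens = 8 * 4
audio_tokens = 16

video_visible, video_masked = random_mask(video_tokens, mask_ratio=0.75, seed=1)
audio_visible, audio_masked = random_mask(audio_tokens, mask_ratio=0.75, seed=2)

print(len(video_visible), len(video_masked))
print(len(audio_visible), len(audio_masked))
```

In this sketch, feeding more frames simply raises `video_tokens`, which is one reason masking matters: only the visible fraction passes through the encoder, so the cost of a longer input grows more slowly than the input itself.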

Findings

01 State-of-the-art in multimodal emotion recognition

02 Best performance in laughter detection

03 Competitive results in personality estimation

Abstract

Human social behaviors are inherently multimodal, necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by fine-tuning and evaluating it on different social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent…
