Detecting expressions with multimodal transformers
Srinivas Parthasarathy, Shiva Sundaram

TL;DR
This paper explores deep learning models, including transformers, for detecting a person's expressions from audio-visual cues. The transformer models improve on recurrent models, and the results emphasize the importance of visual data.
Contribution
It introduces a transformer-based architecture for audio-visual expression detection, outperforming recurrent models and highlighting multimodal integration benefits.
Findings
Transformer models outperform recurrent baselines by ~2% in arousal and valence detection.
Multimodal models outperform single-modality models by up to 3.6%.
The visual modality is crucial for accurate expression detection.
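The summary above does not name the metric behind these percentages. Valence and arousal predictions on Aff-Wild2 are commonly scored with the concordance correlation coefficient (CCC), so the following is a minimal sketch under the assumption that a CCC-style score is being compared. The function name `concordance_cc` and the sample values are hypothetical and used only for illustration.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))**2)
    Population (1/N) variance and covariance are used here.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and identical; CCC is undefined here.
        raise ValueError("CCC undefined for identical constant sequences")
    return 2 * cov / denom


# Hypothetical per-frame valence annotations and two sets of predictions.
gold = [0.1, 0.4, 0.3, -0.2, -0.5]
model_a = [0.0, 0.3, 0.3, -0.1, -0.3]
model_b = [0.1, 0.4, 0.2, -0.2, -0.4]

# An "absolute" gain is the plain difference between two scores,
# not a ratio relative to the weaker model.
absolute_gain = concordance_cc(model_b, gold) - concordance_cc(model_a, gold)
```

Unlike plain Pearson correlation, CCC also penalizes a constant offset or scale mismatch between predictions and annotations, which is why a shifted copy of the annotations scores below 1.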
Abstract
Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person's audio-visual expression, which includes tone of voice and facial expression, serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of a user's expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to the current state of the art. Next, we propose a transformer architecture with encoder layers that better integrate audio-visual features for expression tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than the baseline architecture with recurrent layers with absolute…
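The abstract does not spell out how the encoder layers combine the two modalities. As a rough illustration only, the sketch below assumes frame-level concatenation of audio and visual features followed by a single scaled dot-product self-attention step, with identity query/key/value projections in place of learned weights. The names `fuse_frames` and `self_attention` and all feature values are hypothetical; this is not the authors' architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_frames(audio, visual):
    """Concatenate per-frame audio and visual feature vectors (assumed fusion)."""
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must have the same frame count")
    return [a + v for a, v in zip(audio, visual)]


def self_attention(frames):
    """Single-head scaled dot-product self-attention with identity projections.

    Each output frame is a weighted average of all input frames, with weights
    given by softmax(q . k / sqrt(d)). A real encoder layer would add learned
    projections, multiple heads, residual connections, and a feed-forward block.
    """
    dim = len(frames[0])
    attended = []
    for query in frames:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * key[i] for w, key in zip(weights, frames)) for i in range(dim)]
        )
    return attended


# Hypothetical features: 3 frames, 2 audio dims and 1 visual dim per frame.
audio = [[0.2, 0.1], [0.4, 0.0], [0.1, 0.3]]
visual = [[0.9], [0.7], [0.8]]

fused = fuse_frames(audio, visual)
context = self_attention(fused)
```

The point of the sketch is that every output frame can draw on every other frame of both modalities in one step, whereas a recurrent layer passes information along the sequence one frame at a time.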
