Disentanglement for audio-visual emotion recognition using multitask setup
Raghuveer Peri, Srinivas Parthasarathy, Charles Bradshaw, Shiva Sundaram

TL;DR
This paper proposes a multitask learning framework for audio-visual emotion recognition that disentangles emotion-specific features from person-identity information while maintaining emotion recognition performance.
Contribution
It introduces a novel disentanglement approach within a multitask setup to isolate emotion-related features from identity cues in multimodal data.
Findings
Reported up to 13% disentanglement between emotion and person-identity information.
Maintained state-of-the-art emotion recognition performance.
Compared three disentanglement techniques.
Abstract
Deep learning models trained on audio-visual data have been successfully used to achieve state-of-the-art performance for emotion recognition. In particular, models trained with multitask learning have shown additional performance improvements. However, such multitask models entangle information between the tasks, encoding the mutual dependencies present in label distributions in the real-world data used for training. This work explores the disentanglement of multimodal signal representations for the primary task of emotion recognition and a secondary person identification task. In particular, we developed a multitask framework to extract low-dimensional embeddings that aim to capture emotion-specific information, while containing minimal information related to person identity. We evaluate three different techniques for disentanglement and report results of up to 13% disentanglement…
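Illustrative sketch
The abstract excerpt does not name the three disentanglement techniques, so the following is only a rough illustration of the general idea, not the authors' method. It assumes gradient reversal, one common adversarial approach: a shared encoder feeds an emotion head and an identity head, and the identity gradient is negated before it reaches the encoder so that the embedding is pushed away from identity information. The function names, the linear layers, the toy dimensions, and the weight `lam` are all hypothetical.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with max-subtraction for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_and_grad(logits, labels):
    # Mean cross-entropy and its gradient with respect to the logits.
    n = logits.shape[0]
    probs = softmax(logits)
    loss = -np.log(probs[np.arange(n), labels]).mean()
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    return loss, grad / n

def multitask_grads(x, emo_y, id_y, W_enc, W_emo, W_id, lam):
    """Gradients for one step of a shared encoder with two linear heads.

    Assumption: the identity head is trained normally, but its gradient is
    negated and scaled by `lam` before reaching the encoder (gradient reversal).
    """
    z = x @ W_enc                      # shared low-dimensional embedding
    emo_loss, g_emo = cross_entropy_and_grad(z @ W_emo, emo_y)
    id_loss, g_id = cross_entropy_and_grad(z @ W_id, id_y)

    grad_W_emo = z.T @ g_emo           # emotion head: ordinary gradient
    grad_W_id = z.T @ g_id             # identity head: ordinary gradient

    grad_z_emo = g_emo @ W_emo.T
    grad_z_id = g_id @ W_id.T
    # Reversal: the encoder descends the emotion loss but ascends the identity loss.
    grad_W_enc = x.T @ (grad_z_emo - lam * grad_z_id)
    return emo_loss, id_loss, grad_W_enc, grad_W_emo, grad_W_id

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))           # toy stand-in for fused audio-visual features
emo_y = rng.integers(0, 4, size=8)     # 4 hypothetical emotion classes
id_y = rng.integers(0, 5, size=8)      # 5 hypothetical speakers
W_enc = rng.normal(scale=0.1, size=(16, 6))
W_emo = rng.normal(scale=0.1, size=(6, 4))
W_id = rng.normal(scale=0.1, size=(6, 5))

emo_loss, id_loss, gW_enc, gW_emo, gW_id = multitask_grads(
    x, emo_y, id_y, W_enc, W_emo, W_id, lam=0.5)
```

Setting `lam=0` in this sketch recovers an encoder trained on emotion alone, while larger values trade emotion accuracy against removal of identity cues.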
