Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction
Dinithi Dissanayake, Shaveen Silva, Ovindu Atukorala, Prasanth Sasikumar, Suranga Nanayakkara

TL;DR
This paper introduces a two-stage multimodal framework for predicting emotion mimicry intensity from video clips. It combines textual, acoustic, and visual representations with an optional motion branch, and it placed third in the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge.
Contribution
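The paper presents a staged multimodal approach: modality-specific encoders are trained independently, and their representations are then fused by a lightweight regressor with modality dropout and controlled encoder adaptation. The result is a practical baseline for emotion mimicry intensity prediction.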
Findings
Best validation result: an average Pearson correlation of 0.4722, obtained with text-audio-vision-motion fusion under the expanded 4:1 split.
Achieved third place in the EMI challenge with a Pearson correlation of 0.57 on the test set.
The motion branch yields only very slight gains, but its behavior still offers insights.
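The scores above are average Pearson correlations. As a minimal sketch, assuming the average is taken as the unweighted mean of the per-dimension correlations over the six emotion dimensions (the summary does not spell out the exact evaluation protocol), the metric could be computed as follows. The function names `pearson` and `mean_pearson` are illustrative and do not come from the paper.

```python
import math

# The six target dimensions named in the abstract.
EMOTIONS = ("Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy")


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of floats.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(y_true, y_pred):
    """Unweighted mean of per-dimension Pearson correlations (assumed protocol).

    y_true, y_pred: lists of rows, one row per clip, one column per dimension.
    Returns (mean correlation, list of per-dimension correlations).
    """
    n_dims = len(y_true[0])
    per_dim = [
        pearson([row[d] for row in y_true], [row[d] for row in y_pred])
        for d in range(n_dims)
    ]
    return sum(per_dim) / n_dims, per_dim
```

Because the correlation is computed per dimension before averaging, a model that ranks clips well on one emotion but poorly on another is penalized on the second, instead of having the two pooled into a single score.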
Abstract
We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy, from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text-audio-vision-motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only very slight gains, its behavior can be…
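The fusion stage described in the abstract applies modality dropout before the regressor. The sketch below shows one common variant of that idea and is not the authors' implementation. It rests on several assumptions: each modality is dropped independently with probability `p_drop` (a hypothetical parameter), a dropped modality is replaced by zeros, at least one modality is always kept, and fusion is plain concatenation in a fixed order. The regressor itself and the controlled encoder adaptation are left out.

```python
import random

# Fixed concatenation order; "motion" is the optional branch.
MODALITIES = ("text", "audio", "vision", "motion")


def modality_dropout(embeddings, p_drop, rng, training=True):
    """Randomly zero out whole modality embeddings (assumed variant).

    embeddings: dict mapping modality name -> list of floats.
    p_drop: per-modality drop probability (hypothetical; not from the paper).
    rng: a random.Random instance, passed in for reproducibility.
    At least one modality is always kept so the fused input is never all zeros.
    """
    if not training or p_drop <= 0.0:
        # Inference path: return copies of every embedding unchanged.
        return {name: list(vec) for name, vec in embeddings.items()}
    names = list(embeddings)
    kept = [name for name in names if rng.random() >= p_drop]
    if not kept:
        kept = [rng.choice(names)]
    return {
        name: list(vec) if name in kept else [0.0] * len(vec)
        for name, vec in embeddings.items()
    }


def fuse(embeddings):
    """Concatenate embeddings in MODALITIES order, skipping absent modalities."""
    fused = []
    for name in MODALITIES:
        fused.extend(embeddings.get(name, []))
    return fused
```

Zeroing a dropped modality keeps the fused vector at a constant length, so the downstream regressor sees the same input shape whether or not a modality was dropped during training.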
