Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge
Masoumeh Chapariniya, Aref Farhadipour, Sarah Ebling, Volker Dellwo, Teodora Vukovic

TL;DR
This paper presents a multimodal system for blended emotion recognition that combines six encoder families through late probability fusion, including Gemini Embedding 2.0 video embeddings, and achieves competitive results in the BLEMORE Challenge.
Contribution
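It presents a late-fusion approach for blended emotion recognition that combines task-specific face, audio, and body-language encoders with a large multimodal embedding model (Gemini Embedding 2.0) and a layer-selection strategy for frozen Wav2Vec2 features, with state-of-the-art ensemble performance.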
Findings
Selecting prosody-encoding layers (6 to 12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161).
Salience threshold varies across individuals, indicating personalized expression.
Task-specific encoders dominate ensemble weight, emphasizing specialized features.
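The first finding, layer selection from a frozen encoder, can be sketched with plain NumPy. This is a minimal illustration, not the authors' implementation: the function name `pool_selected_layers`, the use of a simple mean over layers and time, and the convention that index 0 holds the pre-transformer embedding output are all assumptions. The random array stands in for real Wav2Vec2 hidden states.

```python
import numpy as np

def pool_selected_layers(hidden_states, first_layer=6, last_layer=12):
    """Average a contiguous range of encoder layers, then average over time.

    hidden_states: array of shape (num_layers + 1, num_frames, dim), where
    index 0 is assumed to be the pre-transformer embedding output.
    Mean pooling is an assumption; the source does not state how the
    selected layers are combined.
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    if not (0 <= first_layer <= last_layer < hidden_states.shape[0]):
        raise ValueError("layer range outside the available hidden states")
    selected = hidden_states[first_layer:last_layer + 1]  # (n_selected, T, D)
    return selected.mean(axis=(0, 1))                     # (D,)

# Stand-in for frozen encoder outputs: 13 states (embedding + 12 layers),
# 50 frames, 768 dimensions. Real features would come from the encoder.
rng = np.random.default_rng(0)
fake_states = rng.normal(size=(13, 50, 768))
utterance_embedding = pool_selected_layers(fake_states)
print(utterance_embedding.shape)
```

Because the encoder stays frozen, only the small classifier trained on top of such pooled vectors would need gradient updates.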
Abstract
We present our system for the BLEMORE Challenge at FG 2026 on blended emotion recognition with relative salience prediction. Our approach combines six encoder families through late probability fusion: an S4D-ViTMoE face encoder adapted with soft-label KL training, frozen layer-selective Wav2Vec2 audio features, finetuned body-language encoders (TimeSformer, VideoMAE), and -- for the first time in emotion recognition -- Gemini Embedding 2.0, a large multimodal model whose video embeddings produce competitive presence accuracy (ACCP = 0.320) from only 2 seconds of input. Three key findings emerge from our experiments: selecting prosody-encoding layers (6--12) from frozen Wav2Vec2 outperforms end-to-end finetuning (Score 0.207 vs. 0.161), as the non-verbal nature of BLEMORE audio makes phonetic layers irrelevant; the post-processing salience threshold varies from 0.05 to 0.43…
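As a rough illustration of late probability fusion followed by a salience threshold, the sketch below averages per-encoder class probabilities with fixed weights and then decodes a two-emotion blend. Everything specific here is an assumption made for illustration: the function names, the emotion labels, the example probabilities and weights, and the decoding rule (always take the top two classes, and call their salience "equal" when the probability gap falls below the threshold). The abstract does not specify the authors' actual rule.

```python
import numpy as np

def late_fusion(model_probs, weights):
    """Weighted average of per-model class probabilities."""
    probs = np.asarray(model_probs, dtype=np.float64)  # (n_models, n_classes)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                    # normalise the weights
    fused = w @ probs                                  # (n_classes,)
    return fused / fused.sum()

def decode_blend(fused, labels, salience_threshold):
    """Hypothetical rule: pick the top-2 emotions and compare their gap."""
    order = np.argsort(fused)[::-1]                    # indices, high to low
    first, second = order[0], order[1]
    gap = fused[first] - fused[second]
    salience = "equal" if gap < salience_threshold else "first_dominant"
    return labels[first], labels[second], salience

# Illustrative labels and probabilities, not taken from the paper.
labels = ["anger", "disgust", "fear", "happiness", "sadness"]
face = [0.50, 0.10, 0.10, 0.20, 0.10]
audio = [0.30, 0.10, 0.10, 0.40, 0.10]
fused = late_fusion([face, audio], weights=[0.5, 0.5])
print(decode_blend(fused, labels, salience_threshold=0.05))
```

With these made-up inputs the gap between the top two classes is about 0.10, so a threshold of 0.05 labels the first emotion as dominant while a threshold of 0.43 labels the blend as equal. This shows why the choice of threshold within the reported 0.05 to 0.43 range can change the predicted salience.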
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Social Robot Interaction and HRI
