Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition
Arman Martirosyan, Shahane Tigranyan, Maria Razzhivina, Artak Aslanyan, Nazgul Salikhova, Ilya Makarov, Andrey Savchenko, Aram Avetisyan

TL;DR
This paper introduces two multimodal frameworks, one for micro-gesture recognition and one for behavior-based emotion prediction, built on video and skeletal data. The authors report high accuracy on the iMiGUE dataset and second place in the MiGA 2025 Challenge.
Contribution
It presents novel multimodal fusion methods combining RGB, 3D pose, facial, and contextual data for fine-grained behavior analysis and emotion recognition.
Findings
Achieved high accuracy in micro-gesture classification.
Secured 2nd place in the MiGA 2025 Challenge.
Demonstrated effective multimodal fusion for emotion prediction.
Abstract
Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To comprehensively represent gestures, video and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. They are then integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and…
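The abstract names a Cross-Modal Token Fusion module but does not describe its internals. As a rough illustration only, the sketch below assumes one plausible reading: bidirectional cross-attention between video tokens and pose tokens, followed by mean pooling and concatenation. The function names (cross_attend, fuse_tokens), the token counts, the shared embedding width of 16, and the random weights are all hypothetical stand-ins. They are not the authors' implementation and not the real outputs of MViTv2-S or 2s-AGCN.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product attention: queries attend to context.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def fuse_tokens(video_tokens, pose_tokens, params):
    # ASSUMPTION: both streams were already projected to a shared width.
    # Each stream is enriched with information from the other (residual add),
    # then the two streams are mean-pooled and concatenated.
    video_enriched = video_tokens + cross_attend(video_tokens, pose_tokens, *params["v2p"])
    pose_enriched = pose_tokens + cross_attend(pose_tokens, video_tokens, *params["p2v"])
    return np.concatenate([video_enriched.mean(axis=0), pose_enriched.mean(axis=0)])


rng = np.random.default_rng(0)
dim = 16  # hypothetical shared embedding width

# Random placeholders for backbone outputs (8 video tokens, 5 pose tokens).
video_tokens = rng.normal(size=(8, dim))
pose_tokens = rng.normal(size=(5, dim))

# Random query/key/value projections for each attention direction.
params = {
    name: [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3)]
    for name in ("v2p", "p2v")
}

fused = fuse_tokens(video_tokens, pose_tokens, params)
print(fused.shape)  # (32,)
```

In a sketch like this, the fused vector would feed a classification head. Letting each stream attend to the other is one common way to allow appearance cues and joint motion to inform each other, but whether the paper's module works this way cannot be confirmed from the abstract alone.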
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Multimodal Machine Learning Applications
