CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset
Santosh Patapati, Trisanth Srinivasan, Amith Adiraju

TL;DR
This paper presents CLIP-MG, an architecture that combines skeletal pose and RGB data within a CLIP-based model for micro-gesture recognition on the iMiGUE dataset, targeting gestures that are subtle, involuntary, and low in movement amplitude.
Contribution
The paper introduces a pose-guided, semantics-aware CLIP-based model specifically designed for micro-gesture recognition, integrating skeletal pose features with visual data.
Findings
Achieved 61.82% Top-1 accuracy on the iMiGUE dataset.
Demonstrated the effectiveness of pose-guided semantic queries.
Highlighted challenges in fully adapting CLIP for micro-gesture recognition.
Abstract
Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP for Micro-Gesture recognition (CLIP-MG), a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82%. These results demonstrate both the potential of our approach and the remaining difficulty in fully adapting vision-language models like CLIP for micro-gesture recognition.
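The abstract names a gated multi-modal fusion mechanism but does not spell out its form. As a rough illustration only, the sketch below shows one common way such a gate can be written: a sigmoid gate computed from the concatenated visual and pose features, then used to blend the two element-wise. The function name `gated_fusion`, the gate parameterization, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, pose, weights, bias):
    """Blend two equal-length feature vectors with an element-wise gate.

    Hypothetical formulation (an assumption, not the paper's definition):
        g     = sigmoid(W @ [visual; pose] + b)
        fused = g * visual + (1 - g) * pose

    visual, pose: lists of d floats
    weights:      d rows, each a list of 2*d floats
    bias:         list of d floats
    """
    d = len(visual)
    if len(pose) != d or len(weights) != d or len(bias) != d:
        raise ValueError("visual, pose, weights, and bias must agree on d")

    concat = list(visual) + list(pose)
    fused = []
    for i in range(d):
        if len(weights[i]) != 2 * d:
            raise ValueError("each weight row must have length 2*d")
        # Gate logit for feature i depends on both modalities.
        z = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        g = sigmoid(z)
        # g near 1 keeps the visual feature, g near 0 keeps the pose feature.
        fused.append(g * visual[i] + (1.0 - g) * pose[i])
    return fused


if __name__ == "__main__":
    # Toy inputs chosen for illustration; not data from iMiGUE.
    visual = [1.0, 3.0]
    pose = [3.0, 1.0]
    zero_w = [[0.0] * 4 for _ in range(2)]
    # With zero weights and bias the gate is 0.5, so the result is the mean.
    print(gated_fusion(visual, pose, zero_w, [0.0, 0.0]))
```

In a trained model the gate weights would be learned, letting the network lean on pose where RGB cues are weak and vice versa; here they are fixed by hand purely to show the arithmetic.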
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Emotion and Mood Recognition
