CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset

Santosh Patapati; Trisanth Srinivasan; Amith Adiraju

arXiv:2506.16385·cs.CV·June 23, 2025

TL;DR

This paper presents CLIP-MG, a CLIP-based architecture that combines skeletal pose and RGB data for micro-gesture recognition on the iMiGUE dataset, targeting the difficulty of recognizing subtle, low-amplitude gestures.

Contribution

The paper introduces a pose-guided, semantics-aware CLIP-based model specifically designed for micro-gesture recognition, integrating skeletal pose features with visual data.

Findings

01

Achieved 61.82% Top-1 accuracy on the iMiGUE dataset.

02

Demonstrated the effectiveness of pose-guided semantic queries.

03

Highlighted challenges in fully adapting CLIP for micro-gesture recognition.

Abstract

Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP for Micro-Gesture recognition (CLIP-MG), a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82%. These results demonstrate both the potential of our approach and the remaining difficulty in fully adapting vision-language models like CLIP for micro-gesture recognition.
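The abstract names a gated multi-modal fusion mechanism but does not give its exact form. The sketch below shows one common way such a gate can work, offered only as an illustration and not as the authors' implementation: a per-dimension sigmoid gate, computed from the concatenated RGB and pose features, blends the two vectors. The function name `gated_fusion`, the parameters `gate_weights` and `gate_bias`, and the convex-combination form are all assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb_feat, pose_feat, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a learned gate.

    Hypothetical formulation (not taken from the paper):
        g_i     = sigmoid(w_i . [rgb; pose] + b_i)
        fused_i = g_i * rgb_i + (1 - g_i) * pose_i

    gate_weights holds one row per feature dimension; each row has one
    weight for every entry of the concatenated [rgb; pose] vector.
    """
    if len(rgb_feat) != len(pose_feat):
        raise ValueError("rgb_feat and pose_feat must have the same length")

    concat = list(rgb_feat) + list(pose_feat)
    fused = []
    for i, (r, p) in enumerate(zip(rgb_feat, pose_feat)):
        # Gate logit for dimension i, scored on both modalities.
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(logit)
        # g near 1 favors the RGB feature, g near 0 favors the pose feature.
        fused.append(g * r + (1.0 - g) * p)
    return fused


if __name__ == "__main__":
    rgb = [1.0, 2.0]
    pose = [3.0, 0.0]
    # Zero weights and bias give g = 0.5, so the result is a plain average.
    weights = [[0.0] * 4, [0.0] * 4]
    bias = [0.0, 0.0]
    print(gated_fusion(rgb, pose, weights, bias))
```

In a trained model the gate parameters would be learned jointly with the rest of the network, letting the model lean on pose features when the RGB signal is weak, and the reverse.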

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Emotion and Mood Recognition