Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization
Farida Mohsen, Ali Safa

TL;DR
This paper introduces a resource-efficient, multimodal framework for real-time human-robot interaction intent detection using RGB video, demonstrating strong generalization across different cameras and environments on embedded hardware.
Contribution
The work presents a novel approach combining pose and emotion cues with a data synthesis method to improve intent detection robustness on low-resource devices in diverse settings.
Findings
Achieved 0.95 AUROC in offline evaluations.
Attained 91% accuracy and 100% recall in real-world tests.
Demonstrated cross-camera and cross-environment robustness.
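For readers less familiar with the metrics above, the following is a minimal sketch of how AUROC, accuracy, and recall are conventionally computed for a binary intent label. The function names, the toy labels and scores, and the 0.5 threshold are illustrative assumptions and are not taken from the paper.

```python
def auroc(labels, scores):
    """Rank-based AUROC: the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_and_recall(labels, preds):
    """Accuracy over all samples; recall over the positive (intent) class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    if tp + fn == 0:
        raise ValueError("recall is undefined without positive samples")
    return correct / len(labels), tp / (tp + fn)


# Toy data (hypothetical): 1 = person intends to interact, 0 = passer-by.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.8, 0.3, 0.7, 0.9]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(auroc(labels, scores))
print(accuracy_and_recall(labels, preds))
```

In this toy case every true interaction is detected, so recall is perfect, while one false alarm lowers accuracy. That is the same pattern as a result that pairs 100% recall with lower accuracy.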
Abstract
Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction. We present a practical multimodal framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video. Unlike prior methods requiring RGB-D sensors or GPU acceleration, our approach runs on resource-constrained embedded hardware (Raspberry Pi 5, CPU-only). To address the severe class imbalance in natural human-robot interaction datasets, we introduce MINT-RVAE (Multimodal Recurrent Variational Autoencoder for Intent Sequence Generation), a novel approach that synthesizes temporally coherent pose-emotion-label sequences for data re-balancing. Comprehensive offline evaluations under cross-subject and cross-scene protocols demonstrate strong generalization performance,…
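The abstract does not say how the pose features are made camera-invariant or how the two modalities are combined. As a rough sketch of one common option, the code below centers 2D keypoints on the mid-hip, divides by torso length, and concatenates the result with emotion probabilities. The COCO-17 keypoint indices, the seven emotion classes, and the names `normalize_pose` and `fuse_frame` are assumptions for illustration, not the authors' implementation.

```python
import math

# Assumed COCO-17 keypoint indices; the paper's skeleton layout may differ.
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12


def _midpoint(a, b):
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def normalize_pose(keypoints):
    """Center (x, y) keypoints on the mid-hip and scale by torso length,
    which removes image translation and uniform scale."""
    hip = _midpoint(keypoints[L_HIP], keypoints[R_HIP])
    shoulder = _midpoint(keypoints[L_SHOULDER], keypoints[R_SHOULDER])
    torso = math.hypot(shoulder[0] - hip[0], shoulder[1] - hip[1])
    if torso == 0:
        raise ValueError("degenerate pose: zero torso length")
    flat = []
    for x, y in keypoints:
        flat.append((x - hip[0]) / torso)
        flat.append((y - hip[1]) / torso)
    return flat


def fuse_frame(keypoints, emotion_probs):
    """Per-frame feature vector: normalized pose followed by emotion scores."""
    return normalize_pose(keypoints) + list(emotion_probs)


# Hypothetical frame: 17 keypoints in pixels and 7 emotion probabilities.
pose = [(float(i), float(2 * i)) for i in range(17)]
emotions = [0.1, 0.0, 0.0, 0.7, 0.1, 0.0, 0.1]
features = fuse_frame(pose, emotions)
print(len(features))
```

Under these assumptions, the same person seen at a different image position or resolution yields the same pose features, which is one plausible route to the cross-camera robustness described above. A sequence model would then consume one such vector per frame.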
Taxonomy
Topics: Emotion and Mood Recognition · Social Robot Interaction and HRI · Human Pose and Action Recognition
