Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization

Farida Mohsen; Ali Safa

arXiv:2512.17958·cs.RO·December 23, 2025

TL;DR

This paper introduces a resource-efficient, multimodal framework for real-time human-robot interaction intent detection using RGB video, demonstrating strong generalization across different cameras and environments on embedded hardware.

Contribution

The work presents a novel approach combining pose and emotion cues with a data synthesis method to improve intent detection robustness on low-resource devices in diverse settings.

Findings

01 Achieved 0.95 AUROC in offline evaluations.

02 Attained 91% accuracy and 100% recall in real-world tests.

03 Demonstrated cross-camera and cross-environment robustness.
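To make the second finding concrete, here is a minimal sketch of how accuracy and recall are computed from confusion-matrix counts, assuming a binary intent/no-intent decision. The counts are hypothetical and chosen only for illustration; they are not taken from the paper. Under that binary assumption, 100% recall means no interaction intent was missed, so any accuracy shortfall comes from false alarms.

```python
def accuracy_and_recall(tp, fp, tn, fn):
    """Return (accuracy, recall) for a binary classifier from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Recall: fraction of true intent cases that were detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical counts (NOT from the paper): no missed intents, 9 false alarms.
acc, rec = accuracy_and_recall(tp=30, fp=9, tn=61, fn=0)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

With zero false negatives, recall is 1.0 regardless of how many false positives occur, which is why the two figures can differ.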

Abstract

Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction. We present a practical multimodal framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video. Unlike prior methods requiring RGB-D sensors or GPU acceleration, our approach runs on resource-constrained embedded hardware (Raspberry Pi 5, CPU-only). To address the severe class imbalance in natural human-robot interaction datasets, we introduce MINT-RVAE (Multimodal Recurrent Variational Autoencoder for Intent Sequence Generation), a novel approach that synthesizes temporally coherent pose-emotion-label sequences for data re-balancing. Comprehensive offline evaluations under cross-subject and cross-scene protocols demonstrate strong generalization performance,…
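As a rough illustration of the per-frame fusion the abstract describes, the sketch below concatenates normalized 2D keypoints with an emotion probability vector. Everything specific here is an assumption rather than the authors' method: the COCO-style 17-keypoint layout, the hip and shoulder indices, the torso-length normalization used to approximate camera invariance, and the function names `normalize_pose` and `fuse_frame`.

```python
import math


def normalize_pose(keypoints, hip_idx=(11, 12), shoulder_idx=(5, 6)):
    """Center 2D keypoints on the mid-hip and scale by torso length.

    Assumes a COCO-style 17-keypoint layout (hypothetical choice).
    """
    hx = (keypoints[hip_idx[0]][0] + keypoints[hip_idx[1]][0]) / 2
    hy = (keypoints[hip_idx[0]][1] + keypoints[hip_idx[1]][1]) / 2
    sx = (keypoints[shoulder_idx[0]][0] + keypoints[shoulder_idx[1]][0]) / 2
    sy = (keypoints[shoulder_idx[0]][1] + keypoints[shoulder_idx[1]][1]) / 2
    # Fall back to 1.0 if the torso length is degenerate (zero).
    torso = math.hypot(sx - hx, sy - hy) or 1.0
    return [((x - hx) / torso, (y - hy) / torso) for x, y in keypoints]


def fuse_frame(keypoints, emotion_probs):
    """Concatenate flattened normalized pose with emotion probabilities."""
    pose = normalize_pose(keypoints)
    flat_pose = [coord for point in pose for coord in point]
    return flat_pose + list(emotion_probs)


# Hypothetical inputs: 17 keypoints and a 7-class emotion distribution.
kps = [(float(i), float(2 * i)) for i in range(17)]
emo = [0.1, 0.1, 0.1, 0.4, 0.1, 0.1, 0.1]
feature = fuse_frame(kps, emo)
print(len(feature))
```

Because the pose is expressed relative to the body's own center and size, translating or uniformly scaling the image coordinates leaves the fused vector unchanged, which is one simple way to reduce dependence on camera placement and resolution.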

Taxonomy

Topics: Emotion and Mood Recognition · Social Robot Interaction and HRI · Human Pose and Action Recognition