Bridging Speech, Emotion, and Motion: a VLM-based Multimodal Edge-deployable Framework for Humanoid Robots
Songhua Yang, Xuetao Li, Xuanye Fei, Mengde Li, Miao Li

TL;DR
This paper introduces SeM$^2$, a multimodal framework leveraging Vision Language Models to enable emotionally coherent speech, facial expressions, and gestures in humanoid robots, suitable for on-device deployment and real-world interaction.
Contribution
The paper presents SeM$^2$, a novel VLM-based framework with a Semantic-Sequence Aligning Mechanism for synchronized multimodal human-robot interaction, including an efficient edge-deployable version.
Findings
The edge-deployed version retains 95% of the cloud version's performance.
Outperforms unimodal baselines in naturalness and emotional clarity.
Effective coordination of speech, facial expressions, and gestures.
Abstract
Effective human-robot interaction requires emotionally rich multimodal expressions, yet most humanoid robots lack coordinated speech, facial expressions, and gestures. Meanwhile, real-world deployment demands on-device solutions that can operate autonomously without continuous cloud connectivity. To bridge Speech, Emotion, and Motion, we present SeM$^2$, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions through three key components: a multimodal perception module that captures user contextual cues, Chain-of-Thought reasoning for response planning, and a novel Semantic-Sequence Aligning Mechanism (SSAM) that ensures precise temporal coordination between verbal content and physical expressions. We implement both cloud-based and edge-deployed…
Taxonomy
Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Robot Manipulation and Learning
