Bridging Speech, Emotion, and Motion: a VLM-based Multimodal Edge-deployable Framework for Humanoid Robots

Songhua Yang; Xuetao Li; Xuanye Fei; Mengde Li; Miao Li

arXiv:2602.07434 · cs.RO · February 10, 2026

TL;DR

This paper introduces SeM$^2$, a multimodal framework leveraging Vision Language Models to enable emotionally coherent speech, facial expressions, and gestures in humanoid robots, suitable for on-device deployment and real-world interaction.

Contribution

The paper presents SeM$^2$, a novel VLM-based framework with a Semantic-Sequence Aligning Mechanism for synchronized multimodal human-robot interaction, including an efficient edge-deployable version.

Findings

01. The edge-deployed version retains 95% of the cloud version's performance.
02. The framework outperforms unimodal baselines in naturalness and emotional clarity.
03. Speech, facial expressions, and gestures are coordinated effectively.

Abstract

Effective human-robot interaction requires emotionally rich multimodal expressions, yet most humanoid robots lack coordinated speech, facial expressions, and gestures. Meanwhile, real-world deployment demands on-device solutions that can operate autonomously without continuous cloud connectivity. To bridge Speech, Emotion, and Motion, we present SeM$^2$, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions through three key components: a multimodal perception module that captures the user's contextual cues, Chain-of-Thought reasoning for response planning, and a novel Semantic-Sequence Aligning Mechanism (SSAM) that ensures precise temporal coordination between verbal content and physical expressions. We implement both cloud-based and edge-deployed…

Taxonomy

Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Robot Manipulation and Learning