A Framework for Low-Latency, LLM-driven Multimodal Interaction on the Pepper Robot
Erich Studerus, Vivienne Jia Zhong, Stephan Vonschallen

TL;DR
This paper introduces an open-source Android framework for the Pepper robot that enables low-latency, multimodal interaction by integrating end-to-end speech models and advanced function calling for autonomous robot control.
Contribution
The paper presents a framework that combines end-to-end speech processing with LLM-based multimodal perception and control, addressing the latency and capability limitations of prior cascaded STT->LLM->TTS systems.
Findings
Achieved low-latency speech-to-speech interaction preserving paralinguistic cues
Enabled LLM-driven multimodal perception and autonomous robot actions
Framework is adaptable to standard Android devices
Abstract
Despite recent advances in integrating Large Language Models (LLMs) into social robotics, two weaknesses persist. First, existing implementations on platforms like Pepper often rely on cascaded Speech-to-Text (STT)->LLM->Text-to-Speech (TTS) pipelines, resulting in high latency and the loss of paralinguistic information. Second, most implementations fail to fully leverage the LLM's capabilities for multimodal perception and agentic control. We present an open-source Android framework for the Pepper robot that addresses these limitations through two key innovations. First, we integrate end-to-end Speech-to-Speech (S2S) models to achieve low-latency interaction while preserving paralinguistic cues and enabling adaptive intonation. Second, we implement extensive Function Calling capabilities that elevate the LLM to an agentic planner, orchestrating robot actions (navigation, gaze control,…
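The function-calling idea in the abstract can be illustrated with a minimal sketch: the model emits a structured call (a tool name plus JSON arguments), and a dispatcher on the robot side maps it to an action and returns a result string that can be fed back to the model. Everything named here (the `TOOLS` registry, `navigate_to`, `look_at`, `dispatch`, and the JSON shape) is a hypothetical illustration, not the framework's actual API, and the handlers only return strings instead of driving a robot.

```python
import json
from typing import Callable, Dict

# Hypothetical tool registry; the names are illustrative, not the framework's API.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function under the name the LLM would use to call it."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("navigate_to")
def navigate_to(location: str) -> str:
    # A real handler would command the robot; this stub only reports the intent.
    return f"navigating to {location}"


@tool("look_at")
def look_at(target: str) -> str:
    return f"looking at {target}"


def dispatch(call_json: str) -> str:
    """Parse one model-emitted call and run the matching handler.

    Assumed input shape: {"name": "<tool>", "arguments": {...}}.
    Errors are returned as strings so they can be passed back to the model.
    """
    call = json.loads(call_json)
    name = call.get("name")
    args = call.get("arguments", {})
    handler = TOOLS.get(name)
    if handler is None:
        return f"error: unknown tool {name!r}"
    try:
        return handler(**args)
    except TypeError as exc:
        # Raised when the arguments do not match the handler's signature.
        return f"error: bad arguments for {name}: {exc}"


if __name__ == "__main__":
    print(dispatch('{"name": "navigate_to", "arguments": {"location": "kitchen"}}'))
    print(dispatch('{"name": "fly", "arguments": {}}'))
```

Returning errors as plain strings, rather than raising, is one possible design choice: it lets the planner see the failure and choose a different action on the next turn.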
Taxonomy
Topics: Social Robot Interaction and HRI · Speech and dialogue systems · Multimodal Machine Learning Applications
