A Framework for Low-Latency, LLM-driven Multimodal Interaction on the Pepper Robot

Erich Studerus; Vivienne Jia Zhong; Stephan Vonschallen

arXiv:2603.21013 · cs.AI · March 24, 2026 · HRI

Open Access

TL;DR

This paper introduces an open-source Android framework for the Pepper robot that enables low-latency, multimodal interaction by integrating end-to-end speech models and advanced function calling for autonomous robot control.

Contribution

It presents a novel framework that combines end-to-end speech processing with LLM-based multimodal perception and control, addressing latency and capability limitations of prior systems.

Findings

01 Achieved low-latency speech-to-speech interaction preserving paralinguistic cues

02 Enabled LLM-driven multimodal perception and autonomous robot actions

03 Framework is adaptable to standard Android devices

Abstract

Despite recent advances in integrating Large Language Models (LLMs) into social robotics, two weaknesses persist. First, existing implementations on platforms like Pepper often rely on cascaded Speech-to-Text (STT)->LLM->Text-to-Speech (TTS) pipelines, resulting in high latency and the loss of paralinguistic information. Second, most implementations fail to fully leverage the LLM's capabilities for multimodal perception and agentic control. We present an open-source Android framework for the Pepper robot that addresses these limitations through two key innovations. First, we integrate end-to-end Speech-to-Speech (S2S) models to achieve low-latency interaction while preserving paralinguistic cues and enabling adaptive intonation. Second, we implement extensive Function Calling capabilities that elevate the LLM to an agentic planner, orchestrating robot actions (navigation, gaze control,…
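To make the function-calling idea in the abstract concrete, the following is a minimal sketch of how a model-issued function call could be routed to a robot action handler. It is an illustration only and not the framework's implementation (which targets Android): the tool names `navigate_to` and `look_at`, their parameters, the JSON shape with `name` and `arguments` fields, and the `dispatch` helper are all assumptions made for this example.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool registry. Names and parameters are illustrative only,
# not the framework's actual function schema.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Register a handler under the name the model is allowed to call."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("navigate_to")
def navigate_to(location: str) -> str:
    # A real robot would start a navigation action here; this stub only reports it.
    return f"navigating to {location}"


@tool("look_at")
def look_at(target: str) -> str:
    # Stand-in for a gaze-control action.
    return f"looking at {target}"


def dispatch(call_json: str) -> Dict[str, Any]:
    """Run one model-issued function call and return a result to feed back to the model."""
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "call must be a JSON object"}

    name = call.get("name")
    args = call.get("arguments", {})
    handler = TOOLS.get(name) if isinstance(name, str) else None
    if handler is None:
        return {"ok": False, "error": f"unknown function: {name}"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be a JSON object"}

    try:
        return {"ok": True, "result": handler(**args)}
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected parameters.
        return {"ok": False, "error": f"bad arguments: {exc}"}


if __name__ == "__main__":
    print(dispatch('{"name": "navigate_to", "arguments": {"location": "kitchen"}}'))
    print(dispatch('{"name": "fly", "arguments": {}}'))
```

Returning a structured error instead of raising lets the model see why a call failed and try again, which is one plausible way to keep an agentic planning loop running when the model requests an unsupported action.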

Taxonomy

Topics: Social Robot Interaction and HRI · Speech and dialogue systems · Multimodal Machine Learning Applications