A Multimodal Framework for Human-Multi-Agent Interaction

Shaid Hasan, Breenice Lee, Sujan Sarker, and Tariq Iqbal

arXiv:2603.23271·cs.RO·March 25, 2026

Open Access

TL;DR

This paper presents a unified multimodal framework for human-multi-robot interaction in which each robot acts as an autonomous agent, combining multimodal perception, embodied expression, and LLM-driven planning to produce natural, coordinated behavior.

Contribution

The paper introduces an integrated framework that combines multimodal perception, embodied expression, and LLM-based planning for human-multi-robot interaction, addressing the difficulty existing systems have in unifying these components.

Findings

01 Implemented the framework on two humanoid robots

02 Demonstrated coordinated multimodal reasoning and responses

03 Enabled natural multi-agent interaction through speech, gesture, gaze, and locomotion

Abstract

Human-robot interaction is increasingly moving toward multi-robot, socially grounded environments. Existing systems struggle to integrate multimodal perception, embodied expression, and coordinated decision-making in a unified framework. This limits natural and scalable interaction in shared physical spaces. We address this gap by introducing a multimodal framework for human-multi-agent interaction in which each robot operates as an autonomous cognitive agent with integrated multimodal perception and Large Language Model (LLM)-driven planning grounded in embodiment. At the team level, a centralized coordination mechanism regulates turn-taking and agent participation to prevent overlapping speech and conflicting actions. Implemented on two humanoid robots, our framework enables coherent multi-agent interaction through interaction policies that combine speech, gesture, gaze, and…
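The centralized turn-taking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `TurnCoordinator`, the numeric "bid" scores, and the tie-breaking rule are all assumptions made for illustration. It shows only the core constraint the abstract describes, namely that a single arbiter grants the floor to one agent at a time so that speech does not overlap.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TurnCoordinator:
    """Hypothetical central arbiter: at most one agent holds the floor."""
    agents: List[str]
    speaker: Optional[str] = None  # agent currently holding the floor
    history: List[str] = field(default_factory=list)

    def request_turn(self, bids: Dict[str, float]) -> Optional[str]:
        """Grant the floor to the highest bidder, or None if it is taken.

        `bids` maps agent name to an assumed relevance score; how such a
        score would be computed is outside the scope of this sketch.
        """
        if self.speaker is not None:
            return None  # someone is already speaking: reject all bids
        valid = {a: s for a, s in bids.items() if a in self.agents}
        if not valid:
            return None
        # Ties break by registration order so the outcome is deterministic.
        winner = max(valid, key=lambda a: (valid[a], -self.agents.index(a)))
        self.speaker = winner
        self.history.append(winner)
        return winner

    def release_turn(self, agent: str) -> None:
        """Free the floor, but only if `agent` is the one holding it."""
        if self.speaker == agent:
            self.speaker = None


coordinator = TurnCoordinator(agents=["robot_a", "robot_b"])
first = coordinator.request_turn({"robot_a": 0.4, "robot_b": 0.9})
blocked = coordinator.request_turn({"robot_a": 1.0})  # floor is occupied
coordinator.release_turn("robot_b")
second = coordinator.request_turn({"robot_a": 1.0})
```

A single lock-like `speaker` field is the simplest way to rule out overlapping speech; a real system would also need timeouts and a way to handle interruptions by the human, which this sketch leaves out.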

Taxonomy

Topics: Social Robot Interaction and HRI · Speech and dialogue systems · Multimodal Machine Learning Applications