A Multimodal Framework for Human-Multi-Agent Interaction
Shaid Hasan, Breenice Lee, Sujan Sarker, and Tariq Iqbal

TL;DR
This paper presents a unified multimodal framework for human-multi-robot interaction in which each robot acts as an autonomous agent, combining multimodal perception, embodied expression, and LLM-driven planning to produce natural, coordinated behavior.
Contribution
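The paper introduces an integrated framework that combines multimodal perception, embodied expression, and LLM-based planning for human-multi-robot interaction, with centralized coordination of turn-taking at the team level. Existing systems struggle to integrate these components in a single framework.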
Findings
Successful implementation on two humanoid robots
Demonstrated coordinated multimodal reasoning and responses
Enabled natural multi-agent interaction with speech, gesture, gaze, and locomotion
Abstract
Human-robot interaction is increasingly moving toward multi-robot, socially grounded environments. Existing systems struggle to integrate multimodal perception, embodied expression, and coordinated decision-making in a unified framework. This limits natural and scalable interaction in shared physical spaces. We address this gap by introducing a multimodal framework for human-multi-agent interaction in which each robot operates as an autonomous cognitive agent with integrated multimodal perception and Large Language Model (LLM)-driven planning grounded in embodiment. At the team level, a centralized coordination mechanism regulates turn-taking and agent participation to prevent overlapping speech and conflicting actions. Implemented on two humanoid robots, our framework enables coherent multi-agent interaction through interaction policies that combine speech, gesture, gaze, and…
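To make the team-level coordination idea concrete, here is a minimal sketch of a centralized turn-taking arbiter. It is an illustration under stated assumptions, not the authors' implementation: the names `Bid`, `TurnCoordinator`, and the `relevance` score are hypothetical, and the abstract does not say how the framework decides which agent participates. The sketch only shows one simple way a single coordinator can ensure that at most one robot holds the floor at a time, so speech does not overlap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bid:
    """A request by one agent to take the next turn (hypothetical structure)."""
    agent: str        # agent identifier, e.g. a robot name
    relevance: float  # assumed score for how relevant the agent's response is
    action: str       # planned multimodal action, e.g. "speak+gesture"


class TurnCoordinator:
    """Centralized arbiter: grants the floor to at most one agent at a time."""

    def __init__(self) -> None:
        self.floor_holder: Optional[str] = None
        self.pending: List[Bid] = []

    def submit(self, bid: Bid) -> None:
        """Queue a bid; nothing is executed until the coordinator grants it."""
        self.pending.append(bid)

    def grant(self) -> Optional[Bid]:
        """Pick the highest-relevance bid if the floor is free, else None."""
        if self.floor_holder is not None or not self.pending:
            return None
        winner = max(self.pending, key=lambda b: b.relevance)
        self.pending.clear()  # losing bids are dropped for this turn
        self.floor_holder = winner.agent
        return winner

    def release(self, agent: str) -> None:
        """Free the floor once the current holder finishes its action."""
        if self.floor_holder == agent:
            self.floor_holder = None
```

In use, each robot would submit a bid after its own planner proposes a response, act only if `grant()` returns its bid, and call `release()` when done. Dropping the losing bids is a design choice made here for simplicity; a real system might instead re-queue them or let agents re-plan against the updated dialogue state.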
Taxonomy
Topics: Social Robot Interaction and HRI · Speech and dialogue systems · Multimodal Machine Learning Applications
