Advancing Multi-Robot Networks via MLLM-Driven Sensing, Communication, and Computation: A Comprehensive Survey

Hyun Jong Yang; Howon Lee; Kyuhong Shim; Jeongho Kwak; Hyunsoo Kim; Donghoon Kim; Khoa Anh Ngo; Sehyun Ryu; Jaehyun Choi; Youbin Kim; Chanjun Moon; Michael Ryoo; and Byonghyo Shim

arXiv:2604.00061 · cs.RO · April 2, 2026

TL;DR

This survey explores how multimodal large language models (MLLMs) can coordinate multi-robot systems by optimizing sensing, communication, and computation under resource constraints, and illustrates the approach with four end-to-end demonstrations.

Contribution

The survey provides a comprehensive review of integrated system design for MLLM-guided multi-robot coordination, emphasizing resource-aware orchestration strategies.

Findings

1. R2X orchestration improves system-level metrics such as payload, latency, and success rate.

2. Four end-to-end demonstrations showcase practical applications of MLLM-guided multi-robot coordination.

3. On these system-level metrics, the integrated design outperforms on-device baselines, which supports its effectiveness.

Abstract

Imagine advanced humanoid robots, powered by multimodal large language models (MLLMs), coordinating missions across industries like warehouse logistics, manufacturing, and safety rescue. While individual robots show local autonomy, realistic tasks demand coordination among multiple agents sharing vast streams of sensor data. Communication is indispensable, yet transmitting comprehensive data can overwhelm networks, especially when a system-level orchestrator or cloud-based MLLM fuses multimodal inputs for route planning or anomaly detection. These tasks are often initiated by high-level natural language instructions. This intent serves as a filter for resource optimization: by understanding the goal via MLLMs, the system can selectively activate relevant sensing modalities, dynamically allocate bandwidth, and determine computation placement. Thus, R2X is fundamentally an…
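The idea of intent acting as a filter for resource optimization can be illustrated with a minimal Python sketch. It is not the survey's method. The modality table, payload figures, keyword sets, and the `plan_resources` function are all hypothetical, and a plain keyword match stands in for the MLLM's understanding of the instruction. The sketch only shows the three decisions the abstract names: which sensing modalities to activate, how to split bandwidth among them, and where to place computation.

```python
from dataclasses import dataclass

# Hypothetical modality table (assumption, not from the paper):
# name -> (payload in kilobits per frame, keywords that make it relevant).
MODALITIES = {
    "rgb_camera": (800.0, {"inspect", "find", "locate"}),
    "lidar": (400.0, {"navigate", "route", "locate"}),
    "thermal": (200.0, {"fire", "rescue", "overheat"}),
}


@dataclass
class Plan:
    active: list            # modalities switched on for this task
    bandwidth_share: dict   # fraction of the link given to each active modality
    placement: str          # "cloud" or "on_device"


def plan_resources(instruction: str, link_kbps: float, deadline_s: float) -> Plan:
    # Stand-in for MLLM intent parsing: a plain keyword match on the instruction.
    words = set(instruction.lower().split())
    active = [name for name, (_, keys) in MODALITIES.items() if words & keys]

    # Split bandwidth in proportion to each active modality's payload.
    total_kb = sum(MODALITIES[m][0] for m in active)
    share = {m: MODALITIES[m][0] / total_kb for m in active} if total_kb else {}

    # Offload only if one frame of every active modality uploads within the deadline.
    upload_s = total_kb / link_kbps if link_kbps > 0 else float("inf")
    placement = "cloud" if active and upload_s <= deadline_s else "on_device"
    return Plan(active, share, placement)


if __name__ == "__main__":
    print(plan_resources("navigate to the fire exit", link_kbps=2000, deadline_s=0.5))
    print(plan_resources("navigate to the fire exit", link_kbps=500, deadline_s=0.5))
```

With the same instruction, the first call offloads to the cloud because the selected payload fits the deadline on the faster link. The second keeps computation on the device because the slower link cannot meet the deadline. A real orchestrator would replace each of these heuristics with learned or optimized policies.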
