Transforming Monolithic Foundation Models into Embodied Multi-Agent Architectures for Human-Robot Collaboration

Nan Sun; Bo Mao; Yongchang Li; Chenxu Wang; Di Guo; Huaping Liu

arXiv:2512.00797 · cs.RO · December 2, 2025

TL;DR

This paper introduces InteractGen, a multi-agent framework powered by large language models that decomposes robot intelligence into specialized agents, enhancing human-robot collaboration and adaptability in service robots.

Contribution

The paper proposes a novel multi-agent architecture that integrates foundation models as regulated components, enabling scalable, adaptable, and socially grounded robot autonomy.

Findings

01 Improved task success rates in real-world deployment

02 Enhanced adaptability and human-robot collaboration

03 Demonstrated effectiveness over monolithic models

Abstract

Foundation models have become central to unifying perception and planning in robotics, yet real-world deployment exposes a mismatch between their monolithic assumption that a single model can handle all cognitive functions and the distributed, dynamic nature of practical service workflows. Vision-language models offer strong semantic understanding but lack embodiment-aware action capabilities while relying on hand-crafted skills. Vision-Language-Action policies enable reactive manipulation but remain brittle across embodiments, weak in geometric grounding, and devoid of proactive collaboration mechanisms. These limitations indicate that scaling a single model alone cannot deliver reliable autonomy for service robots operating in human-populated settings. To address this gap, we present InteractGen, an LLM-powered multi-agent framework that decomposes robot intelligence into specialized…

Taxonomy

Topics: Social Robot Interaction and HRI · Human-Automation Interaction and Safety · Robot Manipulation and Learning