MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning
Kuo Yang, Xingjie Yang, Linhui Yu, Qing Xu, Yan Fang, Xu Wang, Zhengyang Zhou, Yang Wang

TL;DR
MasHost is a novel reinforcement learning framework that autonomously constructs multi-agent systems by optimizing agent roles, interactions, and rationality, outperforming existing semi-autonomous methods on multiple benchmarks.
Contribution
This work introduces MasHost, the first RL-based framework for autonomous multi-agent system construction with a focus on structure rationality and multi-objective optimization.
Findings
Outperforms most baselines on six benchmarks
Demonstrates effectiveness and efficiency of the RL-based approach
Validates the importance of component rationality in Mas design
Abstract
Large Language Model (LLM)-driven Multi-agent systems (Mas) have recently emerged as a powerful paradigm for tackling complex real-world tasks. However, existing Mas construction methods typically rely on manually crafted interaction mechanisms or heuristic rules, introducing human biases and constraining the autonomous ability. Even with recent advances in adaptive Mas construction, existing systems largely remain within the paradigm of semi-autonomous patterns. In this work, we propose MasHost, a Reinforcement Learning (RL)-based framework for autonomous and query-adaptive Mas design. By formulating Mas construction as a graph search problem, our proposed MasHost jointly samples agent roles and their interactions through a unified probabilistic sampling mechanism. Beyond the accuracy and efficiency objectives pursued in prior works, we introduce component rationality as an additional…
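The abstract's idea of sampling a role and its connection in one draw can be illustrated with a minimal sketch. This is not the authors' sampling mechanism: the role names, the uniform placeholder logits, and the rule that each new agent attaches to exactly one existing node are all assumptions made for illustration. It only shows how a single categorical distribution over (role, parent) pairs yields both decisions and one log-probability per step.

```python
import math
import random

ROLES = ["planner", "coder", "critic"]  # hypothetical role names, not from the paper


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_role_and_edge(logits, num_nodes, rng):
    """Draw (role, parent) from ONE categorical distribution.

    logits[r][p] scores adding an agent with role r attached to existing node p.
    """
    flat = [logits[r][p] for r in range(len(ROLES)) for p in range(num_nodes)]
    probs = softmax(flat)
    idx = rng.choices(range(len(flat)), weights=probs, k=1)[0]
    role_idx, parent = divmod(idx, num_nodes)  # undo the flattening
    return ROLES[role_idx], parent, math.log(probs[idx])


def build_graph(steps, rng):
    nodes = ["query"]  # node 0 stands for the input query
    edges = []
    total_log_prob = 0.0
    for _ in range(steps):
        n = len(nodes)
        # Assumption: uniform logits stand in for a learned policy network.
        logits = [[0.0] * n for _ in ROLES]
        role, parent, log_prob = sample_role_and_edge(logits, n, rng)
        nodes.append(role)
        edges.append((parent, n))  # edge from the chosen parent to the new agent
        total_log_prob += log_prob
    return nodes, edges, total_log_prob
```

Because the role and the edge come from the same draw, a policy-gradient method needs only the single `total_log_prob` per graph rather than separate terms for two samplers.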
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Creative integration of hierarchical RL with LLM-based multi-agent systems. 2. The proposed JPSS and HRPO mechanisms are technically sound and novel, jointly optimizing node and edge decisions with structured reward shaping for efficiency and rationality. 3. Extensive experiments on reasoning and coding benchmarks demonstrate consistent improvements over strong baselines, supported by ablation and interpretability analyses.
1. The paper requires further language and formatting refinement. There are multiple inconsistencies in mathematical notation and terminology. For example, the symbol `R` in line 145 is redefined several times and does not align with `r` in Equation (1); the word `DELTE` in line 226 should be `DELETE`. A unified notation system and simpler symbol style would greatly improve readability. 2. The work lacks strong originality. Conceptually, it mainly applies a hierarchical reinforcement learning
(1) This work pioneers the first RL-driven framework for fully autonomous Mas graph construction, breaking free from the constraints of traditional semi-autonomous paradigms (e.g., candidate pool sampling, fixed workflows). By addressing the gradient discretization issue in dual-decision (role-connection) processes through JPSS and enabling multi-objective (performance-efficiency-rationality) coordinated optimization via HRPO, it fills a critical gap in the field. (2) The innovative introduction
(1) The paper references a "full-scale role space" but provides no clarity on its specific composition (e.g., number of roles, classification criteria) or construction method (manual definition vs. automatic generation). If the role space relies on manual initialization, it may introduce implicit biases, conflicting with the goal of "full autonomy." (2) Experiments only use GPT-4o-mini as the LLM executor, without testing the framework’s performance with LLMs of varying capabilities (e.g., GPT-4
MasHost directly targets autonomy in multi-agent LLM systems, which is an increasingly high-impact problem. In most agent frameworks today, a human designs a fixed graph of roles like “planner”, “retriever”, “coder”, “critic”, and wires them together by hand. Even the more “adaptive” methods tend to select from a catalog of known roles and stitch them using heuristic routing, so they are still semi-automatic rather than truly self-organizing. Here, the authors explicitly state that MasHost is me
There are some open questions that could weaken the paper if not addressed clearly in the main text. The most important one is how exactly MasHost is trained. The abstract frames MasHost as an RL policy over graphs, and introduces HRPO, but does not describe how they get reward signals. Do they actually execute the proposed multi-agent team on the downstream benchmark task each episode, collect a final task reward, and backpropagate credit using policy gradients over the sampled graph actions? I
* The problem of automating MAS architecture design is both novel and challenging, fitting well within the emerging trend of self-organizing multi-agent systems. * The proposed JPSS + HRPO combination presents a clean and unified training pipeline that bridges architecture search and RL optimization. * The experiments are fairly comprehensive in terms of task variety, demonstrating potential generality. * The qualitative examples provide a clear sense of how the MAS evolves and learns agent s
* The executor dependency is a serious concern: results rely heavily on GPT-4o-mini (with temperature = 0), and no experiments are provided using alternative language models. This limits reproducibility and generality. * The technical details of HRPO are underdeveloped. It is unclear how the advantage function is computed when two interdependent stochastic policies (role and edge) are being optimized simultaneously. There is no discussion of gradient variance or credit assignment. * The **rati
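The reviewers' questions about reward signals and about how one advantage is shared between a role policy and an edge policy can be made concrete with a toy sketch. This is plain REINFORCE with a batch-mean baseline, not the paper's HRPO, whose details are not given here. The two-role, two-edge setup, the `reward_fn` callback, and all hyperparameters are hypothetical. The point shown is that the joint log-probability factorizes as log p(role) + log p(edge | role), so the same advantage multiplies both gradient terms.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def train(reward_fn, steps=300, batch=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    role_logits = [0.0, 0.0]
    edge_logits = [[0.0, 0.0], [0.0, 0.0]]  # one row of edge logits per role
    for _ in range(steps):
        episodes = []
        for _ in range(batch):
            r = sample(softmax(role_logits), rng)
            e = sample(softmax(edge_logits[r]), rng)  # edge choice depends on role
            episodes.append((r, e, reward_fn(r, e)))
        baseline = sum(ep[2] for ep in episodes) / batch  # variance reduction
        role_grad = [0.0, 0.0]
        edge_grad = [[0.0, 0.0], [0.0, 0.0]]
        p_role = softmax(role_logits)
        for r, e, reward in episodes:
            advantage = reward - baseline  # shared by both log-prob terms
            p_edge = softmax(edge_logits[r])
            for k in range(2):
                # Gradient of log-softmax w.r.t. logit k is onehot[k] - prob[k].
                role_grad[k] += advantage * ((1.0 if k == r else 0.0) - p_role[k])
                edge_grad[r][k] += advantage * ((1.0 if k == e else 0.0) - p_edge[k])
        for k in range(2):
            role_logits[k] += lr * role_grad[k] / batch
            for r in range(2):
                edge_logits[r][k] += lr * edge_grad[r][k] / batch
    return role_logits, edge_logits
```

In a real system, `reward_fn` would presumably involve executing the sampled agent team on a task, which is exactly the cost and credit-assignment concern the reviews raise. Here it is a cheap stand-in.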
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Big Data and Digital Economy
Methods: Mixing Adam and SGD
