Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning
Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu

TL;DR
This paper introduces HILA, a framework enabling multi-agent LLM systems to learn when to act autonomously or defer to humans, improving collaboration and adaptability through continual learning and a dual-loop policy optimization approach.
Contribution
It proposes a novel human-in-the-loop multi-agent framework with a dual-loop policy optimization method for improved collaboration and continual learning in multi-agent LLM systems.
Findings
HILA outperforms existing multi-agent systems on complex benchmarks.
The dual-loop optimization enhances decision-making and long-term learning.
Continual learning improves agent reasoning over time.
Abstract
While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human-agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner…
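As a rough illustration of the solve-or-defer decision described in the abstract, the sketch below uses a simple threshold rule: defer only when the estimated gain from asking the expert exceeds a constant query cost (the reviews mention a constant cost $C$). This is not the paper's learned policy or its DLPO objective. The names `Estimate`, `choose_action`, and `cost_sweep`, the probability estimates, and the decision rule itself are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Estimate:
    """Hypothetical success estimates for one task (illustrative only)."""
    p_self: float    # assumed chance the agent solves the task on its own
    p_expert: float  # assumed chance of success after deferring to the expert


def choose_action(est: Estimate, query_cost: float) -> str:
    """Toy rule: defer only if the expected gain exceeds the constant cost.

    This stands in for a learned metacognitive policy; it is an assumption,
    not the method from the paper.
    """
    gain = est.p_expert - est.p_self
    return "defer" if gain > query_cost else "solve"


def cost_sweep(est: Estimate, costs: Iterable[float]) -> Dict[float, str]:
    """Show how the decision changes as the constant query cost varies."""
    return {c: choose_action(est, c) for c in costs}


if __name__ == "__main__":
    hard_task = Estimate(p_self=0.4, p_expert=0.9)
    easy_task = Estimate(p_self=0.8, p_expert=0.9)
    print(choose_action(hard_task, query_cost=0.2))
    print(choose_action(easy_task, query_cost=0.2))
    print(cost_sweep(hard_task, [0.1, 0.6]))
```

Sweeping the cost, as `cost_sweep` does, is one simple way to probe how sensitive such a rule is to the choice of the constant. A learned policy would replace the fixed threshold with a decision conditioned on the task and the agents' state.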
Peer Reviews
Decision·ICLR 2026 Poster
* The paper is well written, technically sound, and easy to follow.
* The proposed method is novel and principled. The option to defer to an expert is interesting for a multi-agent system, and combining the dual-loop training process into a single DLPO loss is principled and easy to implement.
* The empirical evaluation is extensive, spanning multiple benchmarks as well as ablation and scaling studies, all of which demonstrate the superiority of the proposed method.
* The cost model is a constant $C$, which seems to be a strong assumption. In many scenarios, querying the expert with questions at different levels would incur different costs. It is also unclear how sensitive the outcomes are to $C$.
* The "human" is proxied by gpt-4o-mini in the main experiments. It would be interesting to see how this scales with more capable models such as gpt-4o.
- Novel framework: The idea that an agent can adaptively seek help from humans based on its own capabilities, while continually improving itself through higher-level feedback, is novel.
- Effectiveness of the method: In Table 1 of the paper, HILA and HILA (w/ DLPO) achieve significant gains over baseline models across six representative benchmarks.
- Human experts: The paper mentions “real human experts” but does not clarify who these participants are, how they were selected, or whether the framework accounts for human cognitive cost or fatigue.
- Performance gains: The relationship between the performance gain and external feedback requires careful analysis.
- Presentation: Table 2 column 1 Model.
- Although the proposed method achieves improvements across the six benchmarks in Table 1, its performance remains below the SOTA (broad pe
Novelty of the idea is good
- Could include further related work, such as "LLM-Mediated Guidance of MARL Systems" (Siedler et al.), in the "human-in-the-loop" paragraph.
- Figure 1 is not referenced in the text (sorry if I missed this).
- A flow/process diagram supporting the methodology would greatly aid comprehension.
- I would not consider MMLU a "general knowledge reasoning" benchmark; it is a language understanding benchmark.
- I think there should have been either more experiments for models such as "GPT-4o-mini as a prox
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Reinforcement Learning in Robotics
