Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

Wei Yang; Defu Cao; Jiacheng Pang; Muyan Weng; Yan Liu

arXiv:2603.07972 · cs.AI · March 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces HILA, a framework enabling multi-agent LLM systems to learn when to act autonomously or defer to humans, improving collaboration and adaptability through continual learning and a dual-loop policy optimization approach.

Contribution

It proposes a novel human-in-the-loop multi-agent framework with a dual-loop policy optimization method for improved collaboration and continual learning in multi-agent LLM systems.

Findings

01. HILA outperforms existing multi-agent systems on complex benchmarks.

02. The dual-loop optimization enhances decision-making and long-term learning.

03. Continual learning improves agent reasoning over time.

Abstract

While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human-agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner…
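To make the "act autonomously or defer" decision concrete, here is a minimal toy sketch. It is not the paper's learned policy or its Dual-Loop Policy Optimization: the `Estimate` fields, the threshold rule in `should_defer`, and the example numbers are all hypothetical. It only assumes, as the reviews on this page note, that querying the expert carries a constant cost.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    # Both fields are hypothetical quantities introduced for illustration.
    p_self: float    # agent's estimated chance of solving the task alone
    p_expert: float  # estimated chance that the expert's answer is correct


def should_defer(est: Estimate, cost: float) -> bool:
    """Toy rule (an assumption, not the paper's policy): defer when the
    expert's expected success, net of a constant query cost, beats
    the agent's expected success when acting alone."""
    return est.p_expert - cost > est.p_self


def deferral_rate(estimates: list[Estimate], cost: float) -> float:
    """Fraction of tasks on which the toy rule chooses to defer."""
    return sum(should_defer(e, cost) for e in estimates) / len(estimates)


# Made-up task estimates, used only to show how the rule reacts to the cost.
tasks = [
    Estimate(0.9, 0.95),
    Estimate(0.4, 0.90),
    Estimate(0.2, 0.85),
    Estimate(0.7, 0.80),
]

if __name__ == "__main__":
    for c in (0.0, 0.2, 0.6):
        print(f"cost={c}: deferral rate={deferral_rate(tasks, c)}")
```

Sweeping the cost in this way shows that a fixed-threshold rule defers less often as the cost grows; a learned policy may behave differently.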

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* The paper is well-written, technically sound, and easy to follow.
* The proposed method is novel and principled. The option to defer to an expert is interesting for a multi-agent system, and the dual-loop training process, combined into a single DLPO loss, is principled and easy to implement.
* The empirical evaluations are extensive, spanning multiple benchmarks as well as ablation and scaling studies, all of which demonstrate the superiority of the proposed method.

Weaknesses

* The cost model is a constant $C$, which seems to be a strong assumption. There are many scenarios where querying the expert with questions of different levels would incur different costs. It is also unclear how sensitive the outcomes are to $C$.
* The "human" is proxied by gpt-4o-mini in the main experiments. It would be interesting to see how this scales with more capable models such as gpt-4o.

Reviewer 02 · Rating 4 · Confidence 2

Strengths

- Novel Framework: The idea that an agent can adaptively seek help from humans based on its own capabilities while continually improving itself through higher-level feedback is novel.
- Method's Effectiveness: In Table 1 of the paper, HILA and HILA (w/ DLPO) achieve significant gains over baseline models across six representative benchmarks.

Weaknesses

- Human expert: The paper mentions “real human experts” but does not clarify who these participants are, how they were selected, or whether the framework accounts for human cognitive cost or fatigue.
- Performance gains: The relationship between the performance gains and external feedback requires careful analysis.
- Presentation: Table 2 column 1 Model.
- Although the proposed method achieves improvements across the six benchmarks in Table 1, its performance remains below the SOTA (broad pe

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The novelty of the idea is good.

Weaknesses

- Could include further related work, such as "LLM-Mediated Guidance of MARL Systems" (Siedler et al.), in the "human-in-the-loop" paragraph
- Figure 1 has not been incorporated in the text (sorry if I missed this)
- A flow/process diagram supporting the methodology would greatly improve comprehension
- I would not consider MMLU a "general knowledge reasoning" benchmark; it's language understanding
- I think there should have been either more experiments for models such as "GPT-4o-mini as a prox

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Reinforcement Learning in Robotics