AIPO: Learning to Reason from Active Interaction
Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari

TL;DR
AIPO is a multi-agent reinforcement learning framework in which the policy model proactively consults specialized agents during training, improving LLM reasoning performance and expanding the model's capability boundary.
Contribution
The paper proposes AIPO, a multi-agent RL approach that lets LLMs actively seek targeted guidance, addressing the exploration limits that the policy model's own capability boundary imposes on existing RL algorithms.
Findings
AIPO consistently improves reasoning performance across multiple benchmarks.
AIPO generalizes well across different policy models and RL algorithms.
AIPO effectively expands the reasoning capability boundary of the policy model.
Abstract
Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose AIPO, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional…
