Reducing Cognitive Overhead in Tool Use via Multi-Small-Agent Reinforcement Learning

Dayu Wang; Jiaye Yang; Weikang Li; Jiahui Liang; Yang Li

arXiv:2508.08882 · cs.AI · October 14, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces MSARL, a multi-agent reinforcement learning framework that separates reasoning and tool use into specialized small agents, improving stability and accuracy in complex problem solving.

Contribution

MSARL is a novel multi-agent framework that explicitly decouples reasoning from tool use, enhancing performance and scalability in tool-integrated AI systems.

Findings

01

MSARL outperforms single-agent baselines in mathematical problem solving.

02

Decoupling reasoning and tool use improves stability and accuracy.

03

The architecture generalizes across diverse tool-use tasks.

Abstract

Recent advances in multi-agent systems highlight the potential of specialized small agents that collaborate via division of labor. Existing tool-integrated reasoning systems, however, often follow a single-agent paradigm in which one large model interleaves long-horizon reasoning with precise tool operations, leading to cognitive-load interference and unstable coordination. We present MSARL, a Multi-Small-Agent Reinforcement Learning framework that explicitly decouples reasoning from tool use. In MSARL, a Reasoning Agent decomposes problems and plans tool invocations, while multiple Tool Agents specialize in specific external tools, each trained via a combination of imitation learning and reinforcement learning with role-specific rewards. On mathematical problem solving with code execution, MSARL significantly improves reasoning stability and final-answer accuracy over single-agent…
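
The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `Step`, `Episode`, `run_episode`, the toy agents, and the `max_tool_calls` budget (standing in for the tool-call cap the reviews refer to as C) are all hypothetical names chosen for illustration, and both agents are plain functions rather than trained models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One decision from the reasoning agent (hypothetical schema)."""
    tool_request: Optional[str] = None   # payload for the tool agent, if any
    final_answer: Optional[str] = None   # set once the agent is finished


@dataclass
class Episode:
    """State shared between turns: the question and tool observations so far."""
    question: str
    observations: List[str] = field(default_factory=list)
    tool_calls: int = 0


def run_episode(
    question: str,
    reasoning_agent: Callable[[Episode], Step],
    tool_agent: Callable[[str], str],
    max_tool_calls: int = 4,  # assumed budget, analogous to a cap on tool calls
) -> Optional[str]:
    """Alternate between planning and tool use until an answer or the cap."""
    ep = Episode(question)
    while True:
        step = reasoning_agent(ep)
        if step.final_answer is not None:
            return step.final_answer
        # Stop if the planner asks for nothing, or the tool budget is spent.
        if step.tool_request is None or ep.tool_calls >= max_tool_calls:
            return None
        ep.observations.append(tool_agent(step.tool_request))
        ep.tool_calls += 1


def toy_reasoner(ep: Episode) -> Step:
    """Toy planner: request one tool call, then answer with its result."""
    if not ep.observations:
        return Step(tool_request="1,2,3")
    return Step(final_answer=ep.observations[-1])


def toy_tool(request: str) -> str:
    """Toy tool agent: sums a comma-separated list of integers."""
    return str(sum(int(x) for x in request.split(",")))


if __name__ == "__main__":
    print(run_episode("What is 1 + 2 + 3?", toy_reasoner, toy_tool))  # prints 6
```

The point of the sketch is the interface: the planner never executes anything itself, and the tool agent never sees the full problem, only the request it is handed.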

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

Originality

- Clear role separation (planning vs. tool interpretation) with a concrete collaboration reward shaping, a principled take on multi-agent RL for tool use

Quality

- End-to-end RL implementation with grouped rollouts, per-role objectives, and practical engineering (idle-time mitigation via C)
- Competitive results across AIME24/25, MATH500, Olympiad, AMC23; helpful ablation (trained vs. untrained dual-agent)

Clarity

- System diagram, roll-out prompts, and pseudocode (Alg. 1) make…

Weaknesses

Complexity–performance trade-off

- The dual-agent MSARL pipeline introduces substantial architectural and operational complexity (role separation, inter-agent messaging, subgrouped rollouts, tool interpreter integration, global batch size of 1, and a hard cap on tool calls C) without a commensurate accounting of its costs. The paper reports accuracy gains (e.g., +5.9 Pass@1 over the strongest single-agent TIR baseline) but does not quantify end-to-end training/inference latency, GPU utilization/…

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The writing is clear.
2. The experimental setup is easy to follow.

Weaknesses

1. The “cognitive overhead” evidence is narrow (three small backbones, N=5 samples, one judge), limiting generality.
2. The method is only demonstrated at 1.5B scale; larger models are not reported.
3. The abstract claims to mitigate hallucinations and to boost tool-invocation tendencies, but there is no quantitative hallucination metric and invocation is capped, preventing analysis of tendency.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The dual-agent decomposition is conceptually simple yet effective. The reasoning–tool separation mirrors cognitive modularity and provides a clean interface for information flow.
- MSARL-1.5B achieves 55.9 Pass@1, outperforming both larger (7B) and fine-tuned baselines by up to 5.9 points. Ablations show consistent improvements and stable training dynamics. The inclusion of Pass@8 / Maj@8 metrics demonstrates stability gains.
- The implementation details (datasets, decoding settings, batch, to…

Weaknesses

- All experiments are in mathematical reasoning with code execution. Although the method claims to generalize to “multi-tool” settings, this is not empirically demonstrated.
- Dual-agent training introduces additional overhead. The actual wall-clock or compute cost vs. single-agent RL baselines is missing.
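
For readers unfamiliar with the Pass@8 and Maj@8 metrics mentioned in the reviews, the sketch below shows one common way such numbers are computed from k sampled answers per problem. The paper's exact evaluation protocol is not given on this page, so the definitions here are assumptions (any-correct for Pass@k, plurality vote for Maj@k), and the function names `pass_at_k`, `maj_at_k`, and `evaluate` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def pass_at_k(samples: Sequence[str], reference: str) -> float:
    """1.0 if any of the k sampled answers matches the reference."""
    return 1.0 if reference in samples else 0.0


def maj_at_k(samples: Sequence[str], reference: str) -> float:
    """1.0 if the most frequent sampled answer matches the reference.

    Ties are broken in favor of the answer that appears first, which is
    how Counter.most_common orders equal counts.
    """
    if not samples:
        return 0.0
    top_answer, _ = Counter(samples).most_common(1)[0]
    return 1.0 if top_answer == reference else 0.0


def evaluate(
    per_problem_samples: List[List[str]], references: List[str]
) -> Dict[str, float]:
    """Average both metrics over a set of problems."""
    if len(per_problem_samples) != len(references):
        raise ValueError("need one list of samples per reference answer")
    n = len(references)
    pairs = list(zip(per_problem_samples, references))
    return {
        "pass@k": sum(pass_at_k(s, r) for s, r in pairs) / n,
        "maj@k": sum(maj_at_k(s, r) for s, r in pairs) / n,
    }


if __name__ == "__main__":
    # Two toy problems with k = 3 sampled answers each (made-up data).
    samples = [["4", "4", "5"], ["7", "8", "8"]]
    refs = ["4", "7"]
    print(evaluate(samples, refs))  # {'pass@k': 1.0, 'maj@k': 0.5}
```

Under these definitions, a large gap between Pass@k and Maj@k suggests that correct answers are being sampled but are not the dominant output, which is why the pair is often read as a stability indicator.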

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Advanced Software Engineering Methodologies