$n$-Musketeers: Reinforcement Learning Shapes Collaboration Among Language Models

Ryozo Masukawa; Sanggeon Yun; Hyunwoo Oh; SuhgHeon Jeong; Raheeb Hassa; Hanning Chen; Wenjun Huang; Mahdi Imani; Pietro Mercati; Nathaniel D. Bastian; Mohsen Imani

arXiv:2602.09173 · cs.LG · February 11, 2026

Open Access

TL;DR

This paper introduces a method for integrating multiple specialized language models through their internal states using trainable attention, enabling effective collaboration and reasoning without large monolithic models, and reveals how expert attention evolves during training.

Contribution

It proposes soft hidden-state collaboration for integrating frozen language model experts via attention, providing a new mechanism for structured reasoning and insights into expert utilization patterns.
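The attention-based integration described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each frozen expert contributes a single hidden vector of the same dimension, and that a query vector (which would be trainable in practice) scores the experts with scaled dot-product attention. The names `fuse_expert_states`, `query`, and `expert_states` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_expert_states(query, expert_states):
    """Attend over one hidden vector per frozen expert.

    Assumption: all experts share the same hidden size; heterogeneous
    experts would first need a projection into a common space.
    Returns (attention weights, fused vector).
    """
    dim = len(query)
    # Scaled dot-product score between the query and each expert state.
    scores = [
        sum(q * h for q, h in zip(query, state)) / math.sqrt(dim)
        for state in expert_states
    ]
    weights = softmax(scores)
    # Soft combination: weighted sum of the expert hidden states.
    fused = [
        sum(w * state[i] for w, state in zip(weights, expert_states))
        for i in range(dim)
    ]
    return weights, fused


# Toy example with three hypothetical experts and a 2-dimensional hidden space.
query = [1.0, 0.0]
expert_states = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, fused = fuse_expert_states(query, expert_states)
```

In this toy setup the expert whose state aligns with the query receives the largest weight. In a trained system only the attention interface would be updated, while the experts stay frozen.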

Findings

01 Performance competitive with strong single-model RLVR baselines on Reasoning Gym and GSM8K

02 Emergent specialization in expert attention during training

03 Static expert preferences largely explain gains on simpler arithmetic tasks, while harder tasks induce increasingly concentrated, structured expert attention

Abstract

Recent progress in reinforcement learning with verifiable rewards (RLVR) shows that small, specialized language models (SLMs) can exhibit structured reasoning without relying on large monolithic LLMs. We introduce soft hidden-state collaboration, where multiple heterogeneous frozen SLM experts are integrated through their internal representations via a trainable attention interface. Experiments on Reasoning Gym and GSM8K show that this latent integration is competitive with strong single-model RLVR baselines. Ablations further reveal a dual mechanism of expert utilization: for simpler arithmetic domains, performance gains can largely be explained by static expert preferences, whereas more challenging settings induce increasingly concentrated and structured expert attention over training, indicating emergent specialization in how the router connects to relevant experts. Overall,…
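One way to make "increasingly concentrated" expert attention concrete is to track the Shannon entropy of the attention weights over experts: lower entropy means the weights focus on fewer experts. The sketch below is an illustration under that assumption; the abstract does not state which concentration measure the paper uses, and `attention_entropy` and the example weight vectors are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of a distribution over experts.

    Zero-weight entries are skipped, since w * log(w) tends to 0 as w -> 0.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# Hypothetical attention over four experts, early versus late in training.
uniform = [0.25, 0.25, 0.25, 0.25]
concentrated = [0.85, 0.05, 0.05, 0.05]

early_entropy = attention_entropy(uniform)
late_entropy = attention_entropy(concentrated)
```

Uniform attention over four experts gives the maximum entropy, log 4, and any more peaked distribution gives a smaller value, so a downward trend over training steps would be consistent with the specialization the abstract describes.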

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)