Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Xiaoze Liu, Dhananjay Ram, Yuting Zhang, Zhaoyang Zhang, Wei Xia, Stefano Soatto

TL;DR
This paper proposes Mutual Reinforcement Learning, a framework for concurrent RL post-training of heterogeneous language models in which the models share experience across different architectures and vocabularies while keeping separate parameters.
Contribution
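It introduces an experience-sharing framework with three modules: a Tokenizer Heterogeneity Layer (THL) for incompatible vocabularies, Multi-Worker Resource Allocation (MWRA), and a Shared Experience Exchange (SEE). The framework is demonstrated through three sharing mechanisms built on GRPO: Peer Rollout Pooling (PRP), Cross-Policy GRPO Advantage Sharing (XGRPO), and Success-Gated Transfer (SGT).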
Findings
Outcome-level sharing offers the best stability-support trade-off.
The framework effectively aligns experiences across incompatible vocabularies.
The choice of sharing strategy (data-level, value-level, or outcome-level) affects both training stability and how well successes transfer between models.
Abstract
We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes the experience-sharing design question operational across model families. We instantiate three controlled probes on top of GRPO: data-level rollout sharing via Peer Rollout Pooling (PRP), value-level advantage sharing via Cross-Policy GRPO Advantage Sharing (XGRPO), and outcome-level success transfer via Success-Gated Transfer (SGT). A contextual-bandit analysis characterizes their structural positions on a stability-support…
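As a rough illustration of the GRPO baseline that the three probes build on, the sketch below computes group-relative advantages by standardizing rewards within a group of rollouts for one prompt. The pooling step is an assumption made purely for illustration (it is not the paper's definition of PRP), and the function name, reward values, and `eps` constant are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its group's mean and std (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for rollouts on the same prompt.
own_rewards = [1.0, 0.0, 0.0, 0.0]   # rollouts from the policy being trained
peer_rewards = [1.0, 1.0, 1.0, 0.0]  # rollouts from a peer policy

# Baseline computed from the policy's own group only.
own_only = group_relative_advantages(own_rewards)

# Illustrative assumption: pool peer rewards into the group before
# normalizing, then keep the entries that belong to the policy's own rollouts.
pooled = group_relative_advantages(own_rewards + peer_rewards)[: len(own_rewards)]

print(own_only)
print(pooled)
```

In this toy setting, pooling raises the group mean, so the policy's single successful rollout receives a smaller advantage than it does under its own-group baseline. This is one simple way to see that shared experience changes the learning signal, not only the amount of data.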
