Dynamic Optimizations of LLM Ensembles with Two-Stage Reinforcement Learning Agents
Selim Furkan Tekin, Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Ling Liu

TL;DR
This paper presents RL-Focal, a two-stage reinforcement learning framework that dynamically selects and fuses LLM ensembles for improved task performance and robustness, using novel diversity metrics and adaptive policies.
Contribution
The paper introduces RL-Focal, a novel two-stage RL approach with focal diversity metrics for dynamic LLM ensemble routing and fusion, enhancing performance and robustness.
Findings
Achieves an 8.48% performance improvement over the best individual LLMs.
Effectively promotes reward-aware and policy-adaptive ensemble selection.
Demonstrates stronger robustness across five benchmarks.
Abstract
The advancement of LLMs and their accessibility have triggered renewed interest in multi-agent reinforcement learning as robust and adaptive frameworks for dynamically changing environments. This paper introduces RL-Focal, a two-stage RL agent framework that routes and ensembles LLMs. First, we develop the Decider RL-agent, which learns to dynamically select an ensemble of small size among LLMs for incoming queries from a user-defined downstream task, by maximizing both error-diversity and reasoning-performance of the selected ensemble through iterative updates of task-adaptive rewards and policy. Second, to enable effective fusion of dynamically selected LLMs, we develop the stage-2 Fusion RL-agent, which learns to resolve reasoning conflicts from different LLMs and dynamically adapts to different ensemble teams composed by the Decider Agent for different…
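The selection objective described in the abstract (favoring teams that are both accurate and diverse in their errors) can be illustrated with a small sketch. This is not the RL-Focal algorithm: it replaces the learned Decider policy with a brute-force search over fixed-size teams, and it uses plain pairwise disagreement as a stand-in for the paper's focal diversity metric, which this summary does not define. The function names, the weight `lam`, and the toy predictions are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def pairwise_disagreement(preds, team):
    """Mean fraction of queries on which two team members answer differently.

    Assumption: a simple proxy for diversity, not the paper's focal metric.
    """
    pairs = list(combinations(team, 2))
    if not pairs:
        return 0.0
    n = len(preds[team[0]])
    total = sum(
        sum(a != b for a, b in zip(preds[i], preds[j])) / n
        for i, j in pairs
    )
    return total / len(pairs)


def team_reward(preds, labels, team, lam=0.5):
    """Majority-vote accuracy plus a weighted diversity bonus (lam is assumed)."""
    votes = [majority_vote([preds[m][q] for m in team]) for q in range(len(labels))]
    accuracy = sum(v == y for v, y in zip(votes, labels)) / len(labels)
    return accuracy + lam * pairwise_disagreement(preds, team)


def select_team(preds, labels, k, lam=0.5):
    """Exhaustively pick the size-k team with the highest reward.

    A learned policy would replace this search in an RL setting.
    """
    return max(
        combinations(sorted(preds), k),
        key=lambda team: team_reward(preds, labels, team, lam),
    )


# Toy validation set: four hypothetical models answering four queries.
labels = ["x", "y", "x", "y"]
preds = {
    "A": ["x", "y", "x", "y"],  # correct on every query
    "B": ["x", "y", "x", "x"],  # wrong on the last query
    "C": ["y", "y", "x", "y"],  # wrong on the first query
    "D": ["x", "y", "y", "x"],  # wrong on the last two queries
}

best = select_team(preds, labels, k=3)
print(best, team_reward(preds, labels, best))
```

In this toy setting the weaker model D is preferred over B because its errors overlap less with the rest of the team; setting `lam=0` reduces the score to plain majority-vote accuracy.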
Peer Reviews
Decision·Submitted to ICLR 2026
Interesting two-stage formulation separating selection (Decider) and combination (Fusion), with a multi-agent RL formulation and a centralized critic to stabilize training. Algorithms and training loops are clearly described (Algorithm 1 and 2). Furthermore, the paper attempts cost accounting and shows wall-clock/param comparisons in Appendix E (encouraging effort to quantify cost).
Some similar RL ensemble approaches limit the novelty (e.g., RLAE can in effect prune LLMs by driving their weights toward zero), although they are formulated differently. The paper motivates RL via online adaptivity, but an explicit demonstration of that advantage would clarify its necessity. Furthermore, training with two RL policies and a centralized critic adds significant computational overhead over supervised learning methods, though this is perhaps offset by the lower inference
1. Tackles an important, practical problem: adaptive, query-wise ensembling/routing among LLMs rather than static majority voting.
2. Ablations and sensitivity analyses help understand behavior.
1. **Poor writing/formatting.**
   1) Lines 50–51 contain content that should not appear in the paper; please remove or rewrite appropriately.
   2) The captions/layout for Figure 3 and Figure 4 have almost no spacing, which hurts readability. Please increase the vertical spacing and ensure consistent caption styling.
2. **Overstated novelty in the problem formulation.** The paper claims to be the first to formulate LLM ensembling as a POMDP, yet prior work (e.g., RLAE [A], DER [B]) already models LLM
* Well motivated.
* Improvement on BBH seems significant.
* Methods are described in detail.
The manuscript has substantial room for improvement, particularly in the presentation and experimental design.
* The manuscript's structure seems unbalanced. Only two of the nine pages are dedicated to describing experimental results. Given the apparent lack of a theoretical contribution, the content devoted to methods and general descriptions should be significantly compressed to allow for a deeper discussion of the findings and ablations.
* The current experiments and ablations are limited.
