TL;DR
BOTS introduces a Bayesian framework for adaptive task selection in LLM reinforcement finetuning, improving data efficiency and model performance by balancing exploration and exploitation without significant additional computation.
Contribution
It presents a novel Bayesian approach with implicit and explicit evidence integration for dynamic task selection, reducing overhead and enhancing finetuning effectiveness.
Findings
Consistently improves data efficiency across domains
Enhances model performance over baseline methods
Operates with negligible additional computational overhead
Abstract
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between…
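To make the selection loop described in the abstract concrete, here is a minimal sketch of Thompson sampling over per-task Beta posteriors. It is not the paper's implementation. The names `TaskPosterior`, `select_tasks`, and `update` are hypothetical, as are the target success rate of 0.5 and the exact way the forgetting weight `lam` and the fusion weight `rho` enter the update. How implicit evidence is derived for unselected tasks is not modeled; it is passed in as pseudo-counts.

```python
import random


class TaskPosterior:
    """Beta(alpha, beta) belief over one task's success probability."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha
        self.beta = beta


def select_tasks(posteriors, batch_size, target=0.5, rng=random):
    """Thompson sampling: draw one success probability per task from its
    posterior, then keep the tasks whose draw is closest to `target`.
    (Assumption: tasks of intermediate difficulty are the most useful.)"""
    draws = {k: rng.betavariate(p.alpha, p.beta) for k, p in posteriors.items()}
    ranked = sorted(draws, key=lambda k: abs(draws[k] - target))
    return ranked[:batch_size]


def update(posteriors, explicit, implicit, lam=0.1, rho=0.1, prior=1.0):
    """Update every task's posterior after one training step.

    explicit: {task: (successes, failures)} observed on selected tasks.
    implicit: {task: (pseudo_successes, pseudo_failures)} inferred for
              unselected tasks (supplied by the caller in this sketch).
    lam:      forgetting weight; old counts decay toward the prior.
    rho:      fusion weight given to implicit relative to explicit evidence.
    """
    for k, p in posteriors.items():
        s_exp, f_exp = explicit.get(k, (0.0, 0.0))
        s_imp, f_imp = implicit.get(k, (0.0, 0.0))
        p.alpha = (1 - lam) * p.alpha + lam * prior + (1 - rho) * s_exp + rho * s_imp
        p.beta = (1 - lam) * p.beta + lam * prior + (1 - rho) * f_exp + rho * f_imp
```

In this sketch the randomness of the posterior draw supplies exploration: a task with few observations has a wide posterior, so it occasionally produces a draw near the target and gets selected, while well-characterized trivial or unsolvable tasks rarely do.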
Peer Reviews
Decision·ICLR 2026 Poster
1. The studied online task selection problem is crucial for the LLM reinforcement finetuning problem, and BOTS's formulation of it as a unified Bayesian inference problem is elegant and well-suited for the task. 2. BOTS avoids the need for additional model rollouts during training, and the computational overhead is shown to be negligible ($\le 0.2\%$) while consistently showing performance gains over random selection, making the framework efficient and effective in adoption. 3. The idea of tryin
1. Although quite efficient, the linear interpolation function in BOTS may be weak as an instantiation of $\tilde{p}(k, \mathcal{B}_t)$. Results in Figure 5 show that the Pearson correlation between estimated and empirical difficulties is weak, and the ROC AUC is also not good enough (especially for Qwen2.5-1.5B-Instruct). And there is no theoretical analysis to justify why a linear interpolation alone can adequately capture the complex information needed for task difficulty estimation. 2. The t
1. The paper is well-written and easy to follow. 2. The proposed idea is interesting and novel. 3. The empirical results demonstrate the effectiveness of the proposed method.
1. The proposed method is evaluated only on binary rewards. It is not clear whether it is applicable to more complex tasks. 2. The comparisons are internal; comparisons with alternative methods are lacking. 3. Some related work on task selection is missing: [1] DATS: Difficulty-Aware Task Sampler for Meta-Learning Physics-Informed Neural Networks. [2] Model predictive task sampling for efficient and robust adaptation
- Provides a principled Bayesian formulation of online task selection that generalizes prior methods. - Fuses explicit and implicit evidence to improve stability and cold-start performance. - Includes ablations on $\lambda$ (forgetting) and $\rho$ (evidence fusion), providing good insight into the method’s behavior. - The interpolation-based implicit evidence is computationally efficient (<0.2% overall overhead).
- Experiments are limited to two LLM scales (1.5B and 7B) of Qwen2.5 on a single benchmark (GURU), raising concerns about generalizability. - The reported improvements over baselines appear modest and sometimes inconsistent across domains and target fractions, and no confidence intervals or statistical tests are provided. - The method assumes access to reference models for implicit evidence; the practicality of this in new domains is not fully addressed.
