RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
Yang Liu, Jiaqi Li, Zilong Zheng

TL;DR
RuleReasoner introduces a domain-aware dynamic sampling method in reinforcement learning to improve rule-based reasoning across diverse rule formats, achieving superior accuracy and efficiency on multiple benchmarks.
Contribution
It proposes a novel domain-aware dynamic sampling approach in RL for rule reasoning, addressing variability in rule formats and outperforming existing large reasoning models.
Findings
Outperforms frontier LRMs by 4.1% on ID tasks
Achieves 10.4% improvement on OOD tasks
Demonstrates higher computational efficiency
Abstract
Rule-based reasoning is acknowledged as one of the fundamental problems of reasoning. While recent studies show that large reasoning models (LRMs) have remarkable reasoning capabilities enhanced by reinforcement learning (RL), real applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating the domain weights based on historical rewards. This facilitates domain balance and active learning schedules for RL, obviating static mix-training engineered by humans. Evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a…
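The resampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the exponential moving average, the softmax over `1 - reward`, the parameters `alpha` and `tau`, and the function names `update_domain_weights` and `sample_batch` are all assumptions made for illustration. The abstract only states that domain weights are updated from historical rewards and used to resample each training batch. Rewards are assumed to lie in [0, 1].

```python
import math
import random


def update_domain_weights(ema_rewards, batch_rewards, alpha=0.5, tau=0.5):
    """Update per-domain reward history and derive sampling weights.

    Assumptions (not taken from the paper): history is an exponential
    moving average with decay `alpha`, and weights are a softmax over
    (1 - history) / tau, so weaker domains are sampled more often.
    """
    new_ema = dict(ema_rewards)
    for domain, rewards in batch_rewards.items():
        if rewards:  # skip domains that were absent from this batch
            mean_reward = sum(rewards) / len(rewards)
            new_ema[domain] = alpha * new_ema[domain] + (1 - alpha) * mean_reward

    # Lower historical reward -> larger logit -> higher sampling weight.
    logits = {d: (1.0 - r) / tau for d, r in new_ema.items()}
    max_logit = max(logits.values())  # subtract max for numerical stability
    exps = {d: math.exp(v - max_logit) for d, v in logits.items()}
    total = sum(exps.values())
    weights = {d: e / total for d, e in exps.items()}
    return new_ema, weights


def sample_batch(pool, weights, batch_size, rng):
    """Draw (domain, example) pairs, choosing domains by their weights."""
    domains = list(weights)
    probs = [weights[d] for d in domains]
    picked = rng.choices(domains, weights=probs, k=batch_size)
    return [(d, rng.choice(pool[d])) for d in picked]


# Hypothetical usage: two domains, history initialised to zero.
history = {"logic": 0.0, "arithmetic": 0.0}
history, weights = update_domain_weights(
    history, {"logic": [1.0, 1.0], "arithmetic": [0.0, 0.0]}
)
pool = {"logic": ["q1"], "arithmetic": ["q2", "q3"]}
batch = sample_batch(pool, weights, batch_size=8, rng=random.Random(0))
```

In this sketch the domain with the lower running reward (`arithmetic`) receives the larger weight, which is one plausible way to realise the domain balance the abstract mentions. Other weighting rules would fit the same description.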
Peer Reviews
Decision: ICLR 2026 Poster
- Impressive empirical performance
- New logical rule data that would be helpful for future research
- Mathematical notations should be improved overall:
  - I believe that this notation is the clearest: $\mathcal{D}$ represents a fixed (offline) collection of $(d, q, r, y)$, where $d \in \{d_1, \cdots, d_n\}$ represents the domain. (In Algorithm 1, it states that $\mathcal{D} = \{d_1, \cdots, d_n\}$, which seemed a bit weird and contradictory to prior notations.)
  - In Algorithm 1, initializing $\tilde{r}_{0, d_i} \gets 0$ is missing; for notational clarity, $m$ should be written as $
1. This paper introduces a dynamic sampling approach to stabilize RL training across imbalanced domains, which is a practical consideration.
2. RULEREASONER achieves quantitative performance improvements.
1. Lack of comparison with existing adaptive sampling methods: Since the core contribution of this paper appears to be the "domain-aware dynamic sampling", it would be better to include prior curriculum learning or adaptive sampling methods for comparison.
2. Table 2 shows that RL-based methods perform relatively poorly on LogicNLI and AR-LSAT compared to SFT w/ CoT methods. Could the authors explain this phenomenon? Besides, according to the avg results, the performance of models after SFT
* A large collection of datasets for rule reasoning of varying formats, difficulties/depth, inference types, etc. The experiments demonstrate that these are useful for training.
* The paper proposes a domain-adaptive algorithm for data sampling for RLVR, named DADS, which modifies the reward to be domain-normalized. The rewards per domain are also tracked and used for sampling data in the next batch.
* Comprehensive empirical results on a large suite of both logical inference and mathematical rea
These are minor suggestions. I did not find major weaknesses given this topic and area.
* The evaluations and collection are ultimately limited by the available data from other logical reasoning datasets; the authors did not collect or generate their own. As a result, they could be missing some rule reasoning types or be over-indexed on some. While dynamic sampling addresses the latter, it cannot fill gaps in missing reasoning types.
* Thanks for including the case studies (and C.2) – however th
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
