RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

Yang Liu; Jiaqi Li; Zilong Zheng

arXiv:2506.08672·cs.CL·February 17, 2026

Open Access · 1 Repo · 2 Models · 1 Dataset · 3 Reviews

TL;DR

RuleReasoner introduces a domain-aware dynamic sampling method in reinforcement learning to improve rule-based reasoning across diverse rule formats, achieving superior accuracy and efficiency on multiple benchmarks.

Contribution

It proposes a novel domain-aware dynamic sampling approach in RL for rule reasoning, addressing variability in rule formats and outperforming existing large reasoning models.

Findings

01

Outperforms frontier LRMs by 4.1% on ID tasks

02

Achieves 10.4% improvement on OOD tasks

03

Demonstrates higher computational efficiency

Abstract

Rule-based reasoning is acknowledged as one of the fundamental problems of reasoning. While recent studies show that large reasoning models (LRMs) have remarkable reasoning capabilities enhanced by reinforcement learning (RL), real applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating the domain weights based on historical rewards. This facilitates domain balance and active learning schedules for RL, obviating static mix-training engineered by humans. Evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a…
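The resampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the exponential moving average of per-domain rewards, the softmax over `1 - reward`, and the parameters `alpha` and `tau` are all assumptions chosen for illustration, as are the function names `update_domain_weights` and `sample_batch`. The only idea taken from the source is that domain weights are updated from historical rewards and then used to draw the next batch.

```python
import math
import random


def update_domain_weights(ema_reward, batch_rewards, alpha=0.5, tau=0.5):
    """Update per-domain reward history and return new sampling weights.

    ema_reward:    dict domain -> running average reward (mutated in place).
    batch_rewards: dict domain -> list of rewards in [0, 1] from the last batch.
    alpha, tau:    hypothetical smoothing factor and softmax temperature.
    """
    for domain, rewards in batch_rewards.items():
        if rewards:
            mean_r = sum(rewards) / len(rewards)
            # Assumed update rule: exponential moving average of the reward.
            ema_reward[domain] = alpha * ema_reward[domain] + (1 - alpha) * mean_r

    # Assumed weighting: domains with lower historical reward get larger logits.
    logits = {d: (1.0 - r) / tau for d, r in ema_reward.items()}
    max_logit = max(logits.values())  # subtract max for numerical stability
    exps = {d: math.exp(v - max_logit) for d, v in logits.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}


def sample_batch(pools, weights, batch_size, rng):
    """Draw (domain, example) pairs, choosing domains by their weights."""
    domains = list(weights)
    picks = rng.choices(domains, weights=[weights[d] for d in domains], k=batch_size)
    return [(d, rng.choice(pools[d])) for d in picks]


if __name__ == "__main__":
    rng = random.Random(0)
    pools = {"logic": ["q1", "q2"], "arith": ["q3", "q4"]}  # toy data
    history = {d: 0.0 for d in pools}  # reward history starts at zero
    weights = update_domain_weights(history, {"logic": [1, 1], "arith": [0, 0]})
    print(weights)
    print(sample_batch(pools, weights, 4, rng))
```

In this sketch a domain the policy already solves well is sampled less often in the next batch, which is one plausible way to realize the domain balance the abstract mentions without a hand-tuned static mixture.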

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- Impressive empirical performance
- New logical rule data that would be helpful for future research

Weaknesses

- Mathematical notation should be improved overall:
  - I believe that this notation is the clearest: $\mathcal{D}$ represents a fixed (offline) collection of $(d, q, r, y)$, where $d \in \{d_1, \cdots, d_n\}$ represents the domain. (Algorithm 1 states that $\mathcal{D} = \{d_1, \cdots, d_n\}$, which seemed a bit weird and contradictory to the prior notation.)
  - In Algorithm 1, initializing $\tilde{r}_{0, d_i} \gets 0$ is missing; for notational clarity, $m$ should be written as $

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. This paper introduces a dynamic sampling approach to stabilize RL training across imbalanced domains, which is a practical consideration.
2. RULEREASONER achieves quantitative performance improvements.

Weaknesses

1. Lack of comparison with existing adaptive sampling methods: since the core contribution of this paper appears to be the "domain-aware dynamic sampling", it would be better to include prior curriculum learning or adaptive sampling methods for comparison.
2. Table 2 shows that RL-based methods perform relatively poorly on LogicNLI and AR-LSAT compared to SFT w/ CoT methods. Could the authors explain this phenomenon? Besides, according to the avg results, the performance of models after SFT

Reviewer 03 · Rating 8 · Confidence 4

Strengths

* A large collection of datasets for rule reasoning of varying formats, difficulties/depth, inference types, etc. The experiments demonstrate that these are useful for training.
* The paper proposes a domain-adaptive algorithm for data sampling for RLVR, named DADS, which modifies the reward to be domain-normalized. The rewards per domain are also tracked and used for sampling data in the next batch.
* Comprehensive empirical results on a large suite of both logical inference and mathematical rea

Weaknesses

These are minor suggestions. I did not find major weaknesses given this topic and area.
* The evaluations and collection are ultimately limited by the available data from other logical reasoning datasets; the authors did not collect or generate their own. As a result, they could be missing some rule reasoning types or be over-indexed on some. While dynamic sampling addresses the latter, it cannot fill gaps in missing reasoning types.
* Thanks for including the case studies (and C.2) – however th

Code & Models

Repositories

bigai-nlco/rulereasoner
PyTorch · Official

Models

Datasets

RuleReasoner/RuleCollection-32K
dataset · 275 dl

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning