AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning
Tevin Wang, Chenyan Xiong

TL;DR
AutoRule automates the extraction of rule-based rewards from human preferences, enhancing reinforcement learning by reducing reward hacking and improving performance on benchmark tasks.
Contribution
AutoRule introduces a fully automated method for extracting and synthesizing rules from preference feedback to improve reward signals in reinforcement learning.
Findings
28.6% relative improvement in length-controlled win rate on AlpacaEval 2.0
6.1% gain in second-turn performance on MT-Bench
Reduced reward hacking compared to learned reward models
Abstract
Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6% relative improvement in length-controlled win rate on…
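The reward computation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `rule_reward`, `combined_reward`, `toy_verifier`, and the `rule_weight` parameter are hypothetical, the keyword-matching verifier is only a stand-in for a language-model verifier, and the additive combination with the learned reward is an assumption (the abstract says only that the rule score is used as an auxiliary reward).

```python
from typing import Callable, Sequence

# A verifier answers: does `response` to `prompt` satisfy `rule`?
# In the paper this role is played by a language-model verifier;
# the type alias here is a hypothetical interface for illustration.
Verifier = Callable[[str, str, str], bool]


def rule_reward(
    prompt: str,
    response: str,
    rules: Sequence[str],
    verifier: Verifier,
) -> float:
    """Return the fraction of rules that the response satisfies."""
    if not rules:
        return 0.0  # no rules means no auxiliary signal
    satisfied = sum(1 for rule in rules if verifier(prompt, response, rule))
    return satisfied / len(rules)


def combined_reward(
    learned_reward: float,
    rule_score: float,
    rule_weight: float = 1.0,
) -> float:
    """Combine the learned reward with the rule score.

    ASSUMPTION: a weighted sum. The source does not specify how the
    auxiliary reward is combined with the learned reward model.
    """
    return learned_reward + rule_weight * rule_score


def toy_verifier(prompt: str, response: str, rule: str) -> bool:
    """Hypothetical stand-in for an LM verifier: keyword containment."""
    return rule.lower() in response.lower()


if __name__ == "__main__":
    rules = ["alpha", "beta", "gamma", "delta"]
    score = rule_reward("question", "alpha beta", rules, toy_verifier)
    print(score, combined_reward(1.0, score, rule_weight=2.0))
```

Note that each rule triggers one verifier call per response, so the cost of the auxiliary reward grows linearly with the size of the rule set.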
Peer Reviews
Decision · Submitted to ICLR 2026
The proposed method shows significant gains on AlpacaEval 2.0 and MT-Bench for Llama-3-8B and Olmo-2-7B. It has substantial implications for the community: it provides explicit, human-readable constraints that explain policy behavior. The rules, code, and checkpoints are open-sourced.
The authors argue it is the first fully automated rule-extraction system for RLHF/post-training. However, such pipelines are pretty industrial; therefore, the protocol and engineering efforts might not be innovative for the community. The multi-stage design and instruction-following nature did provide logical transparency of the pipeline, but the paper did not clearly illustrate the merits of such designs. It is conceptually appealing, but I was unable to digest the evidence presented in the ma
* Extracting preference rules over a whole preference dataset is both novel and timely given recent research on rubric-based optimization.
* AutoRule leads to impressive performance gains in terms of best performance and robustness to overoptimization.
* The authors conduct thorough evaluations and ablations which robustly demonstrate their claims.
* One potential confounder is that it is unclear how much of the performance gain comes from the utility of the autogenerated rules versus the amount of inference compute spent. Namely, AutoRule requires a forward pass per rubric item. It would be useful to have an evaluation at fixed inference cost, potentially by varying the thinking length of a normal LLM judge.
* The paper proposes a fully automated pipeline to extract the rules from the data. The amount of manual engineering is pretty nominal.
* The extracted rules are human-interpretable and seem to be aligned with known good practices for LLM responses.
* The results are quite marginal and limited. Llama 3 8B is quite old at this point, and the AE LC win rate is quite low compared to Llama 3 8B Instruct.
* The results seem to raise the question of whether the advantage of rules is mainly effective in out-of-distribution or extreme scenarios rather than in distribution (which seems contrary, as the rules are derived from this distribution).
* The conciseness constraint added to the verifier is an implicit design choice. It may bias the model toward shor
Taxonomy
Topics: Data Management and Algorithms · Rough Sets and Fuzzy Logic · Data Mining Algorithms and Applications
