IFDecorator: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
Xu Guo, Tianyi Liang, Tong Jian, Xiaogui Yang, Ling-I Wu, Chenhui Li, Zhihui Lu, Qipeng Guo, Kai Chen

TL;DR
IFDecorator enhances instruction-following reinforcement learning for large language models by improving training efficiency, preventing reward hacking, and ensuring better alignment with user intent through a novel framework.
Contribution
We introduce IFDecorator, a robust framework that wraps RLVR training with components to improve difficulty assessment, intent alignment, and reward hacking detection.
Findings
Achieved 87.43% accuracy on IFEval, surpassing larger models.
Significantly reduced reward hacking rates with trip wires.
Improved performance on FollowBench while maintaining general capabilities.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction-following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut…
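As a rough illustration of how an intent check can gate a verifiable reward, the sketch below returns a binary reward that is 1 only when every rule-based check and an intent check pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `binary_reward`, `under_20_words`, and `mentions_topic` are hypothetical, and the keyword-based intent check is a toy stand-in for whatever judge the paper's IntentCheck module uses.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]


def binary_reward(response: str,
                  rule_checks: Sequence[Check],
                  intent_check: Check) -> float:
    """Return 1.0 only if all rule checks AND the intent check pass."""
    # Rule-based (verifiable) constraints, e.g. length or format limits.
    if not all(check(response) for check in rule_checks):
        return 0.0
    # Intent gate: a response that satisfies the rules but ignores the
    # actual request receives no reward.
    return 1.0 if intent_check(response) else 0.0


# Toy instruction (hypothetical): "Describe penguins in under 20 words."
def under_20_words(text: str) -> bool:
    return len(text.split()) < 20


def mentions_topic(text: str) -> bool:
    # Stand-in for a judge model; a keyword test is an assumption
    # made here purely for illustration.
    return "penguin" in text.lower()


honest = "Penguins are flightless seabirds that live mostly in the Southern Hemisphere."
hacked = "Short."  # meets the length rule while ignoring the request

print(binary_reward(honest, [under_20_words], mentions_topic))
print(binary_reward(hacked, [under_20_words], mentions_topic))
```

The second response satisfies the length constraint yet earns no reward, which is the kind of verification shortcut the abstract describes.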
Peer Reviews
Decision·Submitted to ICLR 2026
1. This paper designs a framework called IFDecorator that successfully addresses the long-standing challenge of gauging instruction difficulty by leveraging a cooperative-adversarial flywheel.
2. It proposes the IntentCheck and Trip Wires methods, which effectively mitigate over-optimization and reward hacking in RLVR4IF tasks; I find this direction a particularly interesting angle for RLVR-based instruction-following alignment.
3. Models trained with the approach demonstrate good results acro

1. My central concern is generalization. IFEval and FollowBench consist largely of verifiable instructions, offering limited evidence that the approach will generalize to real-world instructions—especially restrictive role-play prompts that are hard to verify. This raises substantial doubts about the method’s effectiveness on non-verifiable instructions. In addition, several challenging instruction-following benchmarks—such as ComplexBench, Multi-IF, FoFobench and InfoBench—are not covered.
2.

- The idea of decorating RLVR with intent verification plus independent diagnostics is quite original and practical. IntentCheck directly targets the gap between constraint satisfaction and intent fulfillment, and Trip Wires formalize hack probing with HHR.
- The paper is well-written with clear binary reward formulation and careful hybrid verification. The motivating examples and framework figure make the failure modes and fixes easy to grasp.
- The paper conducts extensive experiments with mul

- IntentCheck and soft-criteria rely on a judge model. More cross-judge validation would be stronger.
- For Trip Wires, human eval shows high precision but only 37.5% recall.
- Although Trip Wires are training-independent, repeated evaluation could still invite Goodhart effects.

- Addresses a concrete and increasingly relevant issue through a simple, modular solution that integrates easily into existing training pipelines.
- Replaces naive constraint-counting with pass-rate–driven adaptive filtering, a principled way to balance challenge and solvability in evolved datasets.
- IntentCheck enforces semantic fidelity beyond rule-level correctness, and Trip Wires provide a tangible diagnostic for detecting reward hacking.
- Consistently improves instruction-following per

- Relies on an LLM-driven EXTRACTINTENT step to decompose each instruction into intent, context, input, and constraints, but this process is not validated for accuracy or consistency (e.g., no human agreement or inter-run checks). The paper shows downstream effectiveness (IntentCheck lowers hacking) but does not establish IntentCheck reliability as a decomposition method.
- The cooperative–adversarial flywheel depends on empirical pass rates and fixed thresholds (e.g., $\tau_\text{low}=0.0$, $\
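
The pass-rate filtering that the reviews refer to can be pictured roughly as follows. This is a hedged sketch, not the paper's procedure: `pass_rate` and `keep_instruction` are hypothetical names, the upper threshold is cut off in the text above so the `0.5` used here is only a placeholder, and whether each bound is inclusive or exclusive is also an assumption.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of sampled responses that passed verification."""
    if not outcomes:
        raise ValueError("need at least one sampled outcome")
    return sum(outcomes) / len(outcomes)


def keep_instruction(outcomes: Sequence[bool],
                     tau_low: float,
                     tau_high: float) -> bool:
    """Keep an instruction whose pass rate lies in (tau_low, tau_high].

    Assumption: instructions at or below tau_low are treated as
    unsolvable for the current model, and those above tau_high as
    too easy to provide a useful training signal.
    """
    rate = pass_rate(outcomes)
    return tau_low < rate <= tau_high


TAU_LOW = 0.0    # value quoted in the review text
TAU_HIGH = 0.5   # placeholder only; the actual value is not shown above

never_solved = [False] * 8             # pass rate 0.0
hard_but_solvable = [True] + [False] * 7   # pass rate 0.125
always_solved = [True] * 8             # pass rate 1.0

for name, outcomes in [("never_solved", never_solved),
                       ("hard_but_solvable", hard_but_solvable),
                       ("always_solved", always_solved)]:
    print(name, keep_instruction(outcomes, TAU_LOW, TAU_HIGH))
```

Under these assumed thresholds only the middle case is retained, which illustrates the reviewer's point that the outcome depends on fixed cutoffs applied to empirical pass rates.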
