IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

Xu Guo; Tianyi Liang; Tong Jian; Xiaogui Yang; Ling-I Wu; Chenhui Li; Zhihui Lu; Qipeng Guo; Kai Chen

arXiv:2508.04632·cs.CL·August 8, 2025

TL;DR

IFDecorator wraps RLVR training for instruction following in a framework that improves training efficiency, curbs reward hacking, and keeps model outputs aligned with user intent.

Contribution

We introduce IFDecorator, a robust framework that wraps RLVR training with components to improve difficulty assessment, intent alignment, and reward hacking detection.

Findings

01. Achieved 87.43% accuracy on IFEval, surpassing larger models.
02. Significantly reduced reward hacking rates with trip wires.
03. Improved performance on FollowBench while maintaining general capabilities.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut…
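
To illustrate how an intent check can gate a verifiable reward, here is a minimal Python sketch. It is not the authors' implementation. It assumes the reward is binary and equals 1 only when every rule-based check passes and an intent check also passes; how the paper actually combines these signals is not stated in the text above. The names `gated_binary_reward`, `under_20_words`, and `mentions_autumn` are hypothetical, and the keyword test is a stand-in for what would in practice be an LLM judge.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]


def gated_binary_reward(
    response: str,
    rule_checks: Sequence[Check],
    intent_check: Check,
) -> float:
    """Return 1.0 only if all rule checks and the intent check pass.

    Assumption: the checks are combined with a logical AND.
    """
    if not all(check(response) for check in rule_checks):
        return 0.0
    return 1.0 if intent_check(response) else 0.0


def under_20_words(response: str) -> bool:
    # Hypothetical hard constraint: fewer than 20 whitespace-separated words.
    return len(response.split()) < 20


def mentions_autumn(response: str) -> bool:
    # Hypothetical stand-in for an LLM judge that checks the response
    # actually addresses the instruction's topic.
    return "autumn" in response.lower()


if __name__ == "__main__":
    checks = [under_20_words]
    # Satisfies the length constraint and the intent.
    print(gated_binary_reward("Autumn leaves drift down, gold on grey.", checks, mentions_autumn))
    # Satisfies the length constraint only: a shortcut that the gate rejects.
    print(gated_binary_reward("ok", checks, mentions_autumn))
```

The second call shows the failure mode the abstract describes: a response that passes the verifier trivially while ignoring what the instruction asked for.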

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. This paper designs a framework called IFDecorator that successfully addresses the long-standing challenge of gauging instruction difficulty by leveraging a cooperative-adversarial flywheel.
2. It proposes the IntentCheck and Trip Wires methods, which effectively mitigate over-optimization and reward hacking in RLVR4IF tasks; I find this direction a particularly interesting angle for RLVR-based instruction-following alignment.
3. Models trained with the approach demonstrate good results acro

Weaknesses

1. My central concern is generalization. IFEval and FollowBench consist largely of verifiable instructions, offering limited evidence that the approach will generalize to real-world instructions, especially restrictive role-play prompts that are hard to verify. This raises substantial doubts about the method’s effectiveness on non-verifiable instructions. In addition, several challenging instruction-following benchmarks (such as ComplexBench, Multi-IF, FoFobench and InfoBench) are not covered.
2.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The idea of decorating RLVR with intent verification plus independent diagnostics is quite original and practical. IntentCheck directly targets the gap between constraint satisfaction and intent fulfillment, and Trip Wires formalize hack probing with HHR.
- The paper is well-written with clear binary reward formulation and careful hybrid verification. The motivating examples and framework figure make the failure modes and fixes easy to grasp.
- The paper conducts extensive experiments with mul

Weaknesses

- IntentCheck and soft-criteria rely on a judge model. More cross-judge validation would be stronger.
- For Trip Wires, human eval shows high precision but only 37.5% recall.
- Although Trip Wires are training-independent, repeated evaluation could still invite Goodhart effects.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Addresses a concrete and increasingly relevant issue through a simple, modular solution that integrates easily into existing training pipelines.
- Replaces naive constraint-counting with pass-rate–driven adaptive filtering, a principled way to balance challenge and solvability in evolved datasets.
- IntentCheck enforces semantic fidelity beyond rule-level correctness, and Trip Wires provide a tangible diagnostic for detecting reward hacking.
- Consistently improves instruction-following per

Weaknesses

- Relies on an LLM-driven EXTRACTINTENT step to decompose each instruction into intent, context, input, and constraints, but this process is not validated for accuracy or consistency (e.g., no human agreement or inter-run checks). The paper shows downstream effectiveness (IntentCheck lowers hacking) but does not establish IntentCheck reliability as a decomposition method.
- The cooperative–adversarial flywheel depends on empirical pass rates and fixed thresholds (e.g., $\tau_\text{low}=0.0$, $\
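
The pass-rate filtering that Reviewer 03 refers to can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's procedure: the function names are hypothetical, the strict lower bound and inclusive upper bound are assumed, and only the lower threshold of 0.0 comes from the review text. The upper threshold is truncated in the review, so the value used in the demo is an arbitrary placeholder.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of sampled responses that passed verification."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def keep_instruction(outcomes: Sequence[bool], tau_low: float, tau_high: float) -> bool:
    """Keep an instruction whose empirical pass rate is in (tau_low, tau_high].

    Assumption: instructions at or below tau_low are treated as unsolvable,
    and those above tau_high as too easy to be useful for training.
    """
    rate = pass_rate(outcomes)
    return tau_low < rate <= tau_high


if __name__ == "__main__":
    TAU_LOW = 0.0   # value quoted in the review
    TAU_HIGH = 0.5  # placeholder only; the real value is not given above
    samples = {
        "never solved": [False, False, False, False],
        "sometimes solved": [True, False, False, False],
        "always solved": [True, True, True, True],
    }
    for name, outcomes in samples.items():
        print(name, keep_instruction(outcomes, TAU_LOW, TAU_HIGH))
```

Because the thresholds are fixed, the set of retained instructions depends on how many responses are sampled per instruction and on the policy used to sample them, which is the sensitivity the reviewer is pointing at.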

Code & Models

Models

guox18/Qwen2.5-7B-Instruct-IFDecorator
model· 1 dl

Datasets

guox18/IFDecorator
dataset· 402 dl