TL;DR
This paper introduces MeRF, a simple method that enhances reinforcement finetuning of large reasoning models by injecting reward information into prompts, leading to improved performance and better alignment with reward functions.
Contribution
The paper proposes MeRF, a novel approach that leverages in-context motivation by explicitly including reward specifications in prompts during reinforcement finetuning of LLMs.
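As a rough illustration of the idea, the sketch below prepends a natural-language description of the reward rules to a task prompt. It is not the paper's implementation: the names RewardRule, build_motivation, and build_prompt are hypothetical, the rule wording is invented, and only the +2 and -1.5 scores come from the reviews quoted on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RewardRule:
    """One line of a verifiable reward function, described in plain language."""
    description: str
    score: float


# Hypothetical rule set; the wording is an assumption, and only the
# +2 / -1.5 values are taken from the reviews quoted on this page.
RULES = [
    RewardRule("The final answer is correct", 2.0),
    RewardRule("The answer is understandable but wrong", -1.5),
]


def build_motivation(rules: List[RewardRule]) -> str:
    """Render the reward rules as a short natural-language 'motivation' text."""
    lines = ["You will be scored by the following rules:"]
    for rule in rules:
        # '+g' prints an explicit sign and drops trailing zeros (2.0 -> '+2').
        lines.append(f"- {rule.description}: {rule.score:+g} points")
    return "\n".join(lines)


def build_prompt(question: str, rules: List[RewardRule],
                 with_motivation: bool = True) -> str:
    """Prepend the motivation to the question, or return the bare question."""
    parts = []
    if with_motivation:
        parts.append(build_motivation(rules))
    parts.append(question)
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("Who is the knight and who is the knave?", RULES))
```

Calling build_prompt with with_motivation=False returns the bare question, which mirrors the reviewers' observation that performance holds without the motivation at test time.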
Findings
MeRF outperforms baseline RLVR in empirical evaluations.
Performance improves with higher consistency between motivation and reward.
The model can adapt to misleading motivations through finetuning.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for large reasoning models to tackle complex tasks. However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model needs to explore the reward space by numerously generating responses and learn from fragmented reward signals, blind to the overall reward patterns. Fortunately, verifiable rewards make the natural language description of the reward function possible, and meanwhile, LLMs have demonstrated strong in-context learning ability. This motivates us to explore if large reasoning models can benefit from a motivation of the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning. In this paper, we introduce…
Peer Reviews
Decision: ICLR 2026 Poster
1. The approach creatively combines in-context learning with RL by explicitly providing reward rules as "motivation," offering a simple extension to existing RLVR paradigms that could inspire hybrid training methods.
2. Experiments cover multiple models (e.g., Qwen2.5 series), with consistent comparisons to baselines, providing some evidence of improved accuracy and efficiency.
3. The paper is well-structured, with clear illustrations of the method, prompts, and results, making the core idea
1. The method is overly simplistic and lacks rigorous theoretical justification; it's unclear how the specific reward scoring rules (e.g., +2 for correctness, -1.5 for understandable but wrong answers) mechanistically influence the model's generation of correct reasoning trajectories, relying too much on intuition without deeper analysis.
2. Extensive experimental data is provided mainly for logic puzzles, but for more general tasks like mathematics and code generation, the motivation descripti
Simple, well-motivated idea that's easy to implement; the paper reads clearly. Consistent improvements over RLVR across two model families (Qwen2.5, DeepSeek-R1-Distill) and multiple reasoning benchmarks; importantly, performance holds without motivation at test time. The method achieves better performance in fewer training steps. For example, in one experiment, MeRF achieved better pass@4 and pass@8 performance at step 140 than the final RLVR model did at step 280.
Currently, MeRF is compared only against RLVR. Given that MeRF consists of injecting the reward into the instruction, consider comparing against tuned-prompt variants via DSPy (https://github.com/stanfordnlp/dspy) to see whether the benefit comes from better prompting. The method's effectiveness is tied to tasks where the reward function is verifiable and describable in simple natural language. This limits the scope of MeRF, making it unclear how it would apply to tasks with
- A novel, simple and very practical approach to improve RLVR, which also makes sense
- Interesting experimental design and results on Q4
- Well presented (in terms of design) to make the paper easy to read
- The experimental results are scattered around the paper and do not seem complete:
  - Figure 1 includes results on 4 different LLMs and Figure 3 includes results on DeepSeek, but most of them are not presented in Table 1.
  - The results in Figure 2 (right) have no details. What dataset is this?
  - Figures 1, 3, 7, 6, 5, and 8 all show performance increasing over steps, but they are grouped differently (some by metric, some by dataset), which feels very repetitive and scattered throughout the paper. Need to
