# Mirage or Method? How Model-Task Alignment Induces Divergent RL Conclusions

**Authors:** Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He

arXiv: 2508.21188 · 2025-09-03

## TL;DR

This paper examines how alignment between a pretrained model and its target task, measured by pass@k accuracy, shapes the outcome of reinforcement learning (RL). It finds that many counterintuitive phenomena reported for LLMs arise mainly when the model is already well aligned with the task.

## Contribution

The paper systematically analyzes the role of model-task alignment in RL outcomes, clarifying when the surprising results hold and when only standard RL remains effective.

## Key findings

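- Counterintuitive RL phenomena (single-example training, inaccurate reward signals, training solely on negative samples) arise mainly under strong model-task alignment.
- Standard RL training remains consistently robust across settings.
- In more challenging regimes without strong alignment, these techniques fail to drive substantial learning, while standard RL remains effective.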

## Abstract

Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold - and, critically, when they fail - remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21188/full.md

## Figures

15 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21188/full.md

## References

34 references, with the full list in the complete paper: https://tomesphere.com/paper/2508.21188/full.md

---
Source: https://tomesphere.com/paper/2508.21188