Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Langlin Huang, Chengsong Huang, Jinyuan Li, Donghong Cai, Yuyi Yang, Jiaxin Huang

TL;DR
This paper introduces LoPE, a perturbation method using Lorem Ipsum text to enhance reasoning exploration in large language models during reinforcement learning.
Contribution
LoPE employs prompt-space perturbations with pseudo-Latin text to improve exploration and performance in LLM reinforcement learning tasks.
Findings
LoPE outperforms traditional resampling methods across multiple model sizes.
Latin-based random sequences with low perplexity are effective perturbations.
LoPE establishes a new baseline for exploration in LLM reinforcement learning.
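To make the idea of a prompt-space perturbation concrete, here is a minimal sketch of prepending random pseudo-Latin filler to a question. This is only an illustration: the helper name `perturb_prompt`, the word list, the filler length, and the choice to place the filler before the question are all assumptions, not the paper's actual recipe.

```python
import random

# Hypothetical vocabulary: words from the opening of the classic Lorem Ipsum passage.
LOREM_WORDS = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua"
).split()


def perturb_prompt(question, n_words=12, seed=None):
    """Prepend task-irrelevant pseudo-Latin filler to a question.

    Assumption: the filler is a random word sequence placed before the
    question; the source does not specify placement or length.
    """
    rng = random.Random(seed)  # local RNG so a seed gives a repeatable result
    filler = " ".join(rng.choice(LOREM_WORDS) for _ in range(n_words))
    return f"{filler}\n\n{question}"


# Different seeds yield different filler for the same question,
# so each resampled rollout can start from a different prompt.
example = perturb_prompt("What is 17 * 24?", n_words=8, seed=0)
```

The question itself is left untouched, so a verifiable reward can still be computed against the original answer.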
Abstract
Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the…
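The zero-advantage problem described in the abstract can be illustrated with a small sketch of the standard GRPO group-relative advantage, where each rollout's reward is normalized by the mean and standard deviation of its group. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own rollout group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; the exact normalization varies by implementation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One of four rollouts succeeds: the success gets a positive advantage
# and the failures get negative ones, so there is a training signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: each reward equals the group mean, so every
# advantage is exactly zero and the query contributes no gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Drawing more rollouts from the same policy and the same prompt only helps if at least one of them succeeds, which is why the abstract describes plain resampling as constrained by the static sampling policy.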
