TL;DR
This paper introduces MEDS, a memory-enhanced reward shaping method that uses historical behavioral data to penalize recurring errors, thereby improving diversity and performance in reinforcement learning for language models.
Contribution
MEDS is a framework that incorporates historical behavioral signals into reward design, reducing repeated mistakes and encouraging broader exploration during language model training.
Findings
MEDS improves performance by up to 4.13 pass@1 points.
MEDS increases behavioral diversity during sampling.
Gains are consistent across five datasets and three base models.
Abstract
Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average…
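The penalty mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`dbscan`, `shaped_rewards`), the rule that a reward of 0 or less marks an error, the linear penalty proportional to cluster share, and every hyperparameter (`eps`, `min_samples`, `penalty`) are assumptions made for illustration. The feature vectors stand in for the stored intermediate model representations, and a minimal DBSCAN stands in for whichever density-based clustering method the paper uses.

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point, -1 marks noise."""
    n = len(points)
    # Pairwise Euclidean distances; fine for a small memory buffer.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue  # not a core point; stays noise unless claimed later
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border or core point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand only through core points
        cluster += 1
    return labels

def shaped_rewards(features, rewards, eps=0.5, min_samples=2, penalty=0.5):
    """Subtract a penalty from failed rollouts that fall in dense error clusters.

    Hypothetical shaping rule: the penalty scales linearly with the cluster's
    share of all failed rollouts, so more prevalent error patterns cost more.
    """
    features = np.asarray(features, dtype=float)
    shaped = np.asarray(rewards, dtype=float).copy()
    failed = np.flatnonzero(shaped <= 0)  # assumption: reward <= 0 means error
    if len(failed) == 0:
        return shaped
    labels = dbscan(features[failed], eps, min_samples)
    for c in np.unique(labels[labels >= 0]):
        members = failed[labels == c]
        shaped[members] -= penalty * len(members) / len(failed)
    return shaped

# Toy memory: three near-identical failures, one isolated failure, one success.
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [1.0, 1.0]]
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
example = shaped_rewards(features, rewards)
```

In this toy example the three clustered failures each receive a shaped reward of -0.375 (a penalty of 0.5 scaled by their 3/4 share of all failures), while the isolated failure and the successful rollout keep their original rewards. The point of the design is that a one-off mistake is not penalized beyond its base reward, whereas a mistake the policy keeps repeating becomes progressively more costly.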
