Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales
Ju-Seung Byun, Andrew Perrault

TL;DR
This paper introduces a symmetric reinforcement learning loss inspired by supervised learning techniques, significantly improving training stability and performance across diverse tasks, model scales, and feedback scenarios.
Contribution
It adapts the reverse cross entropy (RCE) loss, originally used in supervised learning on noisy data, to reinforcement learning, improving robustness and stability, especially with noisy data and in large language model fine-tuning.
Findings
Improved performance in Atari, MuJoCo, and Box2D tasks.
Enhanced RLHF results in language models for sentiment and summarization.
Notable stability gains with symmetric loss across hyperparameters.
Abstract
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) can introduce additional difficulty. Differing preferences can complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. To enhance training robustness, RL has adopted techniques from supervised learning, such as ensembles and layer normalization. In this work, we improve the stability of RL training by adapting the reverse cross entropy (RCE) from supervised learning for noisy data to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks…
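The reverse cross entropy (RCE) named in the abstract comes from the symmetric cross entropy used for classification with noisy labels. As a rough illustration, the sketch below shows that supervised-learning form only; it is not the paper's symmetric RL loss, whose exact definition is not given in this summary. The function names, the weights `alpha` and `beta`, and the constant `log_zero = -4.0` (a finite stand-in for log 0 on the one-hot target) are assumptions chosen for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def symmetric_cross_entropy(logits, labels, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Supervised symmetric cross entropy: alpha * CE + beta * RCE.

    logits: array of shape (n, k); labels: integer array of shape (n,).
    alpha, beta, and log_zero are illustrative values, not taken from the paper.
    Returns (total, mean CE, mean RCE).
    """
    probs = softmax(logits)
    n, k = probs.shape
    one_hot = np.eye(k)[labels]

    # Standard cross entropy: -sum_k q(k) log p(k), with q the one-hot label.
    ce = -np.log(np.clip(probs[np.arange(n), labels], 1e-12, 1.0))

    # Reverse cross entropy: -sum_k p(k) log q(k).
    # log q is 0 on the labelled class and log_zero (a finite stand-in for
    # log 0) elsewhere, so RCE is bounded by -log_zero.
    log_q = np.where(one_hot > 0, 0.0, log_zero)
    rce = -(probs * log_q).sum(axis=-1)

    total = alpha * ce + beta * rce
    return total.mean(), ce.mean(), rce.mean()


if __name__ == "__main__":
    logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
    labels = np.array([0, 2])
    print(symmetric_cross_entropy(logits, labels))
```

With a one-hot target, the RCE term reduces to `-log_zero * (1 - p_label)`, so it stays bounded even when the label is wrong and the predicted probability of that label is near zero. That boundedness is what makes the reverse term tolerant of label noise in the supervised setting.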
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: A2C · Entropy Regularization · Proximal Policy Optimization
