Going Beyond Heuristics by Imposing Policy Improvement as a Constraint
Chi-Chang Lee, Zhang-Wei Hong, Pulkit Agrawal

TL;DR
This paper introduces HEPO, a reinforcement learning framework that uses heuristic rewards to improve policy performance while mitigating reward hacking, which reduces the human effort needed for reward design.
Contribution
HEPO is a novel policy optimization method that maximizes policy improvement using heuristics, outperforming prior approaches and working well even with poorly designed heuristics.
Findings
HEPO outperforms existing methods on standard benchmarks.
HEPO maintains high performance with non-expert heuristic design.
HEPO reduces human effort in reward engineering.
Abstract
In many reinforcement learning (RL) applications, augmenting the task rewards with heuristic rewards that encode human priors about how a task should be solved is crucial for achieving desirable performance. However, because such heuristics are usually not optimal, much human effort and computational resources are wasted in carefully balancing task and heuristic rewards. Theoretically rigorous ways of incorporating heuristics rely on the idea of policy invariance, which guarantees that the performance of a policy obtained by maximizing heuristic rewards is the same as that of the optimal policy with respect to the task reward. However, in practice, policy invariance does not result in policy improvement, and such methods are known to perform poorly empirically. We propose a new paradigm to mitigate reward hacking and effectively use heuristics based on the practical goal of maximizing…
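To make the two baselines named in the abstract concrete, here is a minimal sketch in Python. It is not HEPO and is not taken from the paper. It only illustrates (a) the hand-weighted sum of task and heuristic rewards whose tuning the abstract calls wasteful, and (b) potential-based reward shaping, used here as the standard textbook example of a policy-invariant method. All function names, the weight parameter, and the toy numbers are assumptions made for illustration.

```python
from typing import List, Sequence


def mixed_reward(task_reward: float, heuristic_reward: float, weight: float) -> float:
    """Hand-tuned mix: task reward plus a weighted heuristic reward.

    `weight` is the hypothetical balancing coefficient a practitioner would tune.
    """
    return task_reward + weight * heuristic_reward


def shaped_rewards(
    task_rewards: Sequence[float], potentials: Sequence[float], gamma: float
) -> List[float]:
    """Potential-based shaping: add gamma * phi(s_{t+1}) - phi(s_t) at each step.

    `potentials` holds phi(s_0), ..., phi(s_T), one more entry than `task_rewards`.
    """
    if len(potentials) != len(task_rewards) + 1:
        raise ValueError("potentials must have len(task_rewards) + 1 entries")
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(task_rewards)
    ]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of rewards along one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Toy trajectory (made-up numbers, for illustration only).
gamma = 0.9
task_rewards = [0.0, 0.0, 1.0]
potentials = [0.5, 1.0, 2.0, 0.0]

plain = discounted_return(task_rewards, gamma)
shaped = discounted_return(shaped_rewards(task_rewards, potentials, gamma), gamma)

# The shaping terms telescope, so the two returns differ only by a quantity
# that depends on the first and last states, not on the actions in between.
shaping_offset = (gamma ** len(task_rewards)) * potentials[-1] - potentials[0]
```

Because the shaping terms telescope, the shaped and unshaped returns differ by an offset that depends only on the endpoints of the trajectory; this is the sense in which such shaping leaves the optimal policy unchanged. The weighted sum carries no such guarantee, so its weight has to be tuned by hand.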
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Ethics and Social Impacts of AI
