Automatic Reward Shaping from Multi-Objective Human Heuristics
Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu, Yu Wang

TL;DR
This paper introduces MORSE, a framework that automatically combines multiple human heuristics into a reward function for reinforcement learning, improving multi-objective task performance without manual tuning.
Contribution
MORSE formulates reward shaping as a bi-level optimization problem and incorporates stochastic exploration to effectively balance multiple objectives in RL tasks.
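The bi-level structure can be illustrated with a toy sketch. This is not the authors' implementation: the scalar "policy" `x`, the two quadratic heuristics, the task metric, and the finite-difference outer update are all assumptions chosen so the example runs in plain Python. The inner loop maximizes the shaped reward for a fixed weight `w`; the outer loop adjusts `w` to improve task performance.

```python
# Toy bi-level reward shaping (all functions and constants are hypothetical).
# Shaped reward: r_w(x) = w * h1(x) + (1 - w) * h2(x)
#   h1(x) = -(x - 0)^2   (heuristic 1 prefers x = 0)
#   h2(x) = -(x - 4)^2   (heuristic 2 prefers x = 4)
# Task performance (what we actually care about): J(x) = -(x - 3)^2

def task_performance(x):
    return -(x - 3.0) ** 2

def train_policy(w, steps=200, lr=0.1):
    """Inner loop: gradient ascent on the shaped reward for a fixed weight w."""
    x = 0.0
    for _ in range(steps):
        grad = w * (-2.0 * (x - 0.0)) + (1.0 - w) * (-2.0 * (x - 4.0))
        x += lr * grad
    return x

def outer_loop(w=0.9, iters=100, lr=0.02, eps=1e-3):
    """Outer loop: finite-difference ascent on task performance w.r.t. w."""
    for _ in range(iters):
        j_plus = task_performance(train_policy(min(w + eps, 1.0)))
        j_minus = task_performance(train_policy(max(w - eps, 0.0)))
        grad = (j_plus - j_minus) / (2.0 * eps)
        w = min(1.0, max(0.0, w + lr * grad))  # keep the weight in [0, 1]
    return w, train_policy(w)

w_star, x_star = outer_loop()
```

In this toy problem the inner optimum is `x = 4 * (1 - w)`, so the outer loop moves `w` toward 0.25 and the trained `x` toward 3. A purely gradient-driven outer loop like this one has no exploration mechanism, which is the gap the stochastic component of MORSE is meant to address.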
Findings
MORSE achieves performance comparable to manually tuned rewards.
It effectively balances multiple objectives in robotic tasks.
The framework improves exploration and avoids local minima.
Abstract
Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE…
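The novelty signal mentioned in the abstract, the prediction error of a fixed, randomly initialized network, can be sketched as follows. The abstract does not say what the network takes as input or how the error is turned into noise, so this example assumes the input is a vector of reward weights and shows only the novelty score itself. The class names, the linear predictor, and the `most_novel` helper are hypothetical.

```python
import math
import random

class FixedRandomNet:
    """Target network: random weights that are never trained."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
        self.b = [rng.gauss(0.0, 1.0) for _ in range(out_dim)]

    def __call__(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W, self.b)]

class LinearPredictor:
    """Trainable predictor that learns to imitate the target on visited inputs."""
    def __init__(self, in_dim, out_dim):
        self.W = [[0.0] * in_dim for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.W, self.b)]

    def update(self, x, target, lr=0.1):
        # One SGD step on the squared error between prediction and target.
        pred = self(x)
        for i, (p, t) in enumerate(zip(pred, target)):
            err = p - t
            for j, xj in enumerate(x):
                self.W[i][j] -= lr * 2.0 * err * xj
            self.b[i] -= lr * 2.0 * err

def novelty(x, target_net, predictor):
    """Prediction error: large for unfamiliar inputs, small for visited ones."""
    return sum((p - t) ** 2 for p, t in zip(predictor(x), target_net(x)))

target_net = FixedRandomNet(in_dim=2, out_dim=8)
predictor = LinearPredictor(in_dim=2, out_dim=8)

visited = [0.5, 0.5]      # reward weights already tried (assumed input)
unvisited = [0.9, 0.1]    # reward weights not yet tried

novelty_before = novelty(visited, target_net, predictor)
for _ in range(50):
    predictor.update(visited, target_net(visited))
novelty_after = novelty(visited, target_net, predictor)

def most_novel(candidates):
    """Hypothetical helper: pick the candidate with the highest novelty."""
    return max(candidates, key=lambda w: novelty(w, target_net, predictor))
```

After the predictor has been fitted to the visited weights, their novelty score collapses, so a rule such as `most_novel([visited, unvisited])` favors the unexplored candidate. How MORSE combines this score with task performance to shape the injected noise is not specified in the text above.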
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper addresses a critical bottleneck in RL for robotics: the labor-intensive and error-prone process of manual reward function design
- Framing the problem as a bi-level optimization is a good approach (also used in other works)
- The use of synthetic 2D optimization functions to isolate and test the outer-loop exploration strategy (Sec 6.1) is a good experimental design concept and the motivating example was clear
- Tables 1 and 2 report means without any confidence intervals or error bars, making it impossible to validate the significance of the results
- The paper is a methodology paper but is only tested on locomotion tasks. This is not a robotics paper. The method's generality is unproven, and it should have been tested on a broader set of environments
- The experiments use tasks with only 2 or 3 heuristics, despite motivating the problem with a 15-heuristic example (Margolis and Agrawal). The method'
- The paper is easy to follow and provides sufficient details for reproducibility. The authors also release their code.
- The authors address an important task in RL: finding reward weights for multiple reward terms is difficult and time-consuming, and at the same time it is crucial for successful training.
- The authors run extensive validation on synthetic examples as well as popular benchmarks, and they provide ablation studies for their design decisions.
- The paper's primary weakness lies in its comparison to existing methods. For me, it is not fully clear what the precise differences and similarities to traditional Multi-Objective Reinforcement Learning (MORL) are. The authors claim to solve a different task, but it is plausible that a standard MORL formulation could serve as a strong baseline with minor modifications. A direct comparison is currently missing. For instance, the "Gradient w/ Reset" baseline is intuitively similar to some MORL t
1. The motivation is clear and realistic: reward shaping is indeed a bottleneck in RL, and automating it is a useful and practical direction.
2. The paper is well motivated and the proposed method is technically sound overall, combining bi-level optimization and exploration in a novel way.
3. The approach is close to practice, since many real tasks combine a sparse goal reward with auxiliary shaping terms.
4. The ablation studies are valuable, helping clarify which design elements (exploratio
1. The experiments are restricted to simple locomotion tasks with only 2–3 heuristic components. This is far from realistic use cases with many interdependent objectives. The study would be stronger if it included manipulation or visual tasks (for instance, environments from RLBench, James et al., 2020).
2. While using novelty helps, relying solely on RND is not fully intuitive for reward-space exploration. Methods such as Bayesian Optimization could provide a more principled trade-off between
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Autonomous Vehicle Technology and Safety
