TL;DR
This paper introduces Occupancy Reward Shaping, a method that uses world models and optimal transport to improve credit assignment in goal-conditioned reinforcement learning, especially in sparse reward scenarios.
Contribution
It formalizes how temporal information in world models encodes environment geometry and leverages this for reward shaping without altering optimal policies.
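To make the "without altering optimal policies" claim concrete, here is a minimal sketch of classical potential-based reward shaping, the standard construction known to leave optimal policies unchanged. This summary does not state that ORS takes exactly this form. The toy chain environment, the names `step`, `potential`, and `q_iteration`, and the hand-picked distance-based potential are all assumptions made for illustration, not the paper's method or its learned occupancy model.

```python
import numpy as np

# Hypothetical toy setup: a 6-state chain with an absorbing goal at the right end.
N_STATES, GOAL, GAMMA = 6, 5, 0.9

def step(s, a):
    """Deterministic chain: action 0 moves left, action 1 moves right."""
    if s == GOAL:  # the goal is absorbing and yields no further reward
        return s, 0.0
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s_next, float(s_next == GOAL)  # sparse reward: 1 only on reaching the goal

def potential(s):
    """Assumed potential: negative distance to the goal (zero at the goal itself)."""
    return -float(GOAL - s)

def q_iteration(shaped, n_iters=200):
    """Tabular Q-value iteration, optionally with the shaping term added to the reward."""
    q = np.zeros((N_STATES, 2))
    for _ in range(n_iters):
        new_q = np.zeros_like(q)
        for s in range(N_STATES):
            for a in range(2):
                s_next, r = step(s, a)
                if shaped:
                    # Shaping term: gamma * Phi(s') - Phi(s)
                    r += GAMMA * potential(s_next) - potential(s)
                new_q[s, a] = r + GAMMA * q[s_next].max()
        q = new_q
    return q

q_sparse = q_iteration(shaped=False)
q_shaped = q_iteration(shaped=True)

# Greedy policies over the non-goal states
policy_sparse = q_sparse[:GOAL].argmax(axis=1)
policy_shaped = q_shaped[:GOAL].argmax(axis=1)
```

In this sketch the shaped Q-values differ from the sparse ones only by the state-dependent offset `potential(s)`, so the greedy action in every state is the same under both rewards, while the shaped reward gives a nonzero signal at every step instead of only at the goal.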
Findings
Empirically improves performance by 2.2x across 13 diverse tasks.
Effectively mitigates credit assignment issues in sparse reward settings.
Successfully applied to real-world nuclear fusion control tasks.
Abstract
The temporal lag between actions and their long-term consequences makes credit assignment a challenge when learning goal-directed behaviors from data. Generative world models capture the distribution of future states an agent may visit, indicating that they have captured temporal information. How can that temporal information be extracted to perform credit assignment? In this paper, we formalize how the temporal information stored in world models encodes the underlying geometry of the world. Leveraging optimal transport, we extract this geometry from a learned model of the occupancy measure into a reward function that captures goal-reaching information. Our resulting method, Occupancy Reward Shaping, largely mitigates the problem of credit assignment in sparse reward settings. ORS provably does not alter the optimal policy, yet empirically improves performance by 2.2x across 13 diverse…
