Double Horizon Model-Based Policy Optimization
Akihiro Kubo, Paavo Parmas, Shin Ishii

TL;DR
The paper introduces DHMBPO, a novel model-based reinforcement learning method that employs two different rollout horizons to better balance bias, variance, and distribution shift, leading to improved efficiency and stability.
Contribution
It proposes a double-horizon approach that splits model rollouts into a distribution phase and a training phase, each with its own horizon, to address the conflicting optimal horizons in model-based RL.
Findings
Outperforms existing MBRL methods on continuous-control benchmarks.
Achieves higher sample efficiency and lower runtime.
Effectively balances distribution shift, model bias, and gradient variance.
Abstract
Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long "distribution…
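The two-rollout idea can be illustrated with a minimal sketch. It is not the authors' implementation. The toy 1-D dynamics model, the linear policy, the function names, and the horizon values (20 and 5) are all assumptions made for illustration. The sketch also omits the gradient computation and any learned critic, so it shows only the structure: a long rollout gathers model-generated states, and a short rollout from each of those states produces the quantity that training would use.

```python
import random

def model_step(state, action):
    """Hypothetical learned dynamics model: a toy 1-D linear system (assumption)."""
    next_state = 0.9 * state + 0.1 * action
    reward = -(next_state ** 2)  # reward is highest at the origin
    return next_state, reward

def policy(state, gain):
    """Hypothetical linear policy with a single parameter `gain` (assumption)."""
    return -gain * state

def distribution_rollout(start_states, gain, horizon):
    """Long rollout: keep every model-generated state so the collected
    states come from the current policy rather than past ones."""
    visited = []
    for state in start_states:
        for _ in range(horizon):
            state, _ = model_step(state, policy(state, gain))
            visited.append(state)
    return visited

def training_rollout(state, gain, horizon, discount=0.99):
    """Short rollout: discounted return over `horizon` model steps."""
    ret = 0.0
    for t in range(horizon):
        state, reward = model_step(state, policy(state, gain))
        ret += (discount ** t) * reward
    return ret

def double_horizon_objective(replay_states, gain, dr_horizon=20, tr_horizon=5):
    """Average short-horizon return over states from the long rollout.
    The horizon defaults are illustrative, not values from the paper."""
    states = distribution_rollout(replay_states, gain, dr_horizon)
    returns = [training_rollout(s, gain, tr_horizon) for s in states]
    return sum(returns) / len(returns), len(states)

random.seed(0)
replay_states = [random.uniform(-1.0, 1.0) for _ in range(4)]  # stand-in for a replay buffer
objective, n_states = double_horizon_objective(replay_states, gain=0.5)
```

Keeping the two horizons as separate arguments is the point of the sketch: the length used to generate states can be tuned independently of the length used to estimate returns.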
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Model Reduction and Neural Networks
