Calibrated Value-Aware Model Learning with Probabilistic Environment Models
Claas Voelcker, Anastasiia Pedan, Arash Ahmadian, Romina Abachi, Igor Gilitschenski, Amir-massoud Farahmand

TL;DR
This paper analyzes value-aware model learning losses such as the MuZero loss, shows that they are uncalibrated as normally used, and proposes corrections so that the correct model and value function can be recovered in reinforcement learning.
Contribution
It provides a theoretical analysis of value-aware losses, identifies calibration problems, and introduces corrections to enhance model learning in reinforcement learning.
Findings
Value-aware losses, as normally used, are uncalibrated surrogate losses.
Calibrated stochastic models can outperform deterministic ones.
Proposed corrections improve value and model accuracy.
Abstract
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are…
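To make the calibration claim concrete, here is a minimal numerical sketch. It is not taken from the paper: the value function `value`, the Gaussian next-state distribution, and both loss functions are hypothetical choices for illustration, and whether this mechanism matches the paper's precise analysis is an assumption. It relies only on the standard identity E[(V(s') - y)^2] = (E[V(s')] - y)^2 + Var[V(s')], which implies that a squared value error computed per model sample also penalizes the model's own variance.

```python
import random
import statistics

def value(s):
    # Hypothetical value function; nonlinear purely for illustration.
    return s * s

def sample_based_loss(samples, target):
    # Squared value error computed per model sample, then averaged.
    return statistics.fmean((value(s) - target) ** 2 for s in samples)

def mean_based_loss(samples, target):
    # Squared error of the model's expected value prediction.
    return (statistics.fmean(value(s) for s in samples) - target) ** 2

random.seed(0)
n = 20000

# Assumed ground truth: stochastic next states drawn from N(1.0, 0.5^2).
true_next = [random.gauss(1.0, 0.5) for _ in range(n)]
target = statistics.fmean(value(s) for s in true_next)  # estimate of E[V(s')]

# Candidate A: the correct stochastic model (fresh samples, same distribution).
model_true = [random.gauss(1.0, 0.5) for _ in range(n)]

# Candidate B: a deterministic model that always predicts a single state
# whose value equals the target.
model_det = [target ** 0.5] * n

loss_true_sample = sample_based_loss(model_true, target)
loss_det_sample = sample_based_loss(model_det, target)
loss_true_mean = mean_based_loss(model_true, target)

print(f"sample-based loss, true stochastic model: {loss_true_sample:.4f}")
print(f"sample-based loss, deterministic model:   {loss_det_sample:.4f}")
print(f"mean-based loss,   true stochastic model: {loss_true_mean:.6f}")
```

In this toy setting the per-sample loss is far from zero for the correct stochastic model, because it includes the variance of V(s'), while the deterministic model drives it to essentially zero. Minimizing it would therefore not recover the true model. The mean-based variant is shown only as one possible contrast, not as the correction the authors propose.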
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
Methods: Batch Normalization · Residual Connection · Monte-Carlo Tree Search · Residual Block · Convolution · Average Pooling · Prioritized Experience Replay · MuZero
