Multi-timestep models for Model-based Reinforcement Learning
Abdelhakim Benechehab, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl

TL;DR
This paper introduces a multi-timestep training objective for one-step dynamics models in model-based reinforcement learning, significantly improving long-horizon predictions and robustness in noisy environments.
Contribution
It proposes a novel multi-timestep training objective for one-step models: a weighted sum of losses over several future horizons. This objective improves long-horizon prediction accuracy, particularly on noisy data.
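One way to write such an objective is shown below. The notation is introduced here for illustration and is not taken from the paper:
- $f_\theta$ is the learned one-step model.
- $\ell$ is a per-step loss (for example, a negative log-likelihood).
- $H$ is the largest horizon.
- $\alpha_h \ge 0$ are the horizon weights.

Predictions are obtained by feeding the model its own outputs while replaying the recorded actions.

$$
\mathcal{L}(\theta) = \sum_{h=1}^{H} \alpha_h \, \ell\big(\hat{s}_{t+h}, s_{t+h}\big),
\qquad \hat{s}_{t+h} = f_\theta\big(\hat{s}_{t+h-1}, a_{t+h-1}\big), \qquad \hat{s}_t = s_t.
$$

An exponentially decaying profile would then be, for instance, $\alpha_h \propto \beta^{h-1}$ with $0 < \beta < 1$. Choosing $H = 1$ recovers ordinary one-step training.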
Findings
Multi-timestep models outperform standard one-step-trained models on long-horizon predictions.
Exponentially decaying weights improve long-horizon R2 scores.
The improvement is most noticeable when models are evaluated on noisy data, suggesting robustness to noise.
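To make "long-horizon R2 score" concrete, here is a minimal sketch. It assumes the score at horizon h compares open-loop predictions against the observed states, using the standard coefficient of determination averaged over state dimensions. The predictions come from chaining a one-step model over the logged actions. The helper names (`rollout`, `r2_score`, `r2_at_horizon`) and the toy linear system are hypothetical; this is not the authors' evaluation code.

```python
import numpy as np

# Hypothetical toy linear system, used only to illustrate the metric.
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])

def true_model(s, a):
    return A @ s + B @ a

def biased_model(s, a):
    # Slightly misspecified one-step model; its errors compound over a rollout.
    return (A + 0.02) @ s + B @ a

def rollout(model, s0, actions):
    """Open-loop rollout: each prediction is fed back as the next input state."""
    s = s0
    preds = []
    for a in actions:
        s = model(s, a)
        preds.append(s)
    return np.stack(preds)  # shape (len(actions), state_dim)

def r2_score(y_true, y_pred):
    """Coefficient of determination, averaged uniformly over state dimensions."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

def r2_at_horizon(model, states, actions, h):
    """R2 of h-step open-loop predictions started from every valid time t.

    states has shape (T + 1, d) and actions has shape (T, k); the logged
    actions a_t, ..., a_{t+h-1} are replayed for each start time t.
    """
    T = len(actions)
    preds = [rollout(model, states[t], actions[t:t + h])[-1] for t in range(T - h + 1)]
    targets = states[h:T + 1]  # s_{t+h} for t = 0, ..., T - h
    return r2_score(targets, np.array(preds))

# Generate a trajectory from the true system with random actions.
rng = np.random.default_rng(0)
actions = rng.normal(size=(200, 1))
states = [rng.normal(size=2)]
for a in actions:
    states.append(true_model(states[-1], a))
states = np.array(states)

for h in (1, 5, 20):
    print(h, r2_at_horizon(biased_model, states, actions, h))
```

With the misspecified model, the score should fall as h grows. That drop is the compounding effect a multi-timestep objective aims to reduce.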
Abstract
In model-based reinforcement learning (MBRL), most algorithms rely on simulating trajectories from one-step dynamics models learned from data. A critical challenge of this approach is the compounding of one-step prediction errors as the length of the trajectory grows. In this paper we tackle this issue by using a multi-timestep objective to train one-step models. Our objective is a weighted sum of a loss function (e.g., negative log-likelihood) at various future horizons. We explore and test a range of weight profiles. We find that exponentially decaying weights lead to models that significantly improve the long-horizon R2 score. This improvement is particularly noticeable when the models are evaluated on noisy data. Finally, using a soft actor-critic (SAC) agent in pure batch reinforcement learning (RL) and iterated batch RL scenarios, we find that our multi-timestep models outperform or…
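The objective described above can be sketched as follows, under a few assumptions. The model is a deterministic one-step model trained with mean squared error instead of the negative log-likelihood. The horizon weights are normalized and decay exponentially, with alpha_h proportional to beta**(h - 1). The names `multi_step_loss`, `exp_weights`, the decay rate `beta`, and the toy linear system are hypothetical, and the paper's exact parameterization may differ. Only the loss value is computed here; training would differentiate it with respect to the model parameters.

```python
import numpy as np

def exp_weights(H, beta):
    """Hypothetical profile: alpha_h proportional to beta**(h - 1), normalized to sum to 1."""
    w = beta ** np.arange(H, dtype=float)
    return w / w.sum()

def multi_step_loss(model, states, actions, weights):
    """Weighted sum over horizons h = 1..H of the h-step open-loop squared error.

    states: (T + 1, d) observed states; actions: (T, k) logged actions.
    The one-step model is applied to its own predictions, so the term at
    horizon h measures compounded error rather than one-step error.
    """
    H = len(weights)
    T = len(actions)
    starts = range(T - H + 1)
    total = 0.0
    for t in starts:
        s_hat = states[t]
        for h in range(1, H + 1):
            s_hat = model(s_hat, actions[t + h - 1])
            total += weights[h - 1] * np.mean((s_hat - states[t + h]) ** 2)
    return total / len(starts)

# Hypothetical toy system and a slightly misspecified one-step model.
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
true_model = lambda s, a: A @ s + B @ a
biased_model = lambda s, a: (A + 0.02) @ s + B @ a

rng = np.random.default_rng(0)
actions = rng.normal(size=(200, 1))
states = [rng.normal(size=2)]
for a in actions:
    states.append(true_model(states[-1], a))
states = np.array(states)

# A single weight of 1.0 on h = 1 is the usual one-step training loss.
one_step = multi_step_loss(biased_model, states, actions, np.array([1.0]))
multi = multi_step_loss(biased_model, states, actions, exp_weights(5, 0.5))
print(one_step, multi)
```

The decaying weights keep most of the emphasis on short horizons. They still penalize a model whose small one-step errors accumulate over longer rollouts.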
Peer Reviews
Decision: Submitted to ICLR 2024
The paper is nicely written and very easy to follow.
The paper has two severe weaknesses: first, the proposed approach has been evaluated multiple times, and second, the experimental evaluation is very limited. 1) Multi-step losses: If I understand the proposed multi-step loss correctly, this loss has been proposed and used very often. For example, see references [1-4], and there are many more. I am quite certain that one could even go back to the older system identification literature that talks about the multi-step loss for linear
- The use of the R2 score for evaluation of the prediction accuracy was nice, as it provides an interpretable metric.
- The literature review and discussion were OK.
- The experimental results are not substantial. The method is only demonstrated on cart-pole, and there is no statistically significant improvement. I am not convinced the method works effectively.
- In some of the datasets, the data is generated from a fixed policy, and the one-step model is used to predict the state at time step $t+h$, by sequentially applying the actions $a_t, a_{t+1}, a_{t+2}$, etc. that were applied in the rollout. In practice, the actions may also be correlated with the s
Despite the severe limitations of the paper, the following are positive points about its perspective on model-based reinforcement learning:
- The problem of finding better loss functions for training dynamics models, taking into account how the reinforcement learning algorithm will ultimately use these models, is important and relevant to the community.
- I find the approach of weighting different prediction horizons differently to be promising.
Unfortunately, I believe that the current iteration of the paper lacks a sufficient level of rigor for the contribution to be ready for publication:
- Although the paper acknowledges this as a limitation, I believe that conducting the study on a single, extremely simple environment narrows the scope of the paper so much that it becomes irrelevant. I encourage the authors to consider a larger suite of benchmarks (e.g., MuJoCo, Brax, Myriad, MinAtar, Atari), picking the one that best suit
Taxonomy
Topics: Reinforcement Learning in Robotics · Sports Analytics and Performance
