Multi-timestep models for Model-based Reinforcement Learning

Abdelhakim Benechehab; Giuseppe Paolo; Albert Thomas; Maurizio Filippone; Balázs Kégl

arXiv:2310.05672 · cs.LG · October 12, 2023

Open Access · 3 Reviews

TL;DR

This paper introduces a multi-timestep training objective for one-step dynamics models in model-based reinforcement learning, significantly improving long-horizon predictions and robustness in noisy environments.

Contribution

It proposes a novel multi-timestep training approach with weighted loss functions, enhancing long-term prediction accuracy and performance in noisy, real-world scenarios.
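As a rough illustration of what a weighted multi-timestep objective can look like, the sketch below rolls a one-step model forward on its own predictions and sums per-horizon errors with exponentially decaying weights. It is a minimal sketch under stated assumptions, not the authors' implementation: mean squared error stands in for the loss (the paper mentions negative log-likelihood as an example), the weights are normalized to sum to one, and the names `exponential_weights`, `multi_timestep_loss`, `beta`, and the `model(state, action)` interface are all hypothetical.

```python
import numpy as np


def exponential_weights(horizon, beta):
    """Weights proportional to beta**h for h = 1..horizon, normalized to sum to 1.

    The normalization is an assumption made for this sketch.
    """
    raw = beta ** np.arange(1, horizon + 1)
    return raw / raw.sum()


def multi_timestep_loss(model, states, actions, weights):
    """Weighted sum of per-horizon squared errors for a one-step model.

    states:  array of shape (T + 1, state_dim), one recorded trajectory
    actions: array of shape (T, action_dim), the actions that were applied
    model:   callable (state, action) -> predicted next state (hypothetical)
    weights: array of length H, one weight per prediction horizon
    """
    horizon = len(weights)
    n_starts = len(actions) - horizon + 1
    total = 0.0
    for t in range(n_starts):
        pred = states[t]
        for h in range(horizon):
            # Feed the model its own prediction, together with the recorded action.
            pred = model(pred, actions[t + h])
            err = np.mean((pred - states[t + h + 1]) ** 2)  # MSE stand-in for the loss
            total += weights[h] * err
    return total / n_starts
```

With `beta` below 1 the weights shrink as the horizon grows, so short-horizon errors dominate the objective while longer horizons still contribute; a single weight (`horizon = 1`) recovers the usual one-step loss.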

Findings

01

Multi-timestep models outperform standard models in long-horizon predictions.

02

Exponentially decaying weights improve long-horizon R² scores.

03

Models perform better in noisy environments, demonstrating robustness.

Abstract

In model-based reinforcement learning (MBRL), most algorithms rely on simulating trajectories from one-step dynamics models learned on data. A critical challenge of this approach is the compounding of one-step prediction errors as the length of the trajectory grows. In this paper we tackle this issue by using a multi-timestep objective to train one-step models. Our objective is a weighted sum of a loss function (e.g., negative log-likelihood) at various future horizons. We explore and test a range of weight profiles. We find that exponentially decaying weights lead to models that significantly improve the long-horizon R² score. This improvement is particularly noticeable when the models are evaluated on noisy data. Finally, using a soft actor-critic (SAC) agent in pure batch reinforcement learning (RL) and iterated batch RL scenarios, we found that our multi-timestep models outperform or…
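To make the notion of a long-horizon score concrete, the following sketch computes an R² value for predictions made a fixed number of steps ahead by iterating a one-step model with the recorded actions. It is an illustrative sketch only: the function name `r2_at_horizon`, the `model(state, action)` interface, and the choice to average R² over state dimensions are assumptions, not details taken from the paper.

```python
import numpy as np


def r2_at_horizon(model, states, actions, h):
    """R2 of h-step-ahead predictions from an iterated one-step model.

    states:  array of shape (T + 1, state_dim), one recorded trajectory
    actions: array of shape (T, action_dim), the actions that were applied
    model:   callable (state, action) -> predicted next state (hypothetical)
    """
    n_starts = len(actions) - h + 1
    preds = []
    for t in range(n_starts):
        pred = states[t]
        for k in range(h):
            # Open-loop rollout: reuse the recorded actions, feed back predictions.
            pred = model(pred, actions[t + k])
        preds.append(pred)
    preds = np.asarray(preds)
    targets = states[h : h + n_starts]  # true states h steps after each start
    ss_res = np.sum((targets - preds) ** 2, axis=0)
    ss_tot = np.sum((targets - targets.mean(axis=0)) ** 2, axis=0)
    # Average over state dimensions (an assumption of this sketch).
    return float(np.mean(1.0 - ss_res / ss_tot))
```

Evaluating this for increasing `h` gives a curve that shows how quickly one-step errors compound: a model with a small one-step bias can score close to 1 at `h = 1` and noticeably lower at larger horizons.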

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 1 (strong reject) · Confidence 4

Strengths

The paper is nicely written and very easy to follow.

Weaknesses

The paper has two severe weaknesses: first, the proposed approach has been evaluated multiple times; second, the experimental evaluation is very limited.

1) Multi-step Losses: If I understand the proposed multi-step loss correctly, this multi-step loss has been proposed and utilized very often. For example, see the references [1-4], and there are many more. I am quite certain that one could even go back to the older system identification literature that talks about the multi-step loss for linear

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

- The use of the R² score for evaluation of the prediction accuracy was nice, as it provides an interpretable metric.
- The literature review and discussion were OK.

Weaknesses

- The experimental results are not substantial. The method is only demonstrated on cart-pole, and there is no statistically significant improvement. I am not convinced the method works effectively.
- In some of the datasets, the data is generated from a fixed policy, and the one-step model is used to predict the state at time step $t+h$, by sequentially applying the actions $a_t, a_{t+1}, a_{t+2}$, etc. that were applied in the rollout. In practice, the actions may also be correlated with the s

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

Despite the severe limitations of the paper, the following are positive points about its perspective on model-based reinforcement learning:
- The problem of finding better loss functions for training models of the dynamics, considering the final use that the reinforcement learning algorithm will make of these models, is important and relevant to the community.
- I find the approach based on weighting different prediction horizons differently to be promising.

Weaknesses

Unfortunately, I believe that the current iteration of the paper lacks a sufficient level of rigor for the contribution to be ready for publication:
- Although the paper acknowledges this as a limitation, I believe the fact that the study is conducted using only a single, extremely simple environment reduces the scope of the paper so much as to make it irrelevant. I encourage the authors to consider a larger suite of benchmarks (e.g., MuJoCo, Brax, Myriad, MinAtar, Atari), picking the one that best suit

Taxonomy

Topics: Reinforcement Learning in Robotics · Sports Analytics and Performance