Augmented Bayesian Policy Search

Mahdi Kallel; Debabrota Basu; Riad Akrour; Carlo D'Eramo

arXiv:2407.04864·cs.LG·July 9, 2024

TL;DR

This paper introduces Augmented Bayesian Search (ABS), a novel method that combines Bayesian Optimization with policy gradients to improve deterministic policy search in reinforcement learning, especially for high-dimensional locomotion tasks.

Contribution

The paper proposes a new mean function for Bayesian Optimization that incorporates the action-value function, bridging BO and policy gradient methods for scalable reinforcement learning.
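To make the role of a mean function concrete, here is a minimal sketch of how a prior mean enters a Gaussian process posterior: the GP models only the residual between observations and the prior mean, so predictions far from the data fall back to that mean. This is not the paper's implementation. The RBF kernel, the toy quadratic objective, and `informed_mean` (a hypothetical stand-in for an estimate built from the action-value function) are all assumptions made for illustration.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior_mean(X, y, X_new, mean_fn, noise=1e-6):
    """Posterior mean m(x*) + k(x*, X) (K + noise*I)^-1 (y - m(X))."""
    K = rbf(X, X) + noise * np.eye(len(X))
    K_s = rbf(X_new, X)
    resid = y - mean_fn(X)  # the GP only has to explain what the prior mean misses
    return mean_fn(X_new) + K_s @ np.linalg.solve(K, resid)

# Toy objective over "policy parameters" (assumption: a simple quadratic).
def objective(X):
    return -np.sum(X**2, axis=1)

# Uninformed prior mean, as in black-box BO.
def zero_mean(X):
    return np.zeros(len(X))

# Hypothetical informed prior mean: the objective plus a small bias,
# standing in for an imperfect critic-based estimate.
def informed_mean(X):
    return -np.sum(X**2, axis=1) + 0.1

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))     # parameters already evaluated
y = objective(X)
X_far = np.full((1, 3), 10.0)   # a query far from all evaluated points

pred_zero = gp_posterior_mean(X, y, X_far, zero_mean)
pred_informed = gp_posterior_mean(X, y, X_far, informed_mean)
err_zero = abs(pred_zero[0] - objective(X_far)[0])
err_informed = abs(pred_informed[0] - objective(X_far)[0])
print(f"error with zero mean:     {err_zero:.3f}")
print(f"error with informed mean: {err_informed:.3f}")
```

Away from the evaluated points the kernel terms vanish, so the quality of the prediction is determined almost entirely by the prior mean. That is the place where extra problem structure can help a local BO method.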

Findings

01 ABS performs competitively on high-dimensional locomotion tasks.

02 The method effectively combines the advantages of BO and policy gradients.

03 Experimental results show improved exploration and policy quality.

Abstract

Deterministic policies are often preferred over stochastic ones when implemented on physical systems. They can prevent erratic and harmful behaviors while being easier to implement and interpret. However, in practice, exploration is largely performed by stochastic policies. First-order Bayesian Optimization (BO) methods offer a principled way of performing exploration using deterministic policies. This is done through a learned probabilistic model of the objective function and its gradient. Nonetheless, such approaches treat policy search as a black-box problem, and thus, neglect the reinforcement learning nature of the problem. In this work, we leverage the performance difference lemma to introduce a novel mean function for the probabilistic model. This results in augmenting BO methods with the action-value function. Hence, we call our method Augmented Bayesian Search (ABS).…
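The performance difference lemma mentioned in the abstract can be checked numerically. For two deterministic policies pi_theta and pi_x it states that J(pi_x) - J(pi_theta) = sum_s d^{pi_x}(s) [Q^{pi_theta}(s, pi_x(s)) - V^{pi_theta}(s)], where d^{pi_x} is the unnormalized discounted state occupancy of pi_x. The sketch below verifies this identity on a small random tabular MDP. The MDP, its sizes, the uniform initial distribution `mu0`, and the helper names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def evaluate(P, R, policy, gamma):
    """Exact V, Q and state-transition matrix of a deterministic tabular policy."""
    n_states = P.shape[0]
    states = np.arange(n_states)
    P_pi = P[states, policy]   # (S, S): next-state distribution under the policy
    r_pi = R[states, policy]   # (S,): one-step reward under the policy
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V      # (S, A)
    return V, Q, P_pi

def discounted_occupancy(P_pi, mu0, gamma):
    """Unnormalized discounted state occupancy: sum_t gamma^t Pr(s_t = s)."""
    n_states = P_pi.shape[0]
    return np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, mu0)

# Random tabular MDP (assumption: 6 states, 3 actions, discount 0.9).
rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)   # each P[s, a] is a distribution over next states
R = rng.random((S, A))
mu0 = np.full(S, 1.0 / S)            # uniform initial state distribution

pi_theta = rng.integers(A, size=S)   # current deterministic policy
pi_x = rng.integers(A, size=S)       # candidate deterministic policy

V_theta, Q_theta, _ = evaluate(P, R, pi_theta, gamma)
V_x, _, P_x = evaluate(P, R, pi_x, gamma)

# Left-hand side: difference in expected discounted return.
lhs = mu0 @ V_x - mu0 @ V_theta

# Right-hand side: advantage of pi_x's actions under pi_theta's values,
# weighted by the state occupancy of pi_x.
d_x = discounted_occupancy(P_x, mu0, gamma)
advantage = Q_theta[np.arange(S), pi_x] - V_theta
rhs = d_x @ advantage

print(f"J(pi_x) - J(pi_theta)        = {lhs:.6f}")
print(f"occupancy-weighted advantage = {rhs:.6f}")
```

Note that the right-hand side only needs the value functions of pi_theta, while the state weighting comes from pi_x. This is what lets an action-value function learned for one policy say something about the return of a nearby policy.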

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

The idea of using the recent local Bayesian optimization methods for efficient policy search for high dimensional problems is quite interesting. Furthermore, it is well integrated with the performance difference lemma to enable updates on the information of theta with queries at x. I can see a lot of potential of this work in the community.

Weaknesses

There are some limitations that can reduce the applicability of this work and make it difficult to understand. For example, the type of policy being used is never explained. The derivation is based on nonparametric (tabular) policies, but those types of policies are too limited for the MuJoCo experiments. Even the linear policy seems to be very limited. The idea of incorporating information from x at theta is interesting, but that still requires sampling rollouts from the policy pi_x, which s

Reviewer 02 · Rating 8 · accept, good paper · Confidence 4

Strengths

* The approach is interesting, I think bootstrapping-free approaches should be given more attention.

Weaknesses

* There is no comparison with one of the frequently used model-free RL methods with bootstrapping.
* The presentation does not classify the approach clearly enough in the very broad field of RL algorithms.
* The results do not show a clear superiority over the two comparative methods. It is not made sufficiently clear why this approach should nevertheless receive attention.
* The method is only tested on deterministic MDPs without mentioning this limitation.

Further comments:
* The term "deter

Reviewer 03 · Rating 8 · accept, good paper · Confidence 4

Strengths

- the paper is well written, contributions are highlighted, clear experimental questions
- while not easy to follow (one needs to have expertise in many different areas: model-free RL, (local) Bayesian Optimization, Gaussian processes) the authors explain their reasoning well
- the experiments are carried out rigorously, including repetitions, definition of research questions, etc.
- the finding is novel: utilizing the performance difference lemma to derive a more informed mean function to have a

Weaknesses

1. minor: The plots can be improved; Figures 2-4 are not very pretty (too high linewidth, scrollbar on the right, grid overrides plot)
2. Performance is only marginally better/worse than existing methods
3. This is not a weakness/critique directly but it relates to point 2. Also this point will be a bit opinionated. I do not believe that the "classical RL setup" of exploration towards exploitation (i.e. gradually improving/figuring out the task) is where this method is applied best. Essentiall

Videos

Augmented Bayesian Policy Search· slideslive

Taxonomy

Topics · Data Quality and Management