Flexible Empowerment at Reasoning with Extended Best-of-N Sampling
Taisuke Kobayashi

TL;DR
This paper introduces an extended Best-of-N sampling method incorporating empowerment to enhance exploration-exploitation balance in reinforcement learning, enabling flexible and efficient policy adjustments.
Contribution
It proposes a novel BoN sampling extension using Tsallis statistics to adjust empowerment-driven exploration without explicit policy learning.
Findings
The method effectively balances exploration and exploitation in toy problems.
It improves RL performance on complex locomotion tasks.
Abstract
This paper proposes a novel method that incorporates empowerment when reasoning about actions in reinforcement learning (RL), thereby allowing flexible handling of the exploration-exploitation dilemma (EED). In previous methods, empowerment for promoting exploration has been added as a bonus term to the task-specific reward function, as in intrinsically motivated RL. However, this approach introduces a delay until the policy that accounts for empowerment is learned, making it difficult to adjust the emphasis on exploration as needed. On the other hand, best-of-N (BoN) sampling, a trick devised for fine-tuning recent foundation models at reasoning time, allows modified policies to be acquired implicitly without explicitly learning them. Applying this trick to exploration-promoting terms, such as empowerment, is expected to enable more flexible adjustment of the EED. Therefore,…
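As a rough illustration of the BoN idea the abstract refers to, the sketch below draws N candidate actions from a fixed base policy and keeps the one with the highest score, where the score adds a weighted exploration bonus to a task value. This is plain BoN selection only, not the paper's Tsallis-statistics extension, and every name here (`best_of_n`, `task_value`, `exploration_bonus`, `beta`) is a hypothetical stand-in rather than anything taken from the paper. The bonus is a toy function, not an actual empowerment estimate.

```python
import numpy as np


def best_of_n(sample_action, score, n, rng):
    """Draw n candidates from the base policy and return the highest-scoring one."""
    candidates = [sample_action(rng) for _ in range(n)]
    scores = np.array([score(a) for a in candidates])
    return candidates[int(np.argmax(scores))]


def sample_action(rng):
    # Hypothetical base policy: a standard normal over a 1-D action.
    return float(rng.normal(0.0, 1.0))


def task_value(a):
    # Hypothetical stand-in for a task value estimate; it peaks at a = 1.
    return -(a - 1.0) ** 2


def exploration_bonus(a):
    # Hypothetical stand-in for an empowerment-like term; it favors large |a|.
    return abs(a)


def make_score(beta):
    # beta weights the bonus and can be changed at decision time,
    # without retraining the base policy (assumed usage, for illustration).
    return lambda a: task_value(a) + beta * exploration_bonus(a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for beta in (0.0, 5.0):
        picks = [best_of_n(sample_action, make_score(beta), 8, rng) for _ in range(1000)]
        print(f"beta={beta}: mean |action| = {np.mean(np.abs(picks)):.3f}")
```

Because the base policy is never updated, changing `beta` or `n` between decisions shifts the effective behavior immediately, which is the kind of on-demand adjustment the abstract contrasts with learning a bonus-augmented reward.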
