Flexible Empowerment at Reasoning with Extended Best-of-N Sampling

Taisuke Kobayashi

arXiv:2604.15614·cs.LG·April 20, 2026

TL;DR

This paper extends Best-of-N sampling with empowerment so that the exploration-exploitation balance in reinforcement learning can be adjusted flexibly at action-selection time, without first learning a modified policy.

Contribution

It proposes a novel BoN sampling extension using Tsallis statistics to adjust empowerment-driven exploration without explicit policy learning.

Findings

01

The method effectively balances exploration and exploitation in toy problems.

02

It improves RL performance on complex locomotion tasks.

Abstract

This paper proposes a novel method that incorporates empowerment when reasoning about actions in reinforcement learning (RL), thereby allowing the exploration-exploitation dilemma (EED) to be handled flexibly. In previous methods, empowerment for promoting exploration has been added as a bonus term to the task-specific reward function, in the manner of intrinsically motivated RL. However, this approach introduces a delay until a policy that accounts for empowerment has been learned, making it difficult to adjust the emphasis on exploration as needed. On the other hand, a trick devised for fine-tuning recent foundation models at reasoning time, so-called best-of-N (BoN) sampling, allows modified policies to be acquired implicitly, without explicitly learning them. Applying this trick to exploration-promoting terms, such as empowerment, is expected to enable more flexible adjustment of the EED. Therefore,…
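The plain BoN idea that the abstract builds on can be sketched in a few lines. This is a minimal illustration only, not the paper's extended method: it does not use Tsallis statistics, and the names `best_of_n`, `make_score`, `task_value`, `exploration_bonus`, and `weight` are hypothetical. The bonus here is a toy stand-in for empowerment, chosen only so that the effect of the weight is visible.

```python
import random

def best_of_n(sample_action, score, n, rng):
    """Draw n candidate actions from a base sampler and keep the best-scoring one."""
    candidates = [sample_action(rng) for _ in range(n)]
    return max(candidates, key=score)

def make_score(task_value, exploration_bonus, weight):
    """Combine a task-specific value with a weighted exploration bonus (assumed additive)."""
    return lambda action: task_value(action) + weight * exploration_bonus(action)

# Hypothetical toy problem: one-dimensional actions in [-1, 1].
def task_value(action):
    return -(action - 0.5) ** 2      # task prefers actions near 0.5

def exploration_bonus(action):
    return abs(action)               # toy stand-in for empowerment, NOT the paper's estimator

def sample_action(rng):
    return rng.uniform(-1.0, 1.0)    # base policy: uniform over the action range

rng = random.Random(0)
exploit = best_of_n(sample_action, make_score(task_value, exploration_bonus, 0.0), 16, rng)
explore = best_of_n(sample_action, make_score(task_value, exploration_bonus, 5.0), 16, rng)
```

Because the weight enters only the scoring function, it can be changed between calls without retraining anything, which is the kind of flexibility the abstract attributes to BoN-style selection.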
