Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning

Shenao Zhang; Yaqing Wang; Yinxiao Liu; Tianqi Liu; Peter Grabowski; Eugene Ie; Zhaoran Wang; Yunxuan Li

arXiv:2505.20561·cs.LG·December 9, 2025

TL;DR

This paper introduces BARL, a Bayesian RL framework that enhances LLM reasoning by promoting reflective exploration, leading to improved performance and efficiency over traditional RL methods.

Contribution

The paper proposes a Bayesian RL approach for LLMs that encourages self-reflection and information gathering, addressing limitations of conventional RL in fostering reflective behaviors.

Findings

01

BARL outperforms traditional RL on reasoning tasks.

02

It achieves higher test accuracy and token efficiency.

03

Empirical results validate the effectiveness of Bayesian reflective exploration.

Abstract

Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as rethinking and error correction, as a form of in-context exploration. However, the Markovian policy obtained from conventional RL training does not give rise to reflective exploration behaviors since the policy depends on the history only through the state and therefore has no incentive to enrich identical states with additional context. Instead, RL exploration is only useful during training to learn the optimal policy in a trial-and-error manner. Therefore, it remains unclear whether reflective reasoning will emerge during RL, or why it is beneficial. To remedy this, we recast reflective exploration within a Bayesian RL framework, which optimizes the expected return under a posterior distribution over Markov decision processes…
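The Bayesian objective described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's algorithm. It assumes a hypothetical pair of candidate MDPs (`mdp_a`, `mdp_b`), two hypothetical strategies with made-up returns, and a made-up observation likelihood. It only shows how a posterior over MDPs reweights expected return, so that evidence gathered in context can change which strategy looks best.

```python
from typing import Dict


def bayes_update(belief: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior over hypotheses: prior times likelihood, renormalized."""
    unnormalized = {m: belief[m] * likelihood[m] for m in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {m: w / total for m, w in unnormalized.items()}


def expected_return(belief: Dict[str, float], returns: Dict[str, float]) -> float:
    """Expected return of one strategy, averaged over the belief on MDPs."""
    return sum(belief[m] * returns[m] for m in belief)


def best_strategy(belief: Dict[str, float], table: Dict[str, Dict[str, float]]) -> str:
    """Strategy with the highest belief-weighted return (first one wins ties)."""
    return max(table, key=lambda s: expected_return(belief, table[s]))


# Hypothetical setup: two candidate MDPs, initially equally plausible.
prior = {"mdp_a": 0.5, "mdp_b": 0.5}

# Hypothetical return of each strategy under each candidate MDP.
returns_by_strategy = {
    "strategy_1": {"mdp_a": 1.0, "mdp_b": 0.0},
    "strategy_2": {"mdp_a": 0.0, "mdp_b": 1.0},
}

# Hypothetical likelihood of an observed outcome under each candidate MDP.
likelihood = {"mdp_a": 0.1, "mdp_b": 0.9}

posterior = bayes_update(prior, likelihood)
choice_before = best_strategy(prior, returns_by_strategy)
choice_after = best_strategy(posterior, returns_by_strategy)

print("posterior:", posterior)
print("choice before evidence:", choice_before)
print("choice after evidence:", choice_after)
```

In this toy setting the two strategies are tied under the prior, and the observation shifts the belief toward `mdp_b`, which makes `strategy_2` preferable. A policy that conditions only on the current state has no way to represent that shift.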

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 0 · Confidence 5

Strengths

+ Quality
  - Strengths
    - The authors display a nice insight in trying to port over ideas from Bayesian RL into modern LLM fine-tuning.
  - Weaknesses
    * Major
      - In RL (or what the authors seem keen to refer to as "conventional/Markovian RL"), there is no such distinction between training time and testing time as there is in standard supervised learning. There is just one single, periodic stream of an agent interacting with an environment episode after episode. Yet, throughout the paper, the

Weaknesses

Please see above.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. Clearly articulates the limitation of standard Markovian RL (its inability to support test-time reflective exploration) and provides theoretical justification for why this hampers generalization.
2. The theoretical framework is very solid: it introduces Bayes-Adaptive MDPs to model LLM reasoning, formalizes test-time generalization as maximizing expected return under a posterior over candidate tasks, and grounds exploration in Bayesian principles.
3. Authors present novel BARL, w

Weaknesses

1. Computational Overhead: Despite KV-cache reuse, the per-step computation still scales linearly with the number of sampled hypotheses (|M|) and grows rapidly when the model size or context length increases; no GPU-hours or throughput curves are reported to quantify this burden.
2. Keyword-based Reflection Detection: Relying on fixed trigger words to flag “self-reflection” is unreliable: it captures only explicit surface signals and cannot verify whether the model actually revises its reasoning
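The concern in point 2 can be made concrete with a minimal sketch of keyword-based detection. The trigger list below is hypothetical and is not taken from the paper, and the two example strings are invented for illustration. The sketch shows only that surface matching can flag text that revises nothing and miss text that does revise an earlier step.

```python
import re

# Hypothetical trigger phrases, chosen for illustration only.
TRIGGERS = ("wait", "let me re-check", "on second thought", "alternatively")

# Match any trigger as a whole word or phrase, ignoring case.
_PATTERN = re.compile(
    "|".join(r"\b" + re.escape(t) + r"\b" for t in TRIGGERS),
    re.IGNORECASE,
)


def flags_reflection(text: str) -> bool:
    """Return True if the text contains any trigger phrase."""
    return _PATTERN.search(text) is not None


# Contains a trigger word but does not revise anything.
no_revision = "Wait, the answer is still 4."

# Revises an earlier result without using any trigger word.
silent_revision = "The sum is 5, not 4 as computed earlier."

print(flags_reflection(no_revision))
print(flags_reflection(silent_revision))
```

The first string is flagged and the second is not, which is the opposite of what a check for actual revision would report.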

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- The motivation for the paper is clear and the paper is well written.
- The formulation is easy to follow, and the small-scale experiments help in understanding the effectiveness of the approach.
- Strong improvements in token efficiency and slight improvements over GRPO on math-reasoning tasks.

Weaknesses

- Ablation of progress reward. The provided formulation assumes the reward is based on some gold action (e.g. answer token y), and this is used to introduce a progress reward incentivizing the model to think. However, the effect of this reward does not seem to be ablated. Additionally, there are domains where the reward is not a function of a gold action (e.g. for agentic coding domains, the reward may be computed by executing sampled code against some test cases). Does this limit the generality of this approac

Taxonomy

Topics: Business Process Modeling and Analysis · Semantic Web and Ontologies