Frozen Policy Iteration: Computationally Efficient RL under Linear $Q^{\pi}$ Realizability for Deterministic Dynamics

Yijing Ke; Zihan Zhang; Ruosong Wang

arXiv:2603.00716·cs.LG·March 3, 2026

Open Access 3 Reviews

TL;DR

This paper introduces Frozen Policy Iteration, a computationally efficient online reinforcement learning algorithm for MDPs with deterministic dynamics under linear $Q^{\pi}$ realizability. Its regret bound is optimal for linear (contextual) bandits, and it avoids both the computational intractability and the simulator access required by prior methods.

Contribution

We propose a novel, computationally efficient RL algorithm that works with stochastic initial states and deterministic transitions, using on-policy data and extending to function classes with bounded eluder dimension.

Findings

01

Achieves a regret bound of $\tilde{O}(\sqrt{d^2 H^6 T})$, optimal for linear bandits.

02

Circumvents simulator reliance by using high-confidence trajectory data.

03

Extends to Uniform-PAC setting and function classes with bounded eluder dimension.
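To make the scaling of the regret bound concrete, the following minimal sketch evaluates only the leading-order term $\sqrt{d^2 H^6 T}$, ignoring constants and logarithmic factors (both are assumptions here, since the summary does not state them). The function names are hypothetical and chosen for illustration only.

```python
import math


def regret_rate(d: int, H: int, T: int) -> float:
    """Leading-order term sqrt(d^2 * H^6 * T).

    Assumption: constants and log factors hidden by the tilde-O are dropped.
    """
    return math.sqrt(d**2 * H**6 * T)


def episodes_for_average_regret(d: int, H: int, eps: float) -> int:
    """Smallest T with sqrt(d^2 H^6 T) / T <= eps, i.e. T >= d^2 H^6 / eps^2."""
    return math.ceil(d**2 * H**6 / eps**2)


if __name__ == "__main__":
    d, H, eps = 10, 5, 0.1
    T = episodes_for_average_regret(d, H, eps)
    print(f"T = {T}, average regret term = {regret_rate(d, H, T) / T:.4f}")
```

Because the term grows like $\sqrt{T}$, the per-episode average shrinks like $dH^3/\sqrt{T}$, which is why the horizon dependence raised in the reviews matters in practice.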

Abstract

We study computationally and statistically efficient reinforcement learning under the linear $Q^{\pi}$ realizability assumption, where any policy's $Q$-function is linear in a given state-action feature representation. Prior methods in this setting are either computationally intractable or require (local) access to a simulator. In this paper, we propose a computationally efficient online RL algorithm, named Frozen Policy Iteration, under the linear $Q^{\pi}$ realizability setting that works for Markov Decision Processes (MDPs) with stochastic initial states, stochastic rewards, and deterministic transitions. Our algorithm achieves a regret bound of $\tilde{O}(\sqrt{d^{2} H^{6} T})$, where $d$ is the dimensionality of the feature space, $H$ is the horizon length, and $T$ is the total number of episodes. Our regret bound is optimal for linear (contextual) bandits, which is a special case of…
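For reference, the realizability assumption and the regret quantity described above are commonly written as follows. This is a sketch in standard notation, not taken from the paper itself: the feature map $\phi$, the parameter vectors $\theta_h^{\pi}$, and the per-episode initial state $s_1^t$ are assumed symbols.

```latex
% Linear Q^pi realizability (notation assumed for illustration):
% for every policy \pi and step h \in [H] there is \theta_h^{\pi} \in \mathbb{R}^d with
Q_h^{\pi}(s, a) = \langle \phi(s, a), \theta_h^{\pi} \rangle
\quad \text{for all } (s, a).

% Regret over T episodes, where \pi_t is the policy played in episode t
% and s_1^t is that episode's (stochastic) initial state:
\mathrm{Regret}(T) = \sum_{t=1}^{T} \left( V_1^{\star}(s_1^t) - V_1^{\pi_t}(s_1^t) \right).
```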

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The delicate exploitation of the loopless deterministic "tree" structure via peeling and the design of $k_t(s)$ is novel.
- The multi-level uncertainty slicing in Algorithm 2 aligns well with the intuition and is easy to follow.

Weaknesses

- On the disjointness assumption on $\mathcal{S}$: The regret analysis appears to rely heavily on the disjointness of the state space, i.e., $\mathcal{S}_{h_1} \cap \mathcal{S}_{h_2} = \emptyset$ when $h_1 \neq h_2$, which effectively forces the process to be a "tree" (see, e.g., the proof of Lemma 18). This is an unusually strong assumption, because such an assumption is often considered acceptable only in adversarial MDPs.

Minor weaknesses

- According to the definition

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. They present a computationally efficient algorithm that operates without simulator restart or resampling, and provide theoretical guarantees via high-probability regret bounds, Uniform-PAC results, and an extension to function classes based on the eluder dimension.
2. They conduct ablation studies on the effect of freezing across two RL environments and report detailed implementation choices to aid reproducibility.

Weaknesses

1. The theory relies heavily on linear $Q^\pi$ realizability (Assumption 1) and deterministic transitions (Assumption 3).
2. The regret bound and Uniform-PAC bound exhibit significant dependence on the horizon $H$.
3. PAC guarantees may become loose over practically interesting $\varepsilon$ ranges when $\kappa$ is not small.
4. The experiments are limited to verifying Algorithm 1 and ablation studies on freezing, while Algorithm 2—one of the core components of the proposed theory—was neither im

Reviewer 03 · Rating 6 · Confidence 3

Strengths

Overall, the paper is well-written and easy to follow. Achieving what appears to be the first regret bound under the linear $Q^\pi$ realizability setting is an interesting result. However, the paper seems to require clearer articulation and positioning of its key contributions.

Weaknesses

1. Weisz et al. (2023) study a similar setting, except that they allow for stochastic transitions. Their algorithm attains only a PAC guarantee, not a regret bound. If their analysis were restricted to deterministic transitions, would their algorithm also achieve a regret bound comparable to FPI? A discussion of this comparison, perhaps following Theorem 1, should be included in the paper (including explicit contrasts between the PAC guarantees).
2. The computational cost of the proposed algorith

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Gaussian Processes and Bayesian Inference