Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs

Dan Qiao; Ming Yin; Yu-Xiang Wang

arXiv:2302.12456·cs.LG·February 27, 2023

TL;DR

This paper introduces the ELEANOR-LowSwitching algorithm for reinforcement learning in linear Bellman-complete MDPs, achieving near-optimal regret with a logarithmic switching cost, extending previous work beyond linear MDPs.

Contribution

It presents a new algorithm with logarithmic switching cost for a broader class of MDPs and establishes lower bounds, advancing the understanding of exploration and policy switching costs.

Findings

01. Achieves near-optimal regret with logarithmic switching cost

02. Proves a lower bound proportional to dH for switching costs

03. Extends the approach to generalized linear function approximation

Tables

Table 1: Comparison of our results (in blue) to existing works regarding regret bound and (global) switching cost bound. “Low IBE” is short for low inherent Bellman error while “GLM” represents generalized linear function approximation, where both settings generalize linear MDP. For both “Low IBE” and “GLM” settings, we assume the total reward is bounded by $1$. In particular, we show the regret bound for “Low IBE” assuming the inherent Bellman error is $0$, while the detailed result is shown in Theorem 4.1. We highlight that our switching cost upper bounds under both settings match the corresponding lower bounds up to logarithmic factors. $\dagger$: Here $d_{h}$ is the dimension of feature map for the $h$-th layer and $K$ is the number of episodes. When applied to linear MDP, there will be an additional factor of $H$ in the regret bound while $d_{h}=d$ for all $h$. Therefore, regret bound and switching cost bound will be $\widetilde{O}(\sqrt{d^{2}H^{4}K})$ and $O(dH\log K)$, respectively. $\star$: When applied to linear MDP, there will be an additional factor of $H$ in the regret bound, and the regret bound will be $\widetilde{O}(\sqrt{d^{3}H^{4}K})$. $\ddagger$: This result is generalized by Wang et al. (2021), whose algorithm has the same switching cost bound under this regret bound. $*$: The switching cost here is local switching cost (defined in Bai et al. (2019)), which is specific to tabular MDP.

Algorithms for regret minimization	Setting	Regret bound	Switching cost bound
Our Algorithm 1 (Theorem 4.1) $\dagger$	Low IBE	$\widetilde{O}(\sum_{h=1}^{H}d_{h}\sqrt{K})$	$O(\sum_{h=1}^{H}d_{h}\log K)$
Our Algorithm 2 (Theorem 6.4) $\star$	GLM	$\widetilde{O}(H\sqrt{d^{3}K})$	$O(dH\log K)$
Algorithm 1 of Gao et al. (2021) $\ddagger$	Linear MDP	$\widetilde{O}(\sqrt{d^{3}H^{4}K})$	$O(dH\log K)$
UCB-Advantage (Zhang et al., 2020)	Tabular MDP	$\widetilde{O}(\sqrt{H^{3}SAK})$	$O(H^{2}SA\log K)$ $*$
APEVE (Qiao et al., 2022)	Tabular MDP	$\widetilde{O}(\sqrt{H^{5}S^{2}AK})$	$O(HSA\log\log K)$
Lower bound (Theorem 4.2)	Low IBE	If “no-regret”	$\Omega(\sum_{h=1}^{H}d_{h})$
Lower bound (Theorem 6.5)	GLM	If “no-regret”	$\Omega(dH)$

Equations

$\mathcal{T}_{h}(Q_{h+1})(s,a)=r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim P_{h}(\cdot\mid s,a)}\max_{a^{\prime}}Q_{h+1}(s^{\prime},a^{\prime}).$

$\text{Regret}(K):=\sum_{k=1}^{K}\left[V_{1}^{\star}(s_{1})-V_{1}^{\pi_{k}}(s_{1})\right],$

$N_{switch}:=\sum_{k=1}^{K-1}\mathds{1}\{\pi_{k}\neq\pi_{k+1}\}.$

$\mathcal{B}_{h}:=\left\{\theta_{h}\in\mathbb{R}^{d_{h}}\mid|\phi_{h}(s,a)^{\top}\theta_{h}|\leq 1,\ \forall\,(s,a)\right\},$

$Q_{h}(\theta)(s,a)=\phi_{h}(s,a)^{\top}\theta,\quad V_{h}(\theta)(s)=\max_{a}\phi_{h}(s,a)^{\top}\theta.$

$\mathcal{Q}_{h}:=\{Q_{h}(\theta_{h})\mid\theta_{h}\in\mathcal{B}_{h}\},\quad\mathcal{V}_{h}:=\{V_{h}(\theta_{h})\mid\theta_{h}\in\mathcal{B}_{h}\}.$

$\|\phi_{h}(s,a)\|_{2}\leq 1,\ \forall\,(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}.$

$\|\theta_{h}\|_{2}\leq\sqrt{d_{h}},\ \forall\,h\in[H],\ \theta_{h}\in\mathcal{B}_{h}.$

$\sup_{\theta_{h+1}\in\mathcal{B}_{h+1}}\inf_{\theta_{h}\in\mathcal{B}_{h}}\sup_{s,a}\left|\phi_{h}(s,a)^{\top}\theta_{h}-(\mathcal{T}_{h}Q_{h+1}(\theta_{h+1}))(s,a)\right|.$

$\sum_{\tau=1}^{k-1}\left((\phi_{h}^{\tau})^{\top}\theta-r_{h}^{\tau}-V_{h+1}(\theta_{h+1})(s_{h+1}^{\tau})\right)^{2}+\lambda\|\theta\|_{2}^{2},$

$\widehat{\theta}_{h}=(\Sigma_{h}^{k})^{-1}\sum_{\tau=1}^{k-1}\phi_{h}^{\tau}\left[r_{h}^{\tau}+V_{h+1}(\theta_{h+1})(s_{h+1}^{\tau})\right],$

$\begin{aligned}\max_{\{\bar{\xi}_{h}\}_{h\in[H]},\,\{\widehat{\theta}_{h}\}_{h\in[H]},\,\{\bar{\theta}_{h}\}_{h\in[H]}}\quad&\max_{a}\phi_{1}(s_{1},a)^{\top}\bar{\theta}_{1}\\ \text{subject to}\quad&\widehat{\theta}_{h}=(\Sigma_{h}^{k})^{-1}\sum_{\tau=1}^{k-1}\phi_{h}^{\tau}\left(r_{h}^{\tau}+V_{h+1}(\bar{\theta}_{h+1})(s_{h+1}^{\tau})\right),\\&\bar{\theta}_{h}=\widehat{\theta}_{h}+\bar{\xi}_{h};\quad\|\bar{\xi}_{h}\|_{\Sigma_{h}^{k}}\leq\sqrt{\alpha_{h}^{k}};\quad\bar{\theta}_{h}\in\mathcal{B}_{h}.\end{aligned}$

$\|\bar{\xi}_{h}\|_{\Sigma_{h}^{k}}\leq\sqrt{\alpha_{h}^{k}}:=\widetilde{O}\left(\sqrt{d_{h}+d_{h+1}}\right)+\sqrt{k}\,\mathcal{I},$

$\text{Regret}(K)\leq\widetilde{O}\left(\sum_{h=1}^{H}d_{h}\sqrt{K}+\sum_{h=1}^{H}\sqrt{d_{h}}\,\mathcal{I}K\right).$

$\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{N}})\geq 2^{N}\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{0}}).$

$\text{Regret}(K)=\sum_{k=1}^{K}\left(V_{1}^{\star}(s_{1})-V_{1}^{\pi_{b_{k}}}(s_{1})\right)\leq HK\mathcal{I}+\sum_{k=1}^{K}\left(\bar{V}_{1}^{b_{k}}(s_{1})-V_{1}^{\pi_{b_{k}}}(s_{1})\right).$

$(\bar{Q}_{h}^{b_{k}}-\mathcal{T}_{h}\bar{Q}_{h+1}^{b_{k}})(s,a)\leq\mathcal{I}+2\|\phi_{h}(s,a)\|_{(\Sigma_{h}^{b_{k}})^{-1}}\sqrt{\alpha_{h}^{b_{k}}},$

$\begin{aligned}\sum_{k=1}^{K}\left(\bar{V}_{1}^{b_{k}}(s_{1})-V_{1}^{\pi_{b_{k}}}(s_{1})\right)&\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\left(\mathcal{I}+2\|\phi_{h}(s_{h}^{k},a_{h}^{k})\|_{(\Sigma_{h}^{b_{k}})^{-1}}\sqrt{\alpha_{h}^{b_{k}}}\right)+\text{Sum of bounded martingale difference}\\&\leq\underbrace{\sum_{k=1}^{K}\sum_{h=1}^{H}2\|\phi_{h}(s_{h}^{k},a_{h}^{k})\|_{(\Sigma_{h}^{b_{k}})^{-1}}\sqrt{\alpha_{h}^{b_{k}}}}_{(a)}+HK\mathcal{I}+\widetilde{O}\left(\sum_{h=1}^{H}d_{h}\sqrt{K}\right).\end{aligned}$

$\begin{aligned}(a)&\leq\sum_{h=1}^{H}2\sqrt{\alpha_{h}^{K}}\cdot\sqrt{K\sum_{k=1}^{K}\|\phi_{h}(s_{h}^{k},a_{h}^{k})\|_{(\Sigma_{h}^{b_{k}})^{-1}}^{2}}\\&\leq\sum_{h=1}^{H}2\sqrt{\alpha_{h}^{K}}\cdot\sqrt{2K\sum_{k=1}^{K}\|\phi_{h}(s_{h}^{k},a_{h}^{k})\|_{(\Sigma_{h}^{k})^{-1}}^{2}}\\&\leq\widetilde{O}\left(\sum_{h=1}^{H}\left(\sqrt{K}\,\mathcal{I}+\sqrt{d_{h}+d_{h+1}}\right)\cdot\sqrt{Kd_{h}}\right)\\&\leq\widetilde{O}\left(\sum_{h=1}^{H}\sqrt{d_{h}}\,K\mathcal{I}+\sum_{h=1}^{H}d_{h}\sqrt{K}\right),\end{aligned}$

$G_{up}=\{\ \ldots\ :\ \theta\in B_{d},\ 0\leq\gamma\leq\Gamma,\ A\succcurlyeq 0,\ \|A\|_{2}\leq 1\}.$

$\theta_{h}^{k}=\arg\min_{\|\theta\|_{2}\leq 1}\sum_{\tau=1}^{k-1}\left(f(\langle\phi(s_{h}^{\tau},a_{h}^{\tau}),\theta\rangle)-r_{h}^{\tau}-\max_{a^{\prime}\in\mathcal{A}}Q_{h+1}^{k}(s_{h+1}^{\tau},a^{\prime})\right)^{2}.$

$\text{Regret}(K)\leq\widetilde{O}\left(H\sqrt{d^{3}K}\right).$

$\det(\Sigma_{h_{i}}^{k_{i+1}})\geq 2\det(\Sigma_{h_{i}}^{k_{i}}).$

$\det(\Sigma_{h}^{k_{i+1}})\geq\det(\Sigma_{h}^{k_{i}}).$

$\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{i+1}})\geq 2\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{i}}).$

$K^{\sum_{h=1}^{H}d_{h}}\geq\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{N}})\geq 2^{N}\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{0}})=2^{N},$

$\left\|\sum_{i=1}^{k-1}\phi_{h}^{i}\left(r_{h}^{i}-r_{h}(s_{h}^{i},a_{h}^{i})+V_{h+1}(s_{h+1}^{i})-\mathbb{E}_{s^{\prime}\sim P_{h}(\cdot\mid s_{h}^{i},a_{h}^{i})}V_{h+1}(s^{\prime})\right)\right\|_{(\Sigma_{h}^{k})^{-1}}\leq\sqrt{\beta_{h}^{k}},$

$\theta_{h}^{\star}=\arg\min_{\theta\in\mathcal{B}_{h}}\sup_{(s,a)}\left|\phi_{h}(s,a)^{\top}\theta-(\mathcal{T}_{h}Q_{h+1}(\theta_{h+1}^{\star}))(s,a)\right|$

$\sup_{(s,a)}\left|Q_{h}^{\star}(s,a)-\phi_{h}(s,a)^{\top}\theta_{h}^{\star}\right|\leq(H-h+1)\mathcal{I}.$

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization

Full text

Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs

Dan Qiao

Department of Computer Science, UC Santa Barbara

Ming Yin

Department of Computer Science, UC Santa Barbara

Department of Statistics and Applied Probability, UC Santa Barbara

Yu-Xiang Wang

Department of Computer Science, UC Santa Barbara

Abstract

In many real-life reinforcement learning (RL) problems, deploying new policies is costly. In those scenarios, algorithms must solve exploration (which requires adaptivity) while switching the deployed policy sparsely (which limits adaptivity). In this paper, we go beyond the existing state-of-the-art on this problem that focused on linear Markov Decision Processes (MDPs) by considering linear Bellman-complete MDPs with low inherent Bellman error. We propose the ELEANOR-LowSwitching algorithm that achieves the near-optimal regret with a switching cost logarithmic in the number of episodes and linear in the time-horizon $H$ and feature dimension $d$ . We also prove a lower bound proportional to $dH$ among all algorithms with sublinear regret. In addition, we show the “doubling trick” used in ELEANOR-LowSwitching can be further leveraged for the generalized linear function approximation, under which we design a sample-efficient algorithm with near-optimal switching cost.

1 Introduction
1.1 Related works
2 Problem setup
2.1 Low inherent Bellman error
3 Main algorithm
4 Main results
5 Proof sketch
5.1 Upper bounds
5.2 Lower bound
6 Extension to generalized linear function approximation
6.1 Problem setup
6.2 Low switching algorithm
6.3 Main results of Algorithm 2
7 Conclusion and future work
A Proof of Theorem 4.1
A.1 Proof of switching cost bound
A.2 Proof of regret bound
B Proof of Theorem 4.2
C Proof for Section 6
C.1 Proof of upper bounds
C.2 Proof of lower bound
D Assisting technical lemmas

1 Introduction

In many real-world reinforcement learning (RL) tasks, limited computing resources make it challenging to apply fully adaptive algorithms that continually update the exploration policy. As a surrogate, it is more cost-effective to collect data in large batches using the current policy and make changes to the policy after the entire batch is completed. For example, in a recommendation system (Afsar et al., 2021), it is easier to gather new data quickly, but deploying a new policy takes longer as it requires significant computing and human resources. Therefore, it’s not feasible to switch policies based on real-time data, as typical RL algorithms would require. A practical solution is to run several experiments in parallel and make decisions on policy updates only after the entire batch has been completed. Similar limitations occur in other RL based applications such as healthcare (Yu et al., 2021), robotics (Kober et al., 2013), and new material design (Zhou et al., 2019), where the agent must minimize the number of policy updates while still learning an effective policy using a similar number of trajectories as fully-adaptive methods. On the theoretical side, Bai et al. (2019) brought up the definition of switching cost, which measures the number of policy updates. In this paper, we measure the adaptivity of online reinforcement learning algorithms via global switching cost, and we leave the formal definition to Section 2.

In recent years, there has been a growing interest in designing online reinforcement learning algorithms with low switching costs (Bai et al., 2019; Zhang et al., 2020; Qiao et al., 2022; Gao et al., 2021; Wang et al., 2021; Kong et al., 2021; Velegkas et al., 2022). While much progress has been made in achieving near-optimal results, most of the research has focused on the tabular MDP setting and the slightly more general linear MDP setting (Yang and Wang, 2019; Jin et al., 2020). However, linear MDP is still a restrictive model, and subsequent works have proposed a variety of more general settings, such as low inherent Bellman error (Zanette et al., 2020), generalized linear function approximation (Wang et al., 2019), low Bellman rank (Jiang et al., 2017), low rank (Agarwal et al., 2020), and low Bellman eluder dimension (Jin et al., 2021). Therefore, it is natural to question whether reinforcement learning with low switching cost is achievable under these more general MDP settings.

Our contributions. In this paper, we extend previous results under linear MDP to two natural extensions: linear Bellman-complete MDPs with low inherent Bellman error (Zanette et al., 2020) and MDPs with generalized linear function approximation (Wang et al., 2019). Under both settings, we design algorithms with near-optimal regret and switching cost. Our contributions are three-fold and summarized below.

• A new algorithm (Algorithm 1) based on “doubling trick” for regret minimization under the low inherent Bellman error setting that achieves global switching cost of $O(\sum_{h=1}^{H}d_{h}\log K)$ and regret of $\widetilde{O}\left(\sum_{h=1}^{H}d_{h}\sqrt{K}+\sum_{h=1}^{H}\sqrt{d_{h}}\mathcal{I}K\right)$, where $d_{h}$ is the dimension of feature map for the $h$-th layer, $\mathcal{I}$ is the inherent Bellman error and $K$ is the number of episodes (Theorem 4.1). The regret bound is known to be minimax optimal (Zanette et al., 2020).

• When the inherent Bellman error $\mathcal{I}=0$, we prove a nearly matching switching cost lower bound (Theorem 4.2) $\Omega(\sum_{h=1}^{H}d_{h})$ for any algorithm with sub-linear regret bound, which implies that the switching cost of our Algorithm 1 is optimal up to $\log K$ factor. When applied to linear MDP, Algorithm 1 achieves the same switching cost and better regret bound compared to the previous results (Gao et al., 2021; Wang et al., 2021).

• We leverage the “doubling trick” used in Algorithm 1 under the generalized linear function approximation setting and propose Algorithm 2 which achieves switching cost of $O(dH\log K)$ and regret of $\widetilde{O}\left(H\sqrt{d^{3}K}\right)$, where $d$ is the dimension of feature map (Theorem 6.4). We also prove a nearly matching switching cost lower bound of $\Omega(dH)$ for any algorithm with sub-linear regret bound (Theorem 6.5). The pair of results strictly generalize previous results under linear MDP (Gao et al., 2021; Wang et al., 2021).

1.1 Related works

There is a large and growing body of literature on the statistical theory of reinforcement learning that we will not attempt to thoroughly review. Detailed comparisons with existing work on reinforcement learning with low switching cost (Gao et al., 2021; Wang et al., 2021; Zhang et al., 2020; Qiao et al., 2022) are given in Table 1. Notably, the settings we consider are more general than the well-studied tabular or linear MDP, while our results for regret and switching cost are comparable to or better than the best known results under linear MDP (Gao et al., 2021; Wang et al., 2021). While there are low-adaptivity algorithms under settings more general than linear MDP, they either consider only pure exploration (without regret guarantee) (Jiang et al., 2017; Sun et al., 2019), or obtain results that are sub-optimal compared to ours (Kong et al., 2021; Velegkas et al., 2022).

In addition to switching cost, there are other measurements of adaptivity. The closest measurement is batched learning, which requires decisions about policy updates to be made at only a few (often predefined) checkpoints but does not constrain the number of policy switches. Batched learning has been considered both under bandits (Perchet et al., 2016; Gao et al., 2019) and RL (Wang et al., 2021; Qiao et al., 2022; Zhang et al., 2022b), although the settings are restricted to tabular MDP or linear MDP. Meanwhile, Matsushima et al. (2020) proposed the notion of deployment efficiency, which is similar to batched RL with the additional requirement that each policy deployment should have similar size. Deployment-efficient RL is studied by several follow-up works (Huang et al., 2022; Qiao and Wang, 2022; Modi et al., 2021). However, as pointed out by Qiao and Wang (2022), deployment complexity is not a good measurement of adaptivity when studying regret minimization.

Technically speaking, we build directly on ELEANOR (Zanette et al., 2020) and Algorithm 1 of Wang et al. (2019), both of which are fully adaptive. We apply the “doubling trick” when deciding whether to update the exploration policy, in order to achieve low switching cost. In particular, we show that the “information gain” used in previous works under linear MDP (Gao et al., 2021; Wang et al., 2021), namely the determinant of the empirical covariance matrix, can be extended to more general MDPs with linear approximation. Therefore, we only update the exploration policy when the “information gain” doubles, and the switching cost depends only logarithmically on the number of episodes $K$.

2 Problem setup

Notations. Throughout the paper, for $n\in\mathbb{Z}^{+}$ , $[n]=\{1,2,\cdots,n\}$ . We denote $\|x\|_{\Lambda}=\sqrt{x^{\top}\Lambda x}$ . For matrix $X\in\mathbb{R}^{d\times d}$ , $\|\cdot\|_{2}$ , $\det(\cdot)$ , $\lambda_{\min}(\cdot)$ , $\lambda_{\max}(\cdot)$ denote the operator norm, determinant, smallest eigenvalue and largest eigenvalue, respectively. In addition, we use standard notations such as $O$ and $\Omega$ to absorb constants while $\widetilde{O}$ and $\widetilde{\Omega}$ suppress logarithmic factors.

Markov Decision Processes. We consider finite-horizon episodic Markov Decision Processes (MDP) with non-stationary transitions, denoted by a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},H,P_{h},r_{h})$ (Sutton and Barto, 1998), where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space and $H$ is the horizon. The non-stationary transition kernel has the form $P_{h}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]$ with $P_{h}(s^{\prime}|s,a)$ representing the probability of transition from state $s$, action $a$ to next state $s^{\prime}$ at time step $h$. In addition, $r_{h}(s,a)\in\Delta([0,1])$ denotes the corresponding distribution of reward (we overload the notation $r$ so that $r$ also denotes the expected immediate reward function). Without loss of generality, we assume there is a fixed initial state $s_{1}$ (the general case where the initial state follows an arbitrary distribution can be recovered from this setting by adding one layer to the MDP). A policy can be seen as a series of mappings $\pi=(\pi_{1},\cdots,\pi_{H})$, where each $\pi_{h}$ maps each state $s\in\mathcal{S}$ to a probability distribution over actions, i.e. $\pi_{h}:\mathcal{S}\rightarrow\Delta(\mathcal{A})$, $\forall\,h\in[H]$. A random trajectory $(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H},s_{H+1})$ is generated by the following rule: $s_{1}$ is fixed, $a_{h}\sim\pi_{h}(\cdot|s_{h}),r_{h}\sim r_{h}(s_{h},a_{h}),s_{h+1}\sim P_{h}(\cdot|s_{h},a_{h}),\forall\,h\in[H]$. For normalization, we assume that $\sum_{h=1}^{H}r_{h}\in[0,1]$ almost surely.

$Q$-values, Bellman operator. Given a policy $\pi$ and any $h\in[H]$, the value function $V^{\pi}_{h}(\cdot)$ and Q-value function $Q^{\pi}_{h}(\cdot,\cdot)$ are defined as: $V^{\pi}_{h}(s)=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r_{t}|s_{h}=s],Q^{\pi}_{h}(s,a)=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r_{t}|s_{h},a_{h}=s,a],\;\forall\,s,a\in\mathcal{S}\times\mathcal{A}.$ Besides, the value function and Q-value function with respect to the optimal policy $\pi^{\star}$ are denoted by $V^{\star}_{h}(\cdot)$ and $Q^{\star}_{h}(\cdot,\cdot)$. Then the Bellman operator $\mathcal{T}_{h}$ applied to $Q_{h+1}$ is defined as

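$\mathcal{T}_{h}(Q_{h+1})(s,a)=r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim P_{h}(\cdot\mid s,a)}\max_{a^{\prime}}Q_{h+1}(s^{\prime},a^{\prime}).$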
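As a minimal sketch of the Bellman operator, the snippet below applies one backup, reward plus expected next-step maximum, on a small tabular MDP. The function name `bellman_backup`, the nested-list layout and all numbers are illustrative assumptions, not part of the paper.

```python
from typing import List


def bellman_backup(
    r_h: List[List[float]],
    P_h: List[List[List[float]]],
    Q_next: List[List[float]],
) -> List[List[float]]:
    """Return (T_h Q_{h+1})(s, a) for every state-action pair of a tabular MDP.

    r_h[s][a]      : expected immediate reward at step h
    P_h[s][a][s2]  : probability of moving from (s, a) to state s2 at step h
    Q_next[s2][a2] : Q-value estimate at step h + 1
    """
    # Inner maximisation over a': V_{h+1}(s') = max_{a'} Q_{h+1}(s', a').
    V_next = [max(row) for row in Q_next]
    return [
        [
            r_h[s][a] + sum(p * v for p, v in zip(P_h[s][a], V_next))
            for a in range(len(r_h[s]))
        ]
        for s in range(len(r_h))
    ]


# Hypothetical two-state, two-action example (all numbers are illustrative).
r_h = [[0.0, 0.1], [0.2, 0.0]]
P_h = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.25, 0.75]]]
Q_next = [[0.3, 0.5], [0.1, 0.0]]
backed_up = bellman_backup(r_h, P_h, Q_next)
```

In this toy instance the next-step values are 0.5 and 0.1, so for example the backup at state 0, action 1 is 0.1 + 0.5 * 0.5 + 0.5 * 0.1 = 0.4.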

Regret. We measure the performance of online reinforcement learning algorithms by the regret. The regret of an algorithm over $K$ episodes is defined as

$\text{Regret}(K):=\sum_{k=1}^{K}\left[V_{1}^{\star}(s_{1})-V_{1}^{\pi_{k}}(s_{1})\right],$

where $\pi_{k}$ is the policy it deploys at episode $k$ . Besides, we denote the total number of steps by $T:=KH$ .

Switching cost. We adopt the global switching cost (Bai et al., 2019), which simply measures how many times the algorithm changes its policy:

$N_{switch}:=\sum_{k=1}^{K-1}\mathds{1}\{\pi_{k}\neq\pi_{k+1}\}.$

Global switching cost is a widely applied measurement of the adaptivity of an online RL algorithm both under the tabular setting (Bai et al., 2019; Zhang et al., 2020; Qiao et al., 2022) and the linear MDP setting (Gao et al., 2021; Wang et al., 2021). Similar to previous works, our algorithm also uses deterministic policies only.
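The two performance measures above can be illustrated on a deployment log. This is a small sketch in which each deployed deterministic policy is represented by a hashable label; the helper names, the labels and the values are hypothetical.

```python
from typing import Hashable, Sequence


def regret(v_star: float, v_deployed: Sequence[float]) -> float:
    """Sum over episodes of V_1^*(s_1) - V_1^{pi_k}(s_1)."""
    return sum(v_star - v for v in v_deployed)


def global_switching_cost(policies: Sequence[Hashable]) -> int:
    """Count the episodes k in 1..K-1 for which pi_k != pi_{k+1}."""
    return sum(1 for cur, nxt in zip(policies, policies[1:]) if cur != nxt)


# Hypothetical log of K = 6 episodes: the policy deployed in each episode and
# its value at the initial state (illustrative numbers only).
deployed = ["pi_a", "pi_a", "pi_b", "pi_b", "pi_b", "pi_c"]
values = [0.2, 0.2, 0.5, 0.5, 0.5, 0.7]

total_regret = regret(0.8, values)
num_switches = global_switching_cost(deployed)
```

Here the policy changes after episodes 2 and 5, so the global switching cost is 2 even though six episodes were played.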

2.1 Low inherent Bellman error

In this part, we introduce the linear function approximation, the definition of inherent Bellman error (Zanette et al., 2020) and the connection between the low inherent Bellman error setting and the linear MDP setting (Jin et al., 2020).

To encode linear function approximation of the state space $\mathcal{S}$ , a common approach is to define a feature map $\phi_{h}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d_{h}}$ , which can be different across different timestep. Then the $Q$ -value functions are represented as linear functions of $\phi_{h}$ , i.e., $Q_{h}(s,a)=\phi_{h}(s,a)^{\top}\theta_{h}$ for some $\theta_{h}\in\mathbb{R}^{d_{h}}$ .

The feasible parameter class for timestep $h$ is defined as

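$\mathcal{B}_{h}:=\left\{\theta_{h}\in\mathbb{R}^{d_{h}}\mid|\phi_{h}(s,a)^{\top}\theta_{h}|\leq 1,\ \forall\,(s,a)\right\},$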

which is consistent with our assumption that $Q^{\pi}_{h}(s,a)\leq 1$ .

For each feasible parameter $\theta\in\mathcal{B}_{h}$ , the corresponding $Q$ -value function and value function are defined as

$Q_{h}(\theta)(s,a)=\phi_{h}(s,a)^{\top}\theta,\quad V_{h}(\theta)(s)=\max_{a}\phi_{h}(s,a)^{\top}\theta.$

Meanwhile, the associated function spaces are

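$\mathcal{Q}_{h}:=\{Q_{h}(\theta_{h})\mid\theta_{h}\in\mathcal{B}_{h}\},\quad\mathcal{V}_{h}:=\{V_{h}(\theta_{h})\mid\theta_{h}\in\mathcal{B}_{h}\}.$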
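To make the linear parameterization concrete, the following sketch evaluates the linear $Q$-value, the induced value function, the greedy action, and a membership check for the feasible class on a finite set of feature vectors. The helper names and the toy features are assumptions made for illustration; the paper's class is defined over all state-action pairs.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    return sum(a * b for a, b in zip(x, y))


def q_value(phi_sa: Vector, theta: Vector) -> float:
    """Q_h(theta)(s, a) = phi_h(s, a)^T theta."""
    return dot(phi_sa, theta)


def v_value(phi_s: Sequence[Vector], theta: Vector) -> float:
    """V_h(theta)(s) = max_a phi_h(s, a)^T theta; phi_s lists one feature per action."""
    return max(q_value(phi_sa, theta) for phi_sa in phi_s)


def greedy_action(phi_s: Sequence[Vector], theta: Vector) -> int:
    """Index of an action attaining the maximum in v_value."""
    return max(range(len(phi_s)), key=lambda a: q_value(phi_s[a], theta))


def in_feasible_class(theta: Vector, all_features: Sequence[Vector]) -> bool:
    """Check |phi^T theta| <= 1 on a finite list of feature vectors."""
    return all(abs(dot(phi, theta)) <= 1.0 for phi in all_features)


# Hypothetical state with three actions and 2-dimensional features (norm <= 1).
phi_s: List[List[float]] = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
theta = [0.5, -0.2]

state_value = v_value(phi_s, theta)
best_action = greedy_action(phi_s, theta)
feasible = in_feasible_class(theta, phi_s)
```

With these numbers the three $Q$-values are 0.5, -0.2 and 0.14, so the greedy action is the first one and the parameter is feasible on this feature set.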

Similar to Zanette et al. (2020), we make the following normalization assumption, which is without loss of generality.

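$\|\phi_{h}(s,a)\|_{2}\leq 1,\ \forall\,(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}.$

$\|\theta_{h}\|_{2}\leq\sqrt{d_{h}},\ \forall\,h\in[H],\ \theta_{h}\in\mathcal{B}_{h}.$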

Inherent Bellman error. For provably efficient learning, completeness assumption is widely adopted (Zanette et al., 2020; Wang et al., 2020; Jin et al., 2021). In this paper, we characterize the completeness by assuming an upper bound of the projection error when we project $\mathcal{T}_{h}Q_{h+1}$ ( $Q_{h+1}\in\mathcal{Q}_{h+1}$ ) to $\mathcal{Q}_{h}$ . Formally, we have the following definition of inherent Bellman error.

Definition 2.1.

The inherent Bellman error of an MDP with a known linear feature map $\{\phi_{h}(\cdot,\cdot)\}_{h\in[H]}$ is defined as the maximum over the timesteps $h\in[H]$ of

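$\sup_{\theta_{h+1}\in\mathcal{B}_{h+1}}\inf_{\theta_{h}\in\mathcal{B}_{h}}\sup_{s,a}\left|\phi_{h}(s,a)^{\top}\theta_{h}-(\mathcal{T}_{h}Q_{h+1}(\theta_{h+1}))(s,a)\right|.$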

Similar to Zanette et al. (2020), we assume the inherent Bellman error of the MDP is upper bounded by some (known) constant $\mathcal{I}\geq 0$ . Below we will show that this setting strictly generalizes the linear MDP setting (Jin et al., 2020).

Connections to linear MDP. Since linear MDP admits a transition kernel and reward function that are linear in a known feature map $\phi$, for any function $V(\cdot):\mathcal{S}\rightarrow\mathbb{R}$, $\mathcal{T}_{h}V(\cdot,\cdot)$ is a linear function of $\phi(\cdot,\cdot)$ (Jin et al., 2020). Therefore, a linear MDP with feature map $\phi$ and dimension $d$ is a special case of the low inherent Bellman error setting with $\mathcal{I}=0$, $\phi_{1}=\cdots=\phi_{H}=\phi$ and $d_{1}=\cdots=d_{H}=d$ (if ignoring the scale of rewards). More importantly, it is shown that an MDP with zero inherent Bellman error ($\mathcal{I}=0$) may not be a linear MDP (Zanette et al., 2020), which means that the setting in this paper is strictly more general and technically demanding than linear MDP. For more discussions about the low inherent Bellman error setting and relevant comparisons, please refer to Section 3 in Zanette et al. (2020).

3 Main algorithm

In this section, we propose our main algorithm: ELEANOR-LowSwitching (Algorithm 1) and the low switching design for global optimism-based algorithms.

We begin with the standard LSVI technique. At the beginning of the $k$ -th episode, assume the parameter for the $(h+1)$ -th layer is fixed to be $\theta_{h+1}$ . Then LSVI minimizes the following objective function with respect to $\theta$ :

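$\sum_{\tau=1}^{k-1}\left((\phi_{h}^{\tau})^{\top}\theta-r_{h}^{\tau}-V_{h+1}(\theta_{h+1})(s_{h+1}^{\tau})\right)^{2}+\lambda\|\theta\|_{2}^{2},\qquad(1)$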

where $\phi_{h}^{\tau}$ is short for $\phi_{h}(s_{h}^{\tau},a_{h}^{\tau})$ and $r_{h}^{\tau}$ is the reward encountered at layer $h$ of the $\tau$ -th episode. The minimization problem (1) has a closed form solution:

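$\widehat{\theta}_{h}=(\Sigma_{h}^{k})^{-1}\sum_{\tau=1}^{k-1}\phi_{h}^{\tau}\left[r_{h}^{\tau}+V_{h+1}(\theta_{h+1})(s_{h+1}^{\tau})\right],$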

where $\Sigma_{h}^{k}=\sum_{\tau=1}^{k-1}\phi_{h}^{\tau}(\phi_{h}^{\tau})^{\top}+\lambda I_{d_{h}}$ is the empirical covariance matrix.
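The closed-form LSVI step is a ridge regression of the targets $r_{h}^{\tau}+V_{h+1}(\theta_{h+1})(s_{h+1}^{\tau})$ on the features $\phi_{h}^{\tau}$. The dependency-free sketch below assumes the targets are already computed and uses a small Gaussian-elimination solver instead of a linear-algebra library; the names `solve`, `lsvi_closed_form` and `lsvi_objective` and the data are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lsvi_closed_form(features: Sequence[Vector], targets: Sequence[float],
                     lam: float = 1.0) -> Vector:
    """theta_hat = Sigma^{-1} sum_tau phi_tau * y_tau, Sigma = sum phi phi^T + lam I."""
    d = len(features[0])
    Sigma = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for phi, y in zip(features, targets):
        for i in range(d):
            b[i] += phi[i] * y
            for j in range(d):
                Sigma[i][j] += phi[i] * phi[j]
    return solve(Sigma, b)


def lsvi_objective(theta: Vector, features: Sequence[Vector],
                   targets: Sequence[float], lam: float = 1.0) -> float:
    """Regularised squared loss that lsvi_closed_form minimises."""
    fit = sum((sum(p * t for p, t in zip(phi, theta)) - y) ** 2
              for phi, y in zip(features, targets))
    return fit + lam * sum(t * t for t in theta)


# Hypothetical data: three past episodes at layer h, 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
targets = [0.5, 0.2, 0.3]  # y_tau = r_h^tau + V_{h+1}(theta_{h+1})(s_{h+1}^tau)
theta_hat = lsvi_closed_form(features, targets, lam=1.0)
```

In this example the covariance matrix is diagonal with entries 3 and 2, and the right-hand side is (0.8, 0.2), so the solution is (0.8/3, 0.1).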

Based on the standard LSVI, we introduce the global optimistic planning below, where an optimization problem is solved to derive the most optimistic estimate of the Q-value function at the initial state. At each episode where the policy is updated, Algorithm 1 solves the following problem.

Definition 3.1 (Optimistic planning).

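$\begin{aligned}\max_{\{\bar{\xi}_{h}\}_{h\in[H]},\,\{\widehat{\theta}_{h}\}_{h\in[H]},\,\{\bar{\theta}_{h}\}_{h\in[H]}}\quad&\max_{a}\phi_{1}(s_{1},a)^{\top}\bar{\theta}_{1}\\ \text{subject to}\quad&\widehat{\theta}_{h}=(\Sigma_{h}^{k})^{-1}\sum_{\tau=1}^{k-1}\phi_{h}^{\tau}\left(r_{h}^{\tau}+V_{h+1}(\bar{\theta}_{h+1})(s_{h+1}^{\tau})\right),\\&\bar{\theta}_{h}=\widehat{\theta}_{h}+\bar{\xi}_{h};\quad\|\bar{\xi}_{h}\|_{\Sigma_{h}^{k}}\leq\sqrt{\alpha_{h}^{k}};\quad\bar{\theta}_{h}\in\mathcal{B}_{h}.\end{aligned}$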

Definition 3.1 optimizes over the perturbation $\bar{\xi}_{h}$ added to the least square solution $\widehat{\theta}_{h}$ . The constraint on $\bar{\xi}_{h}$ is

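$\|\bar{\xi}_{h}\|_{\Sigma_{h}^{k}}\leq\sqrt{\alpha_{h}^{k}}:=\widetilde{O}\left(\sqrt{d_{h}+d_{h+1}}\right)+\sqrt{k}\,\mathcal{I},$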

where the definition of $\sqrt{\alpha_{h}^{k}}$ will be specified in Appendix A.2. As will be shown in the analysis, the first term accounts for the estimation error of the LSVI, while the second term accounts for the model misspecification (recall that $\mathcal{I}$ is inherent Bellman error). Finally, with high probability, there will be a valid solution of the optimization problem (details in Appendix A.2), and therefore Algorithm 1 is well posed.

About global optimism. We highlight that the optimization problem aims at being optimistic only at the initial state instead of choosing a value function everywhere optimistic, as in LSVI-UCB (Jin et al., 2020). Such global optimism effectively keeps the linear structure of our function class and reduces the dimension of the covering set, since we do not need to cover the quadratic bonus as in Jin et al. (2020).

Algorithmic design. We present the whole learning process in Algorithm 1. For linear function approximation, we characterize the “information gain” (the information we learned from interacting with the MDP) through the determinant of the empirical covariance matrix $\Sigma_{h}^{k}$ (line 5). To achieve low switching cost, we only update the exploration policy when the “information gain” doubles for some layer $h\in[H]$ (line 7), and each update means the information about some layer has doubled. As will be shown later, such “doubling schedule” will lead to a switching cost depending only logarithmically on $K$ , in stark contrast to its fully adaptive counterpart: ELEANOR (Zanette et al., 2020). When an update occurs, Algorithm 1 solves the optimization problem to derive $\{\bar{\theta}_{h}\}_{h\in[H]}$ ensuring global optimism (line 9), takes the greedy policy with respect to $\phi_{h}(\cdot,\cdot)^{\top}\bar{\theta}_{h}^{k}$ (line 10) and updates the empirical covariance matrix (line 11).
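The update rule described above can be sketched as a simulation that tracks only the covariance matrices and counts how often the doubling condition fires. This is not Algorithm 1 itself: the optimistic planning step is omitted, features are drawn at random from the unit ball rather than generated by a policy, $\lambda=1$ is assumed, and all names are hypothetical. The determinant and inverse are maintained with the matrix determinant lemma and the Sherman-Morrison formula.

```python
import math
import random
from typing import List


class LayerStats:
    """Tracks Sigma_h^{-1} and log det(Sigma_h) under rank-one updates."""

    def __init__(self, dim: int, lam: float = 1.0) -> None:
        self.inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                    for i in range(dim)]
        self.logdet = dim * math.log(lam)

    def add(self, phi: List[float]) -> None:
        d = len(phi)
        u = [sum(self.inv[i][j] * phi[j] for j in range(d)) for i in range(d)]
        gain = 1.0 + sum(p * x for p, x in zip(phi, u))  # 1 + phi^T Sigma^{-1} phi
        self.logdet += math.log(gain)  # det(Sigma + phi phi^T) = det(Sigma) * gain
        for i in range(d):  # Sherman-Morrison update of the inverse
            for j in range(d):
                self.inv[i][j] -= u[i] * u[j] / gain


def random_feature(rng: random.Random, dim: int) -> List[float]:
    """Random vector with Euclidean norm strictly below 1."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = rng.random() / math.sqrt(sum(x * x for x in g))
    return [scale * x for x in g]


def run_doubling_schedule(dims: List[int], num_episodes: int, seed: int = 0) -> int:
    """Count policy updates triggered when det(Sigma_h) doubles for some layer h."""
    rng = random.Random(seed)
    layers = [LayerStats(d) for d in dims]
    ref_logdet = [layer.logdet for layer in layers]  # values at the last update
    updates = 0
    for _ in range(num_episodes):
        # Check the trigger before playing the episode.
        if any(layer.logdet >= ref + math.log(2.0)
               for layer, ref in zip(layers, ref_logdet)):
            updates += 1  # here the real algorithm would re-plan its policy
            ref_logdet = [layer.logdet for layer in layers]
        for layer, d in zip(layers, dims):
            layer.add(random_feature(rng, d))  # one observed feature per layer
    return updates


dims = [3, 4, 2]  # hypothetical feature dimensions d_1, d_2, d_3
K = 2000
num_updates = run_doubling_schedule(dims, K)
```

Each update at least doubles the product of the reference determinants, and with unit-norm features and $\lambda=1$ each determinant is at most $K^{d_{h}}$, so the count stays below $\sum_{h}d_{h}\log_{2}K$ regardless of the random seed.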

Generalization over previous algorithms. If we remove the update rule in Algorithm 1 and solve Definition 3.1 at all episodes, our Algorithm 1 will degenerate to ELEANOR (Zanette et al., 2020). Compared to ELEANOR, our Algorithm 1 achieves the same regret bound (shown later) and near-optimal switching cost. Meanwhile, Algorithm 1 also strictly generalizes the RARELY SWITCHING OFUL algorithm (Abbasi-Yadkori et al., 2011) designed for linear bandits. Taking $H=1$, both our Algorithm 1 and our guarantees (for regret and switching cost) strictly subsume those of RARELY SWITCHING OFUL. In conclusion, we show that low switching cost is possible for RL algorithms with global optimism.

Computational efficiency. Although Algorithm 1 is shown to be near optimal both in regret and switching cost, the implementation of the optimization problem is inefficient in general. This is because the max operator breaks the quadratic structure of the constraints. Such issue also exists for our fully adaptive counterpart: ELEANOR (Zanette et al., 2020), and other algorithms based on global optimism (Jiang et al., 2017; Sun et al., 2019; Jin et al., 2021). We leave the improvement of computation as future work.

4 Main results

In this section, we present our main results. We begin with the upper bounds for regret and switching cost. Recall that we assume $\sum_{h=1}^{H}r_{h}\in[0,1]$ almost surely, while $d_{h}$ represents the dimension of the feature map for the $h$ -th layer and $\mathcal{I}$ is inherent Bellman error.

Theorem 4.1 (Main theorem).

The global switching cost of Algorithm 1 is bounded by $O(\sum_{h=1}^{H}d_{h}\cdot\log K)$ . In addition, with probability $1-\delta$ , the regret of Algorithm 1 over $K$ episodes is bounded by

$$\widetilde{O}\left(\sum_{h=1}^{H}d_{h}\sqrt{K}+\sum_{h=1}^{H}\sqrt{d_{h}}\mathcal{I}K\right).$$

The proof of Theorem 4.1 is sketched in Section 5.1 with details in the Appendix. Below we discuss several interesting aspects of Theorem 4.1.

Near-optimal switching cost. Our algorithm achieves a switching cost that depends logarithmically on $K$ , which improves the $O(K)$ switching cost of ELEANOR (Zanette et al., 2020). We also prove the following information-theoretic limit which says that the switching cost of Algorithm 1 is optimal up to logarithmic factors. Since it is impossible to get sub-linear regret bound with positive inherent Bellman error, we only consider the case where $\mathcal{I}=0$ .

Theorem 4.2 (Lower bound for no-regret learning).

Assume that the inherent Bellman error $\mathcal{I}=0$ and $d_{h}\geq 3$ for all $h\in[H]$ . Then for any algorithm with sub-linear regret bound, the global switching cost is at least $\Omega(\sum_{h=1}^{H}d_{h})$ .

The proof of Theorem 4.2 is sketched in Section 5.2 with details in the Appendix.

Application to linear MDP. As discussed in Section 2.1, linear MDP with dimension $d$ is a special case of the low inherent Bellman error setting with $\mathcal{I}=0$ , $d_{1}=d_{2}=\cdots=d_{H}=d$ . Therefore, when applied to linear MDP, our Algorithm 1 will have switching cost bounded by $O(dH\log K)$ and regret bounded by $\widetilde{O}(\sqrt{d^{2}H^{3}T})$ , where $T=KH$ . (Footnote 3: When transferring Theorem 4.1 to linear MDP, we need to rescale the reward function by $H$ , and therefore there will be an additional factor of $H$ in our regret bound.) Compared to current algorithms achieving low switching cost under linear MDP (Gao et al., 2021; Wang et al., 2021), we achieve the same switching cost and a regret bound better by a factor of $\sqrt{d}$ . The improvement on regret bound results from global optimism and a smaller linear function class. More importantly, the low inherent Bellman error setting is indeed a harder setting than linear MDP. According to Theorem 2 in Zanette et al. (2020), the regret of our Algorithm 1 is minimax optimal. Together with the lower bound of switching cost (Theorem 4.2), Theorem 4.1 is generally not improvable both in regret and global switching cost.
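
As a sketch of the arithmetic behind the linear MDP rate, take the $\widetilde{O}(\sum_{h=1}^{H}d_{h}\sqrt{K})$ regret of Algorithm 1 with $\mathcal{I}=0$ and $d_{h}=d$ , multiply by the extra factor of $H$ that comes from rescaling the reward, and substitute $T=KH$ :

$$H\cdot\widetilde{O}\left(dH\sqrt{K}\right)=\widetilde{O}\left(\sqrt{d^{2}H^{4}K}\right)=\widetilde{O}\left(\sqrt{d^{2}H^{3}\cdot KH}\right)=\widetilde{O}\left(\sqrt{d^{2}H^{3}T}\right).$$

The intermediate form $\widetilde{O}(\sqrt{d^{2}H^{4}K})$ is the one reported in Table 1.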

Application to misspecified linear bandits. Taking $H=1$ , an MDP with low inherent Bellman error will become a linear bandit (Lattimore and Szepesvári, 2020) with model misspecification. For simplicity, we only consider the case where there is no misspecification (i.e. $\mathcal{I}=0$ ), as studied in Abbasi-Yadkori et al. (2011). Our result is summarized in the following corollary.

Corollary 4.3 (Results under linear bandit).

Suppose $H=1$ and $\mathcal{I}=0$ , then the MDP reduces to a linear bandit with dimension $d$ . Our Algorithm 1 will reduce to the RARELY SWITCHING OFUL algorithm (Figure 3 in Abbasi-Yadkori et al. (2011)) and is computationally efficient. The global switching cost of Algorithm 1 is $O(d\log K)$ , while the regret can be bounded by $\widetilde{O}(d\sqrt{K})$ with high probability.

The above corollary is derived by directly plugging $H=1$ and $d_{1}=d$ into Theorem 4.1. Note that our Corollary 4.3 matches the results in Abbasi-Yadkori et al. (2011), and our Algorithm 1 can be applied under the more general case with model misspecification. Therefore, our results can be seen as a strict generalization of Abbasi-Yadkori et al. (2011).

5 Proof sketch

Due to the space constraint, we sketch the proof in this section while more details are deferred to the Appendix. We begin with the proof overview of Theorem 4.1.

5.1 Upper bounds

Upper bound of switching cost. Let $\{k_{1},k_{2},\cdots,k_{N}\}$ be the episodes where the algorithm updates the policy (N is the global switching cost), and we also define $k_{0}=0$ .

According to the update rule (line 7 of Algorithm 1), every time the policy is updated, at least one $\det(\Sigma_{h}^{k})$ doubles while no determinant decreases, which implies that $\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{i}})\geq 2\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{i-1}})$ for all $i\in[N]$ . This further implies

$$\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{N}})\geq 2^{N}\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{0}}).$$

Since the left hand side can be upper bounded by $K^{\sum_{h=1}^{H}d_{h}}$ (details in Lemma D.2) and the right hand side is just $2^{N}$ (from definition), the global switching cost (i.e. $N$ ) is bounded by $O(\sum_{h=1}^{H}d_{h}\log K)$ .

Below we give a proof overview of the regret bound.

Upper bound of regret. We denote $\bar{Q}_{h}^{k}(\cdot,\cdot)=Q_{h}(\bar{\theta}_{h}^{k})(\cdot,\cdot)=\phi_{h}(\cdot,\cdot)^{\top}\bar{\theta}_{h}^{k}$ , where $\bar{\theta}_{h}^{k}$ is the solution of Definition 3.1 at the $k$ -th episode. Similarly, $\bar{V}_{h}^{k}(\cdot)=V_{h}(\bar{\theta}_{h}^{k})(\cdot)$ . In addition, let $b_{k}$ denote the last policy update before episode $k$ , for all $k\in[K]$ .

Based on concentration inequalities of self-normalized processes, we can show that with high probability, the “best feasible” approximant parameter $\theta^{\star}$ (Definition A.3) is a feasible solution of Definition 3.1. Therefore, the $\bar{V}_{1}^{k}(s_{1})$ is always a nearly optimistic estimate of $V_{1}^{\star}(s_{1})$ (summarized in Lemma A.5) and we only need to bound

[TABLE]

Meanwhile, the pointwise Bellman error can be bounded as (this result is stated in Lemma A.6)

[TABLE]

where $\sqrt{\alpha_{h}^{k}}\leq\sqrt{K}\mathcal{I}+\widetilde{O}(\sqrt{d_{h}+d_{h+1}})$ .

As a result, applying regret decomposition across different layers $h\in[H]$ and bounding the martingale difference by Azuma-Hoeffding inequality (Lemma D.1), we have

[TABLE]

Due to our update rule based on $\det(\Sigma_{h}^{k})$ , we have

[TABLE]

where the second inequality holds because of Lemma D.3 and our update rule. The third inequality is from elliptical potential lemma (Lemma D.4).

Finally, the regret bound results from plugging (6) into (5).

5.2 Lower bound

In this part, we sketch the proof of Theorem 4.2.

We construct a hard MDP case with zero inherent Bellman error ( $\mathcal{I}=0$ ), which has deterministic transition kernel. Therefore, deploying some deterministic policy will lead to a deterministic trajectory, like pulling an “arm” in the multi-armed bandits (MAB) setting. We further show that the number of such “arms” is at least $\Omega(\sum_{h=1}^{H}d_{h})$ . Together with the lower bounds of switching cost in multi-armed bandits (Qiao et al., 2022), we can derive the $\Omega(\sum_{h=1}^{H}d_{h})$ lower bound under the low inherent Bellman error setting.

6 Extension to generalized linear function approximation

In this section, we consider low adaptive reinforcement learning with generalized linear function approximation (Wang et al., 2019). We show that the same “doubling schedule” for updating exploration policy (line 7 of Algorithm 1) can be leveraged under this setting, which enables the design of provably efficient algorithms. We begin with the introduction of generalized linear function approximation.

6.1 Problem setup

Different from the low inherent Bellman error setting which characterizes $Q^{\star}$ using linear functions, we use a function class of generalized linear models (GLMs) to model $Q^{\star}$ . We denote the dimension of feature map by $d$ and define $\mathbb{B}_{d}=\{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1\}$ .

Definition 6.1 (GLM (Wang et al., 2019)).

For a known feature map $\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{B}_{d}$ and a known link function $f:[-1,1]\rightarrow[-1,1]$ , the class of generalized linear models is $\mathcal{G}=\{(s,a)\rightarrow f(\langle\phi(s,a),\theta\rangle):\theta\in\mathbb{B}_{d}\}$ .

Similar to Wang et al. (2019), we make the following standard assumption which is without loss of generality.

Assumption 6.2.

$f(\cdot)$ is either monotonically increasing or decreasing. Furthermore, there exist absolute constants $0<\kappa_{1}<\kappa_{2}<\infty$ and $M<\infty$ such that $\kappa_{1}\leq|f^{\prime}(z)|\leq\kappa_{2}$ and $|f^{\prime\prime}(z)|\leq M$ , for all $|z|\leq 1$ .

This assumption is naturally satisfied by the identity map $f(z)=z$ and also includes other non-linear maps such as the logistic map $f(z)=1/(1+e^{-z})$ .
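
To make the constants in Assumption 6.2 concrete, the following sketch estimates $\kappa_{1}$ , $\kappa_{2}$ and $M$ on a grid over $[-1,1]$ for the two link functions mentioned above. The helper `link_constants` and the grid resolution are hypothetical choices made here for illustration; the derivatives are the standard closed forms for the identity and logistic maps.

```python
import numpy as np


def link_constants(f_prime, f_second, num_points=20001):
    """Grid estimates of (kappa_1, kappa_2, M) over z in [-1, 1]."""
    z = np.linspace(-1.0, 1.0, num_points)
    d1 = np.abs(f_prime(z))
    d2 = np.abs(f_second(z))
    # kappa_1 = min |f'|, kappa_2 = max |f'|, M = max |f''|
    return float(d1.min()), float(d1.max()), float(d2.max())


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def logistic_second(z):
    s = sigmoid(z)
    return s * (1.0 - s) * (1.0 - 2.0 * s)


# Identity link f(z) = z: f' = 1 and f'' = 0 everywhere.
identity = link_constants(lambda z: np.ones_like(z), lambda z: np.zeros_like(z))
# Logistic link f(z) = 1 / (1 + exp(-z)).
logistic = link_constants(logistic_prime, logistic_second)
```

For the logistic map, $|f^{\prime}|$ is largest at $z=0$ (value $1/4$ ) and smallest at $|z|=1$ , so both bounds are finite and strictly positive on $[-1,1]$ .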

To characterize completeness under this function class, Wang et al. (2019) assumes the function class is closed with respect to the Bellman operator $\mathcal{T}_{h}$ (defined in Section 2). Similarly, we make the same optimistic closure assumption below. Note that for a fixed constant $\Gamma>0$ (footnote 4: $\Gamma$ will be set to depend polynomially on $d$ and $\log K$ ), the enlarged function class $\mathcal{G}_{\text{up}}$ is defined as

[TABLE]

Then the optimistic closure assumption is stated below.

Assumption 6.3.

For all $h\in[H]$ and $g\in\mathcal{G}_{\text{up}}$ , we have $\mathcal{T}_{h}(g)\in\mathcal{G}$ .

According to Proposition 1 of Wang et al. (2019), this assumption strictly generalizes the standard linear MDP setting by allowing link functions with more expressivity.

6.2 Low switching algorithm

We present our Algorithm 2 below. Intuitively speaking, the algorithmic idea is to apply doubling schedule to Algorithm 1 of Wang et al. (2019). Similar to Algorithm 1, we only update the exploration policy when the “information gain” with respect to some layer has doubled (line 7). When the policy is updated, the LSVI step calculates an estimate of $\theta^{\star}$ (the parameter w.r.t. the real $Q^{\star}$ function) iteratively from the $H$ -th layer to the first layer through minimizing (7). Then the optimistic $Q$ value function is constructed by adding a bonus term $\gamma\|\phi(\cdot,\cdot)\|_{(\Sigma_{h}^{k})^{-1}}$ to the empirical estimate $f(\phi(\cdot,\cdot)^{\top}\theta_{h}^{k})$ (line 11). Finally, the greedy policy is deployed for collecting data (line 12, 18).
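
The optimistic value described above, an empirical estimate plus a bonus, can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 2: the regression step that produces $\theta_{h}^{k}$ is not shown, any truncation of the value is omitted, and the function names `optimistic_q` and `greedy_action` are hypothetical.

```python
import numpy as np


def optimistic_q(phi, theta, sigma, gamma, link):
    """Return f(phi^T theta) + gamma * ||phi||_{Sigma^{-1}} for one feature vector."""
    estimate = link(phi @ theta)
    # ||phi||_{Sigma^{-1}} = sqrt(phi^T Sigma^{-1} phi); solve a linear system
    # instead of forming the inverse explicitly.
    bonus = gamma * np.sqrt(phi @ np.linalg.solve(sigma, phi))
    return estimate + bonus


def greedy_action(features, theta, sigma, gamma, link):
    """Index of the action whose feature vector has the largest optimistic value."""
    values = [optimistic_q(phi, theta, sigma, gamma, link) for phi in features]
    return int(np.argmax(values))
```

With a covariance matrix that has seen a direction many times, the bonus along that direction shrinks, so the greedy choice can move to a less explored action even when its point estimate is lower.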

6.3 Main results of Algorithm 2

In this part, we state the main results about Algorithm 2. We begin with the upper bounds for regret and switching cost. Recall that we still assume $\sum_{h=1}^{H}r_{h}\in[0,1]$ almost surely, while $d$ represents the dimension of the feature map.

Theorem 6.4 (Main results).

The global switching cost of Algorithm 2 is bounded by $O(dH\cdot\log K)$ . In addition, with probability $1-\delta$ , the regret of Algorithm 2 over $K$ episodes is bounded by

[TABLE]

The proof of Theorem 6.4 is deferred to Appendix C due to the space limit. Below we discuss several interesting aspects of Theorem 6.4.

Near-optimal switching cost. Our algorithm achieves a switching cost that depends logarithmically on $K$ , which improves the $O(K)$ switching cost of Algorithm 1 in Wang et al. (2019). We also prove the following information-theoretic limit which says that the switching cost of Algorithm 2 is optimal up to logarithmic factors.

Theorem 6.5 (Lower bound for no-regret learning).

For any algorithm with sub-linear regret bound, the global switching cost is at least $\Omega(dH)$ .

Theorem 6.5 is adapted from the lower bound for global switching cost under linear MDP (Gao et al., 2021), and we leave the proof to Appendix C.

Generalization over previous results. The closest result to our Algorithm 2 is the fully adaptive Algorithm 1 of Wang et al. (2019), which achieves the same $\widetilde{O}\left(H\sqrt{d^{3}K}\right)$ regret bound. In comparison, our Algorithm 2 additionally achieves a near-optimal global switching cost, which saves computation and accelerates the learning process.

When applying our Algorithm 2 to the linear MDP case, our Theorem 6.4 will imply a regret bound of $\widetilde{O}(\sqrt{d^{3}H^{3}T})$ ( $T=KH$ ) and a global switching cost of $O(dH\log K)$ , which recovers the results in Gao et al. (2021); Wang et al. (2021). (Footnote 5: The identity link function corresponds to $\kappa_{1}=\kappa_{2}=1$ and $M=0$ . In addition, due to rescaling of reward functions, there will be an additional $H$ factor in the regret bound of Theorem 6.4.) Therefore, our result can be considered as a generalization of these two results since GLMs allow more general function classes.

7 Conclusion and future work

This paper studied the well-motivated problem of online reinforcement learning with low switching cost. Under linear Bellman-complete MDP with low inherent Bellman error, we designed an algorithm (Algorithm 1) with near optimal regret bound of $\widetilde{O}\left(\sum_{h=1}^{H}d_{h}\sqrt{K}+\sum_{h=1}^{H}\sqrt{d_{h}}\mathcal{I}K\right)$ and global switching cost bound of $O(\sum_{h=1}^{H}d_{h}\cdot\log K)$ . In addition, we proved a (nearly) matching global switching cost lower bound $\Omega(\sum_{h=1}^{H}d_{h})$ for any algorithm with sub-linear regret. At the same time, we leveraged the same “doubling trick” under the generalized linear function approximation setting, and designed a sample-efficient algorithm (Algorithm 2) with near optimal switching cost.

Although being more general than linear MDP, the two settings we consider are not the most general ones. The low Bellman eluder dimension setting (Jin et al., 2021) and MDP with differentiable function approximation (Zhang et al., 2022a) can be considered as generalization of the two settings in this paper, respectively. Therefore, our results can be considered as a middle step towards low switching reinforcement learning under more general MDP settings. For further extension, it will be interesting to find out whether low switching cost RL is possible under more general MDP settings (e.g., low Bellman eluder dimension (Jin et al., 2021), differentiable function class (Zhang et al., 2022a; Yin et al., 2023)), and we leave these as future work.

Acknowledgments

The research is partially supported by NSF Award #2007117.

Appendix A Proof of Theorem 4.1

In this section, we prove our main theorem. We first restate Theorem 4.1 below, and then prove the bounds for switching cost and regret in Section A.1 and Section A.2, respectively.

Theorem A.1 (Restate Theorem 4.1).

The global switching cost of Algorithm 1 is bounded by $O(\sum_{h=1}^{H}d_{h}\cdot\log K)$ . In addition, with probability $1-\delta$ , the regret of Algorithm 1 over $K$ episodes is bounded by

$$\widetilde{O}\left(\sum_{h=1}^{H}d_{h}\sqrt{K}+\sum_{h=1}^{H}\sqrt{d_{h}}\mathcal{I}K\right).$$

A.1 Proof of switching cost bound

Proof of switching cost bound.

Let $\{k_{1},k_{2},\cdots,k_{N}\}$ be the episodes where the algorithm updates the policy, and we also define $k_{0}=0$ .

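According to the update rule (line 7 of Algorithm 1), for all $i\in[N]$ , there exists some $h_{i}\in[H]$ such that

$$\det(\Sigma_{h_{i}}^{k_{i}})\geq 2\det(\Sigma_{h_{i}}^{k_{i-1}}).$$

In addition, for all $(h,i)\in[H]\times[N]$ , we have

$$\det(\Sigma_{h}^{k_{i}})\geq\det(\Sigma_{h}^{k_{i-1}}).$$

Combining these two results, we have for all $i\in[N]$ ,

$$\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{i}})\geq 2\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{i-1}}).\qquad(8)$$

Therefore, it holds that

$$K^{\sum_{h=1}^{H}d_{h}}\geq\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{N}})\geq 2^{N}\prod_{h=1}^{H}\det(\Sigma_{h}^{k_{0}})=2^{N},\qquad(9)$$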

where the first inequality is because of Lemma D.2 and our choice that $\lambda=1$ . The second inequality is due to recursive application of (8). The last equation holds since we have $\Sigma_{h}^{k_{0}}=I_{d_{h}}$ for all $h$ .

Solving (9), we have $N\leq\frac{\sum_{h=1}^{H}d_{h}\log K}{\log 2}=O(\sum_{h=1}^{H}d_{h}\log K)$ , and therefore the proof is complete. ∎
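
As a hedged numerical illustration of this bound (the values $H=5$ , $d_{h}=10$ for all $h$ , and $K=10^{6}$ are chosen here for illustration and do not come from the paper):

$$N\leq\frac{\sum_{h=1}^{H}d_{h}\log K}{\log 2}=50\cdot\log_{2}(10^{6})\approx 50\times 19.93\approx 996.6,$$

so under these values the policy is updated at most $996$ times over one million episodes.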

A.2 Proof of regret bound

We first state some technical lemmas from Zanette et al. [2020]. We begin with the following bound on failure probability.

Lemma A.2 (Lemma 2 of Zanette et al. [2020]).

With probability at least $1-\delta/2$ , for all $k\in[K]$ , $h\in[H]$ , $V_{h+1}\in\mathcal{V}_{h+1}$ ,

[TABLE]

where $\sqrt{\beta_{h}^{k}}:=\sqrt{d_{h}\log(1+k/d_{h})+2d_{h+1}\log(1+4\sqrt{kd_{h}})+\log(\frac{2KH}{\delta})}+1=\widetilde{O}(\sqrt{d_{h}+d_{h+1}})$ .

Next, we define the “best” feasible parameters $\theta^{\star}$ that well approximate the $Q^{\star}$ values, and such parameters are going to be a feasible solution for the optimization problem (Definition 3.1). Then we state the accuracy bound of $\theta^{\star}$ .

Definition A.3 (Best feasible approximant, Definition 4 of Zanette et al. [2020]).

We recursively define the best approximant parameter $\theta^{\star}_{h}$ for $h\in[H]$ as:

[TABLE]

with ties broken arbitrarily and $\theta^{\star}_{H+1}=0$ .

Lemma A.4 (Accuracy Bound of $\theta^{\star}$ , Lemma 6 of Zanette et al. [2020]).

It holds that for all $h\in[H]$ :

[TABLE]

For notational simplicity, for $\bar{\theta}_{h}$ which is the solution of Definition 3.1, we denote $\bar{Q}_{h}(\cdot,\cdot)=Q_{h}(\bar{\theta}_{h})(\cdot,\cdot)=\phi_{h}(\cdot,\cdot)^{\top}\bar{\theta}_{h}$ . Besides, $\bar{Q}_{h}^{k}$ represents $Q_{h}(\bar{\theta}_{h}^{k})$ where $\bar{\theta}_{h}^{k}$ is the solution at the $k$ -th episode. Similarly, $\bar{V}_{h}(\cdot)=V_{h}(\bar{\theta}_{h})(\cdot)$ and $\bar{V}_{h}^{k}(\cdot)=V_{h}(\bar{\theta}_{h}^{k})(\cdot)$ . In addition, let $b_{k}$ denote the last policy update before episode $k$ , for all $k\in[K]$ .

Lemma A.5 (Optimism, Lemma 7 of Zanette et al. [2020]).

Under the high probability case in Lemma A.2, if we choose $\sqrt{\alpha_{h}^{k}}=\sqrt{\beta_{h}^{k}}+\sqrt{k}\mathcal{I}+\sqrt{d_{h}}=\sqrt{k}\mathcal{I}+\widetilde{O}(\sqrt{d_{h}+d_{h+1}})$ , then $\bar{\theta}_{h}=\theta_{h}^{\star}$ , for all $h\in[H]$ is a feasible solution of the optimization problem (Definition 3.1). Therefore, for all $k\in[K]$ , the optimistic value function satisfies

[TABLE]

In addition to optimism, we also have the following upper bound of Bellman error.

Lemma A.6 (Bound of Bellman error, Lemma 1 of Zanette et al. [2020]).

Under the high probability case in Lemma A.2, it holds that for all $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$ ,

[TABLE]

Now we are ready to present the regret analysis of Algorithm 1.

Proof of regret bound.

We prove based on the high probability case in Lemma A.2.

First of all, the regret over $K$ episodes can be decomposed as

[TABLE]

where the last inequality results from Lemma A.5.

Since $\bar{V}_{h}^{b_{k}}(s_{h}^{k})=\bar{Q}_{h}^{b_{k}}(s_{h}^{k},a_{h}^{k})$ due to our choice of $\pi_{k}$ , it holds that for all $(k,h)\in[K]\times[H]$ ,

[TABLE]

where the inequality holds because of Lemma A.6.

Plugging (16) into (15), we have with probability $1-\delta$ ,

[TABLE]

where the second inequality is because of (16). The last inequality holds with high probability due to Azuma-Hoeffding inequality (Lemma D.1) and the fact that $\|\bar{V}_{h+1}^{b_{k}}\|_{\infty}\leq\|\bar{\theta}_{h+1}^{b_{k}}\|_{2}\leq\sqrt{d_{h+1}}$ for any $k\in[K]$ .

Finally, it holds that

[TABLE]

where the second inequality holds according to Cauchy-Schwarz inequality and the fact that $\alpha_{h}^{k}$ is non-decreasing in $k$ . The third inequality results from Lemma D.3 and the fact that $\det((\Sigma_{h}^{b_{k}})^{-1})=\det(\Sigma_{h}^{b_{k}})^{-1}\leq 2\det(\Sigma_{h}^{k})^{-1}=2\det((\Sigma_{h}^{k})^{-1})$ . The fourth inequality is because of elliptical potential lemma (Lemma D.4). The fifth inequality is derived by the definition of $\alpha_{h}^{K}$ (from Lemma A.5). The last inequality comes from direct calculation.

The regret analysis is complete. ∎

Appendix B Proof of Theorem 4.2

In this section, we prove our lower bound of switching cost.

Theorem B.1 (Restate Theorem 4.2).

Assume that the inherent Bellman error $\mathcal{I}=0$ and $d_{h}\geq 3$ for all $h\in[H]$ . Then for any algorithm with sub-linear regret bound, the global switching cost is at least $\Omega(\sum_{h=1}^{H}d_{h})$ .

We first briefly discuss our assumptions. We assume zero inherent Bellman error (i.e. $\mathcal{I}=0$ ) since it is possible to derive sub-linear regret bounds only if $\mathcal{I}=0$ , and we want to derive lower bounds of switching cost for algorithms with sub-linear regret. Otherwise, the regret bound will always be linear in $K$ . Also, the assumption that $d_{h}\geq 3$ for all $h\in[H]$ is without loss of generality.

Proof of Theorem B.1.

We first construct an MDP with two states, the initial state $s_{1}$ and the absorbing state $s_{2}$ .

For absorbing state $s_{2}$ , the choice of action is only $a_{0}$ , while for initial state $s_{1}$ , the choice of actions at layer $h$ is $\{a_{1},a_{2},\cdots,a_{d_{h}-1}\}$ . Then we define the $d_{h}$ -dimensional feature map for the $h$ -th layer:

[TABLE]

where for $s_{1},a_{i}$ ( $i\in[d_{h}-1]$ ), the $(i+1)$ -th element is $1$ while all other elements are $0$ .

We now define the transition kernel and reward function as $P_{h}(s_{2}|s_{2},a_{0})=1$ , $r_{h}(s_{2},a_{0})=0$ , $P_{h}(s_{1}|s_{1},a_{1})=1$ , $r_{h}(s_{1},a_{1})=0$ for all $h\in[H]$ . Besides, $P_{h}(s_{2}|s_{1},a_{i})=1$ , $r_{h}(s_{1},a_{i})=r_{h,i}$ for all $h\in[H]$ and $2\leq i\leq d_{h}-1$ , where $r_{h,i}$ ’s are unknown non-zero values. Note that such an MDP has zero inherent Bellman error ( $\mathcal{I}=0$ ) since the function class $\{\phi_{h}(s,a)^{\top}\theta_{h}\;|\;\theta_{h}\in\mathcal{B}_{h}\}$ includes all possible Q-value functions.

Therefore, for any deterministic policy, the only possible case is that the agent takes action $a_{1}$ and stays at $s_{1}$ for the first $h-1$ steps, then at step $h$ the agent takes action $a_{i}$ ( $i\geq 2$ ) and transitions to $s_{2}$ with reward $r_{h,i}$ , and afterwards the agent always stays at $s_{2}$ with no more reward. For this trajectory, the total reward will be $r_{h,i}$ . Also, for any deterministic policy, the trajectory is fixed, like pulling an “arm” in the multi-armed bandit setting. Note that the total number of such “arms” with non-zero unknown reward is at least $\sum_{h=1}^{H}(d_{h}-2)=\Omega(\sum_{h=1}^{H}d_{h})$ due to our assumption that $d_{h}\geq 3$ . Even if the transition kernel is known to the agent, this MDP is still as difficult as a multi-armed bandit problem with $\Omega(\sum_{h=1}^{H}d_{h})$ arms. Together with Lemma B.2 below, the proof is complete. ∎

Lemma B.2 (Lemma H.4 of Qiao et al. [2022]).

For any algorithm with sub-linear regret bound under $K$ -armed bandit problem, the switching cost is at least $\Omega(K)$ .
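
The reduction to bandit arms in the proof of Theorem B.1 can be checked by enumerating the trajectories of the two-state MDP. The sketch below is an illustration under assumptions: the reward table is a hypothetical stand-in for the unknown values $r_{h,i}$ , the leaving actions at layer $h$ are taken to be $a_{2},\dots,a_{d_{h}-1}$ (the actions other than $a_{1}$ in the stated action set), and the names `rollout` and `enumerate_arms` are introduced here.

```python
def rollout(dims, rewards, leave_layer, leave_action):
    """Total reward of the policy that plays a_1 before `leave_layer`, then `leave_action`."""
    state, total = "s1", 0.0
    for h in range(1, len(dims) + 1):
        if state == "s2":
            action = 0              # only a_0 is available in s2
        elif h < leave_layer:
            action = 1              # a_1 keeps the agent in s1 with reward 0
        else:
            action = leave_action
        if state == "s1" and action >= 2:
            total += rewards[(h, action)]  # r_{h,i} is collected exactly once
            state = "s2"                   # s2 is absorbing with reward 0
    return total


def enumerate_arms(dims, rewards):
    """Map each (layer, action) pair that leaves s1 to the total reward of its trajectory."""
    arms = {}
    for h, d_h in enumerate(dims, start=1):
        for i in range(2, d_h):     # actions a_2, ..., a_{d_h - 1}
            arms[(h, i)] = rollout(dims, rewards, h, i)
    return arms
```

Each key of the returned dictionary plays the role of one bandit arm, and the number of keys equals $\sum_{h=1}^{H}(d_{h}-2)$ .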

Appendix C Proof for Section 6

In this section, we prove the theorems regarding our Algorithm 2 under the generalized linear function approximation setting. We begin with the upper bounds for switching cost and regret.

C.1 Proof of upper bounds

Theorem C.1 (Restate Theorem 6.4).

The global switching cost of Algorithm 2 is bounded by $O(dH\cdot\log K)$ . In addition, with probability $1-\delta$ , the regret of Algorithm 2 over $K$ episodes is bounded by

[TABLE]

Proof of switching cost bound.

Since the feature map in Algorithm 2 satisfies that for all $s,a\in\mathcal{S}\times\mathcal{A}$ , $\phi(s,a)\in\mathbb{B}_{d}=\{x\in\mathbb{R}^{d}:\|x\|_{2}\leq 1\}$ , we have $\|\phi(s,a)\|_{2}\leq 1$ . Therefore, the conclusion of Lemma D.2 still holds, with $d_{h}=d$ for all $h\in[H]$ . In addition, because our policy update rule (line 7 of Algorithm 2) is identical to Algorithm 1, the $O(dH\cdot\log K)$ upper bound of switching cost results from identical proof as in Section A.1, with all $d_{h}$ replaced by $d$ . ∎

Before we prove the upper bound of regret, we state some technical lemmas from Wang et al. [2019].

Lemma C.2 (Corollary 3 of Wang et al. [2019]).

We denote the estimated Q value function of layer $h$ at the $k$ -th episode by $Q_{h}^{k}(\cdot,\cdot)$ . Suppose there exists a function $\text{conf}_{h}^{k}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{+}$ such that for all $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$ ,

[TABLE]

(where $\mathcal{T}_{h}$ is Bellman operator) and the policy $\pi_{k}$ is the greedy policy with respect to $Q_{h}^{k}$ , then with probability at least $1-\delta$ ,

[TABLE]

Lemma C.2 is a standard regret decomposition which will be used to bound the regret of Algorithm 2. Below we give a valid choice of the confidence bound $\text{conf}_{h}^{k}$ . Note that we define $b_{k}$ to be the last policy update before episode $k$ , for all $k\in[K]$ . Therefore, $Q_{h}^{k}=Q_{h}^{b_{k}}$ for all $k\in[K]$ .

Lemma C.3 (Adapted from Lemma 6 of Wang et al. [2019]).

With probability $1-\delta$ , it holds that for all $k,h,s,a\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$ ,

[TABLE]

where $\gamma$ is defined in Algorithm 2.

Therefore, optimism is straightforward.

Lemma C.4 (Corollary 5 of Wang et al. [2019]).

Under the high probability case in Lemma C.3, for all $k,h,s,a\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$ , $Q_{h}^{k}(s,a)\geq Q_{h}^{\star}(s,a)$ .

Combining optimism (Lemma C.4) with Lemma C.3, we have that $Q_{h}^{k}$ in Algorithm 2 satisfies condition (19) with $\text{conf}_{h}^{k}(s,a)=\gamma\|\phi(s,a)\|_{(\Sigma_{h}^{b_{k}})^{-1}}$ . Below we bound the summation of bonus.

Lemma C.5.

Assume that $\text{conf}^{k}_{h}(s,a)=\gamma\|\phi(s,a)\|_{(\Sigma_{h}^{b_{k}})^{-1}}$ , then it holds that

[TABLE]

Proof of Lemma C.5.

[TABLE]

where the first inequality holds according to Cauchy-Schwarz inequality. The second inequality results from Lemma D.3 and the fact that $\det((\Sigma_{h}^{b_{k}})^{-1})=\det(\Sigma_{h}^{b_{k}})^{-1}\leq 2\det(\Sigma_{h}^{k})^{-1}=2\det((\Sigma_{h}^{k})^{-1})$ . The third inequality is because of elliptical potential lemma (Lemma D.4). ∎
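
The step that combines Lemma D.3 with the update rule says that using the stale matrix $\Sigma_{h}^{b_{k}}$ instead of $\Sigma_{h}^{k}$ inflates the squared bonus by at most a factor of $2$ . The sketch below checks this numerically on random matrices. It is an illustration only: the construction of the two matrices, the warm-up length, and the function names are assumptions made here, and the check relies on the standard form of Lemma 12 of Abbasi-Yadkori et al. [2011], namely $x^{\top}Ax/x^{\top}Bx\leq\det(A)/\det(B)$ for $A\succcurlyeq B\succ 0$ .

```python
import numpy as np


def delayed_covariances(d, warmup=50, seed=0):
    """Build (sigma_old, sigma_new) with sigma_new >= sigma_old and det ratio below 2."""
    rng = np.random.default_rng(seed)

    def unit_vector():
        v = rng.normal(size=d)
        return v / np.linalg.norm(v)

    # sigma_old plays the role of the covariance at the last policy update.
    sigma_old = np.eye(d)
    for _ in range(warmup):
        v = unit_vector()
        sigma_old += np.outer(v, v)
    sigma_new = sigma_old.copy()
    while True:
        v = unit_vector()
        candidate = sigma_new + np.outer(v, v)
        # Stop just before the doubling rule would trigger a policy update.
        if np.linalg.det(candidate) >= 2.0 * np.linalg.det(sigma_old):
            return sigma_old, sigma_new
        sigma_new = candidate


def max_sq_norm_ratio(sigma_old, sigma_new, num_probes=1000, seed=1):
    """Largest observed x^T sigma_old^{-1} x / x^T sigma_new^{-1} x over random x."""
    rng = np.random.default_rng(seed)
    inv_old = np.linalg.inv(sigma_old)
    inv_new = np.linalg.inv(sigma_new)
    worst = 0.0
    for _ in range(num_probes):
        x = rng.normal(size=sigma_old.shape[0])
        worst = max(worst, (x @ inv_old @ x) / (x @ inv_new @ x))
    return worst
```

A ratio of at most $2$ for the squared norms corresponds to the factor of $\sqrt{2}$ lost in the bonus between policy updates.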

Now we are ready to present the proof of the regret upper bound.

Proof of regret upper bound.

The final $\widetilde{O}(H\sqrt{d^{3}K})$ regret upper bound is derived by combining Lemma C.2, Lemma C.5 and the definition that $\gamma=\widetilde{O}(d)$ . ∎

C.2 Proof of lower bound

Finally, we present the proof of the lower bound.

Theorem C.6 (Restate Theorem 6.5).

For any algorithm with sub-linear regret bound, the global switching cost is at least $\Omega(dH)$ .

Proof of Theorem C.6.

Since linear MDP is a special case of generalized linear function approximation, the $\Omega(dH)$ lower bound of global switching cost in Gao et al. [2021] holds here. ∎

Appendix D Assisting technical lemmas

Lemma D.1 (Azuma-Hoeffding inequality).

Let $X_{i}$ be a martingale difference sequence such that $X_{i}\in[-A,A]$ for some $A>0$ . Then with probability at least $1-\delta$ , it holds that:

[TABLE]

Lemma D.2 (Lemma C.1 of Wang et al. [2021]).

Let $\{\Sigma_{h}^{k}\}_{(h,k)\in[H]\times[K]}$ be as defined in Algorithm 1. Then for all $h\in[H]$ and $k\in[K]$ , we have $\det(\Sigma_{h}^{k})\leq(\lambda+\frac{k-1}{d_{h}})^{d_{h}}$ .

Lemma D.3 (Lemma 12 of Abbasi-Yadkori et al. [2011]).

Suppose $A,B\in\mathbb{R}^{d\times d}$ are two positive definite matrices satisfying that $A\succcurlyeq B$ , then for any $x\in\mathbb{R}^{d}$ , we have

[TABLE]

Lemma D.4 (Elliptical Potential Lemma, Lemma 26 of Agarwal et al. [2020]).

Consider a sequence of $d\times d$ positive semi-definite matrices $X_{1},\cdots,X_{T}$ with $\max_{t}Tr(X_{t})\leq 1$ and define $M_{0}=I,\cdots,M_{t}=M_{t-1}+X_{t}$ . Then

[TABLE]

Bibliography


Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
Afsar et al. [2021] M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. arXiv preprint arXiv:2101.06286, 2021.
Agarwal et al. [2020] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps. Advances in Neural Information Processing Systems, 33:20095–20107, 2020.
Bai et al. [2019] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
Gao et al. [2021] Minbo Gao, Tianle Xie, Simon S Du, and Lin F Yang. A provably efficient algorithm for linear markov decision process with low switching cost. arXiv preprint arXiv:2101.00494, 2021.
Gao et al. [2019] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019.
Huang et al. [2022] Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, and Tie-Yan Liu. Towards deployment-efficient reinforcement learning: Lower bound and optimality. In International Conference on Learning Representations, 2022.
Jiang et al. [2017] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning-Volume 70, pages 1704–1713, 2017.
