Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs
Dan Qiao, Ming Yin, Yu-Xiang Wang

TL;DR
This paper introduces the ELEANOR-LowSwitching algorithm for reinforcement learning in linear Bellman-complete MDPs, achieving near-optimal regret with a logarithmic switching cost, extending previous work beyond linear MDPs.
Contribution
It presents a new algorithm with logarithmic switching cost for a broader class of MDPs and establishes lower bounds, advancing the understanding of exploration and policy switching costs.
Findings
Achieves near-optimal regret with logarithmic switching cost
Proves a lower bound proportional to dH for switching costs
Extends the approach to generalized linear function approximation
| Algorithms for regret minimization | Setting | Regret bound | Switching cost bound |
|---|---|---|---|
| Our Algorithm 1 (Theorem 4.1)† | Low IBE | ||
| Our Algorithm 2 (Theorem 6.4)⋆ | GLM | ||
| Algorithm 1 of Gao et al. (2021)‡ | Linear MDP | ||
| UCB-Advantage (Zhang et al., 2020) | Tabular MDP | ||
| APEVE (Qiao et al., 2022) | Tabular MDP | ||
| Lower bound (Theorem 4.2) | Low IBE | If “no-regret” | |
| Lower bound (Theorem 6.5) | GLM | If “no-regret” | |
Dan Qiao
Department of Computer Science, UC Santa Barbara
Ming Yin
Department of Computer Science, UC Santa Barbara
Department of Statistics and Applied Probability, UC Santa Barbara
Yu-Xiang Wang
Department of Computer Science, UC Santa Barbara
Abstract
In many real-life reinforcement learning (RL) problems, deploying new policies is costly. In those scenarios, algorithms must solve exploration (which requires adaptivity) while switching the deployed policy sparsely (which limits adaptivity). In this paper, we go beyond the existing state-of-the-art on this problem that focused on linear Markov Decision Processes (MDPs) by considering linear Bellman-complete MDPs with low inherent Bellman error. We propose the ELEANOR-LowSwitching algorithm that achieves near-optimal regret with a switching cost logarithmic in the number of episodes and linear in the time-horizon $H$ and feature dimension $d$. We also prove a lower bound proportional to $dH$ among all algorithms with sublinear regret. In addition, we show that the “doubling trick” used in ELEANOR-LowSwitching can be further leveraged for generalized linear function approximation, under which we design a sample-efficient algorithm with near-optimal switching cost.
1 Introduction
In many real-world reinforcement learning (RL) tasks, limited computing resources make it challenging to apply fully adaptive algorithms that continually update the exploration policy. As a surrogate, it is more cost-effective to collect data in large batches using the current policy and make changes to the policy after the entire batch is completed. For example, in a recommendation system (Afsar et al., 2021), it is easier to gather new data quickly, but deploying a new policy takes longer as it requires significant computing and human resources. Therefore, it’s not feasible to switch policies based on real-time data, as typical RL algorithms would require. A practical solution is to run several experiments in parallel and make decisions on policy updates only after the entire batch has been completed. Similar limitations occur in other RL based applications such as healthcare (Yu et al., 2021), robotics (Kober et al., 2013), and new material design (Zhou et al., 2019), where the agent must minimize the number of policy updates while still learning an effective policy using a similar number of trajectories as fully-adaptive methods. On the theoretical side, Bai et al. (2019) brought up the definition of switching cost, which measures the number of policy updates. In this paper, we measure the adaptivity of online reinforcement learning algorithms via global switching cost, and we leave the formal definition to Section 2.
In recent years, there has been a growing interest in designing online reinforcement learning algorithms with low switching costs (Bai et al., 2019; Zhang et al., 2020; Qiao et al., 2022; Gao et al., 2021; Wang et al., 2021; Kong et al., 2021; Velegkas et al., 2022). While much progress has been made in achieving near-optimal results, most of the research has focused on the tabular MDP setting and the slightly more general linear MDP setting (Yang and Wang, 2019; Jin et al., 2020). However, linear MDP is still a restrictive model, and subsequent works have proposed a variety of more general settings, such as low inherent Bellman error (Zanette et al., 2020), generalized linear function approximation (Wang et al., 2019), low Bellman rank (Jiang et al., 2017), low rank (Agarwal et al., 2020), and low Bellman eluder dimension (Jin et al., 2021). Therefore, it is natural to question whether reinforcement learning with low switching cost is achievable under these more general MDP settings.
Our contributions. In this paper, we extend previous results under linear MDPs to two natural generalizations: linear Bellman-complete MDPs with low inherent Bellman error (Zanette et al., 2020) and MDPs with generalized linear function approximation (Wang et al., 2019). Under both settings, we design algorithms with near-optimal regret and switching cost. Our contributions are three-fold and are summarized below.
- A new algorithm (Algorithm 1) based on “doubling trick” for regret minimization under the low inherent Bellman error setting that achieves global switching cost of and regret of , where is the dimension of feature map for the -th layer, is the inherent Bellman error and is the number of episodes (Theorem 4.1). The regret bound is known to be minimax optimal (Zanette et al., 2020).
- When the inherent Bellman error , we prove a nearly matching switching cost lower bound (Theorem 4.2) for any algorithm with sub-linear regret bound, which implies that the switching cost of our Algorithm 1 is optimal up to factor. When applied to linear MDP, Algorithm 1 achieves the same switching cost and better regret bound compared to the previous results (Gao et al., 2021; Wang et al., 2021).
- We leverage the “doubling trick” used in Algorithm 1 under the generalized linear function approximation setting and propose Algorithm 2 which achieves switching cost of and regret of , where is the dimension of feature map (Theorem 6.4). We also prove a nearly matching switching cost lower bound of for any algorithm with sub-linear regret bound (Theorem 6.5). The pair of results strictly generalize previous results under linear MDP (Gao et al., 2021; Wang et al., 2021).
1.1 Related works
There is a large and growing body of literature on the statistical theory of reinforcement learning that we will not attempt to thoroughly review. Detailed comparisons with existing work on reinforcement learning with low switching cost (Gao et al., 2021; Wang et al., 2021; Zhang et al., 2020; Qiao et al., 2022) are given in Table 1. Notably, the settings we consider are more general than the well-studied tabular and linear MDPs, while our results for regret and switching cost are comparable to or better than the best known results under linear MDP (Gao et al., 2021; Wang et al., 2021). While there are low-adaptivity algorithms under other settings more general than linear MDP, they either consider only pure exploration without a regret guarantee (Jiang et al., 2017; Sun et al., 2019), or achieve sub-optimal results compared to ours (Kong et al., 2021; Velegkas et al., 2022).
In addition to switching cost, there are other measurements of adaptivity. The closest measurement is batched learning, which requires decisions about policy updates to be made at only a few (often predefined) checkpoints but does not constrain the number of policy switches. Batched learning has been considered both under bandits (Perchet et al., 2016; Gao et al., 2019) and RL (Wang et al., 2021; Qiao et al., 2022; Zhang et al., 2022b), although the settings are restricted to tabular MDP or linear MDP. Meanwhile, Matsushima et al. (2020) proposed the notion of deployment efficiency, which is similar to batched RL with the additional requirement that each policy deployment should have similar size. Deployment-efficient RL is studied in several follow-up works (Huang et al., 2022; Qiao and Wang, 2022; Modi et al., 2021). However, as pointed out by Qiao and Wang (2022), deployment complexity is not a good measurement of adaptivity when studying regret minimization.
Technically speaking, we build directly on ELEANOR (Zanette et al., 2020) and Algorithm 1 of Wang et al. (2019), both of which are fully adaptive. To achieve low switching cost, we apply the “doubling trick” when deciding whether to update the exploration policy. In particular, we show that the “information gain” used in previous works under linear MDP (Gao et al., 2021; Wang et al., 2021), namely the determinant of the empirical covariance matrix, can be extended to more general MDPs with linear approximation. Therefore, we only update the exploration policy when the “information gain” doubles, and the switching cost depends only logarithmically on the number of episodes $K$.
2 Problem setup
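Notations. Throughout the paper, for $n \in \mathbb{Z}^{+}$, $[n] = \{1, 2, \cdots, n\}$. We denote $\|x\|_{\Lambda} = \sqrt{x^{\top} \Lambda x}$. For a matrix $X \in \mathbb{R}^{d \times d}$, $\|X\|_2$, $\det(X)$, $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ denote the operator norm, determinant, smallest eigenvalue and largest eigenvalue, respectively. In addition, we use standard notations such as $O$ and $\Omega$ to absorb constants while $\widetilde{O}$ and $\widetilde{\Omega}$ suppress logarithmic factors.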
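Markov Decision Processes. We consider finite-horizon episodic Markov Decision Processes (MDP) with non-stationary transitions, denoted by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, H, P_h, r_h)$ (Sutton and Barto, 1998), where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space and $H$ is the horizon. The non-stationary transition kernel has the form $P_h : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$ with $P_h(s' \mid s, a)$ representing the probability of transition from state $s$, action $a$ to next state $s'$ at time step $h$. In addition, $r_h(s, a)$ denotes the corresponding distribution of reward (we overload the notation so that $r_h$ also denotes the expected immediate reward function). Without loss of generality, we assume there is a fixed initial state $s_1$ (the generalized case where the initial distribution is arbitrary can be recovered from this setting by adding one layer to the MDP). A policy can be seen as a series of mappings $\pi = (\pi_1, \cdots, \pi_H)$, where each $\pi_h$ maps each state $s \in \mathcal{S}$ to a probability distribution over actions, i.e. $\pi_h : \mathcal{S} \mapsto \Delta(\mathcal{A})$, $\forall h \in [H]$. A random trajectory $(s_1, a_1, r_1, \cdots, s_H, a_H, r_H, s_{H+1})$ is generated by the following rule: $s_1$ is fixed, $a_h \sim \pi_h(\cdot \mid s_h)$, $r_h \sim r_h(s_h, a_h)$, $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$, $\forall h \in [H]$. For normalization, we assume that $\sum_{h=1}^{H} r_h \in [0, 1]$ almost surely.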
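$Q$-values, Bellman operator. Given a policy $\pi$ and any $h \in [H]$, the value function $V_h^{\pi}(\cdot)$ and $Q$-value function $Q_h^{\pi}(\cdot, \cdot)$ are defined as $V_h^{\pi}(s) = \mathbb{E}_{\pi}[\sum_{t=h}^{H} r_t \mid s_h = s]$ and $Q_h^{\pi}(s, a) = \mathbb{E}_{\pi}[\sum_{t=h}^{H} r_t \mid s_h = s, a_h = a]$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Besides, the value function and $Q$-value function with respect to the optimal policy $\pi^{\star}$ are denoted by $V_h^{\star}(\cdot)$ and $Q_h^{\star}(\cdot, \cdot)$. Then the Bellman operator $\mathcal{T}_h$ applied to $Q_{h+1}$ is defined as
$$\mathcal{T}_h(Q_{h+1})(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)} \max_{a'} Q_{h+1}(s', a').$$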
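As a concrete illustration of the Bellman operator, the following minimal Python sketch applies one backup to a small tabular MDP. It assumes the standard optimality form of the operator (expected immediate reward plus the expected maximum of the next-step Q-values); the function name `bellman_backup`, the array layout and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def bellman_backup(r_h, P_h, Q_next):
    """Apply the Bellman operator at step h to Q_{h+1} in a tabular MDP.

    r_h:    (S, A) expected immediate rewards at step h
    P_h:    (S, A, S) transition probabilities at step h
    Q_next: (S, A) action values at step h + 1
    """
    V_next = Q_next.max(axis=1)   # V_{h+1}(s') = max_{a'} Q_{h+1}(s', a')
    return r_h + P_h @ V_next     # r_h(s, a) + E_{s'}[V_{h+1}(s')]

# Two states and two actions; every number below is made up.
r = np.array([[0.0, 0.1],
              [0.2, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
Q_next = np.array([[0.3, 0.1],
                   [0.0, 0.4]])
Q_h = bellman_backup(r, P, Q_next)
```

In the function-approximation settings studied here the operator acts on parametrized Q-functions rather than tables, so this sketch only conveys the shape of the computation.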
Regret. We measure the performance of online reinforcement learning algorithms by the regret. The regret of an algorithm over $K$ episodes is defined as
$$\mathrm{Regret}(K) := \sum_{k=1}^{K} \left[ V_1^{\star}(s_1) - V_1^{\pi_k}(s_1) \right],$$
where $\pi_k$ is the policy it deploys at episode $k$. Besides, we denote the total number of steps by $T := KH$.
Switching cost. We adopt the global switching cost (Bai et al., 2019), which simply measures how many times the algorithm changes its policy:
$$N_{\mathrm{switch}} := \sum_{k=1}^{K-1} \mathbb{1}\{\pi_k \neq \pi_{k+1}\}.$$
Global switching cost is a widely applied measurement of the adaptivity of an online RL algorithm both under the tabular setting (Bai et al., 2019; Zhang et al., 2020; Qiao et al., 2022) and the linear MDP setting (Gao et al., 2021; Wang et al., 2021). Similar to previous works, our algorithm also uses deterministic policies only.
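The global switching cost can be computed directly from the sequence of deployed policies. The sketch below assumes deterministic policies that can be compared for equality; the tuple encoding and the name `global_switching_cost` are hypothetical choices for illustration.

```python
def global_switching_cost(policies):
    """Count the consecutive episode pairs whose deployed policies differ."""
    return sum(1 for prev, nxt in zip(policies, policies[1:]) if prev != nxt)

# Each deterministic policy is encoded as a tuple: entry s is the action in state s.
deployed = [(0, 1), (0, 1), (1, 1), (1, 1), (1, 1), (0, 0)]
n_switch = global_switching_cost(deployed)
```

A fully adaptive algorithm may change its policy after every episode, so its switching cost can be as large as the number of episodes minus one, whereas a batched schedule like the one above changes it only occasionally.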
2.1 Low inherent Bellman error
In this part, we introduce the linear function approximation, the definition of inherent Bellman error (Zanette et al., 2020) and the connection between the low inherent Bellman error setting and the linear MDP setting (Jin et al., 2020).
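To encode linear function approximation of the state space $\mathcal{S}$, a common approach is to define a feature map $\phi_h : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^{d_h}$, which can differ across timesteps. The $Q$-value functions are then represented as linear functions of $\phi_h$, i.e., $Q_h(s, a) = \phi_h(s, a)^{\top} \theta_h$ for some $\theta_h \in \mathbb{R}^{d_h}$.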
The feasible parameter class for timestep is defined as
[TABLE]
which is consistent with our assumption that .
For each feasible parameter , the corresponding -value function and value function are defined as
[TABLE]
Meanwhile, the associated function spaces are
[TABLE]
Similar to Zanette et al. (2020), we make the following normalization assumption, which is without loss of generality.
[TABLE]
[TABLE]
Inherent Bellman error. For provably efficient learning, completeness assumption is widely adopted (Zanette et al., 2020; Wang et al., 2020; Jin et al., 2021). In this paper, we characterize the completeness by assuming an upper bound of the projection error when we project () to . Formally, we have the following definition of inherent Bellman error.
**Definition 2.1.**
The inherent Bellman error of an MDP with a known linear feature map is defined as the maximum over the timesteps of
[TABLE]
Similar to Zanette et al. (2020), we assume the inherent Bellman error of the MDP is upper bounded by some (known) constant . Below we will show that this setting strictly generalizes the linear MDP setting (Jin et al., 2020).
Connections to linear MDP. Since linear MDP admits transition kernel and reward function that are linear in a known feature map , for any function , is a linear function of (Jin et al., 2020). Therefore, a linear MDP with feature map and dimension is a special case of the low inherent Bellman error setting with , and (if ignoring the scale of rewards). More importantly, it is shown that an MDP with zero inherent Bellman error ($\mathcal{I} = 0$) may not be a linear MDP (Zanette et al., 2020), which means that the setting in this paper is strictly more general and technically demanding than linear MDP. For more discussions about the low inherent Bellman error setting and relevant comparisons, please refer to Section 3 in Zanette et al. (2020).
3 Main algorithm
In this section, we propose our main algorithm: ELEANOR-LowSwitching (Algorithm 1) and the low switching design for global optimism-based algorithms.
We begin with the standard LSVI technique. At the beginning of the -th episode, assume the parameter for the -th layer is fixed to be . Then LSVI minimizes the following objective function with respect to :
[TABLE]
where is short for and is the reward encountered at layer of the -th episode. The minimization problem (1) has a closed form solution:
[TABLE]
where is the empirical covariance matrix.
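The closed-form LSVI solution is a ridge regression, which the following Python sketch illustrates. It assumes the objective is a regularized least-squares fit of the features to targets of the form "reward plus next-layer value", with a regularization weight `lam`; the names `lsvi_solve`, `Phi`, `targets` and `lam` are hypothetical, and the random data only stands in for collected trajectories.

```python
import numpy as np

def lsvi_solve(Phi, targets, lam=1.0):
    """Ridge-regression solution of one LSVI step (assumed form).

    Phi:     (n, d) features of the state-action pairs seen at layer h
    targets: (n,)   regression targets, reward plus next-layer value
    """
    d = Phi.shape[1]
    Sigma = Phi.T @ Phi + lam * np.eye(d)             # regularized empirical covariance
    theta = np.linalg.solve(Sigma, Phi.T @ targets)   # Sigma^{-1} sum_i phi_i y_i
    return theta, Sigma

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
targets = rng.uniform(size=50)
theta_hat, Sigma = lsvi_solve(Phi, targets, lam=1.0)

# Gradient of sum_i (phi_i^T theta - y_i)^2 + lam * ||theta||^2 at theta_hat.
grad = 2 * Phi.T @ (Phi @ theta_hat - targets) + 2 * 1.0 * theta_hat
```

The gradient vanishes at `theta_hat`, confirming that the closed form is the minimizer of the assumed objective.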
Based on the standard LSVI, we introduce the global optimistic planning below, where an optimization problem is solved to derive the most optimistic estimate of the Q-value function at the initial state. At each episode where the policy is updated, Algorithm 1 solves the following problem.
**Definition 3.1 (Optimistic planning).**
[TABLE]
Definition 3.1 optimizes over the perturbation added to the least square solution . The constraint on is
[TABLE]
where the definition of will be specified in Appendix A.2. As will be shown in the analysis, the first term accounts for the estimation error of the LSVI, while the second term accounts for the model misspecification (recall that is inherent Bellman error). Finally, with high probability, there will be a valid solution of the optimization problem (details in Appendix A.2), and therefore Algorithm 1 is well posed.
About global optimism. We highlight that the optimization problem aims at being optimistic only at the initial state instead of choosing a value function everywhere optimistic, as in LSVI-UCB (Jin et al., 2020). Such global optimism effectively keeps the linear structure of our function class and reduces the dimension of the covering set, since we do not need to cover the quadratic bonus as in Jin et al. (2020).
Algorithmic design. We present the whole learning process in Algorithm 1. For linear function approximation, we characterize the “information gain” (the information we learned from interacting with the MDP) through the determinant of the empirical covariance matrix (line 5). To achieve low switching cost, we only update the exploration policy when the “information gain” doubles for some layer (line 7), and each update means the information about some layer has doubled. As will be shown later, such “doubling schedule” will lead to a switching cost depending only logarithmically on , in stark contrast to its fully adaptive counterpart: ELEANOR (Zanette et al., 2020). When an update occurs, Algorithm 1 solves the optimization problem to derive ensuring global optimism (line 9), takes the greedy policy with respect to (line 10) and updates the empirical covariance matrix (line 11).
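The doubling schedule described above can be sketched in a few lines of Python. This is not Algorithm 1 itself: the planning step is omitted, the features are random placeholders rather than data collected by a policy, and the names `run_doubling_schedule`, `features` and `lam` are hypothetical. The sketch only shows how comparing determinants against their values at the last update keeps the number of updates small.

```python
import numpy as np

def run_doubling_schedule(features, lam=1.0):
    """Return the episodes at which a determinant-doubling rule triggers an update.

    features: (K, H, d) array; features[k, h] stands in for the feature
              observed at layer h of episode k.
    """
    K, H, d = features.shape
    Sigma = np.stack([lam * np.eye(d) for _ in range(H)])   # one covariance per layer
    logdet_at_update = np.linalg.slogdet(Sigma)[1]          # log-dets at the last update
    updates = []
    for k in range(K):
        logdet_now = np.linalg.slogdet(Sigma)[1]
        # Update only if the determinant has doubled for some layer.
        if np.any(logdet_now >= logdet_at_update + np.log(2.0)):
            updates.append(k)          # this is where re-planning would happen
            logdet_at_update = logdet_now
        for h in range(H):             # collect data with the current policy
            phi = features[k, h]
            Sigma[h] += np.outer(phi, phi)
    return updates

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 3, 4))
feats /= np.maximum(1.0, np.linalg.norm(feats, axis=-1, keepdims=True))  # norms at most 1
updates = run_doubling_schedule(feats)
```

With feature norms at most one, the number of updates in this sketch grows with the logarithm of the number of episodes, while a fully adaptive schedule would update in every episode.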
Generalization over previous algorithms. If we remove the update rule in Algorithm 1 and solve Definition 3.1 at every episode, Algorithm 1 degenerates to ELEANOR (Zanette et al., 2020). Compared to ELEANOR, Algorithm 1 achieves the same regret bound (shown later) together with a near-optimal switching cost. Meanwhile, Algorithm 1 also strictly generalizes the RARELY SWITCHING OFUL algorithm (Abbasi-Yadkori et al., 2011) designed for linear bandits. Taking $H = 1$, both Algorithm 1 and our guarantees (for regret and switching cost) subsume those of RARELY SWITCHING OFUL. In conclusion, we show that low switching cost is possible for RL algorithms with global optimism.
Computational efficiency. Although Algorithm 1 is shown to be near optimal both in regret and switching cost, the implementation of the optimization problem is inefficient in general. This is because the max operator breaks the quadratic structure of the constraints. Such issue also exists for our fully adaptive counterpart: ELEANOR (Zanette et al., 2020), and other algorithms based on global optimism (Jiang et al., 2017; Sun et al., 2019; Jin et al., 2021). We leave the improvement of computation as future work.
4 Main results
In this section, we present our main results. We begin with the upper bounds for regret and switching cost. Recall that we assume almost surely, while represents the dimension of the feature map for the -th layer and is inherent Bellman error.
**Theorem 4.1 (Main theorem).**
The global switching cost of Algorithm 1 is bounded by . In addition, with probability , the regret of Algorithm 1 over episodes is bounded by
[TABLE]
The proof of Theorem 4.1 is sketched in Section 5.1, with details in the Appendix. Below we discuss several interesting aspects of Theorem 4.1.
Near-optimal switching cost. Our algorithm achieves a switching cost that depends logarithmically on , which improves the switching cost of ELEANOR (Zanette et al., 2020). We also prove the following information-theoretic limit which says that the switching cost of Algorithm 1 is optimal up to logarithmic factors. Since it is impossible to get sub-linear regret bound with positive inherent Bellman error, we only consider the case where .
**Theorem 4.2 (Lower bound for no-regret learning).**
Assume that the inherent Bellman error and for all , for any algorithm with sub-linear regret bound, the global switching cost is at least .
The proof of Theorem 4.2 is sketched in Section 5.2 with details in the Appendix.
Application to linear MDP. As discussed in Section 2.1, linear MDP with dimension is a special case of the low inherent Bellman error setting with , . Therefore, when applied to linear MDP, our Algorithm 1 will have switching cost bounded by and regret bounded by , where . (When transferring Theorem 4.1 to linear MDP, we need to rescale the reward function by , and therefore there will be an additional factor of in our regret bound.) Compared to current algorithms achieving low switching cost under linear MDP (Gao et al., 2021; Wang et al., 2021), we achieve the same switching cost and a regret bound better by a factor of . The improvement in the regret bound results from global optimism and a smaller linear function class. More importantly, the low inherent Bellman error setting is indeed a harder setting than linear MDP. According to Theorem 2 in Zanette et al. (2020), the regret of our Algorithm 1 is minimax optimal. Together with the lower bound on switching cost (Theorem 4.2), Theorem 4.1 is generally not improvable in either regret or global switching cost.
Application to misspecified linear bandits. Taking , an MDP with low inherent Bellman error will become a linear bandit (Lattimore and Szepesvári, 2020) with model misspecification. For simplicity, we only consider the case where there is no misspecification (i.e. ), as studied in Abbasi-Yadkori et al. (2011). Our result is summarized in the following corollary.
**Corollary 4.3 (Results under linear bandit).**
Suppose and , then the MDP reduces to a linear bandit with dimension . Our Algorithm 1 will reduce to the RARELY SWITCHING OFUL algorithm (Figure 3 in Abbasi-Yadkori et al. (2011)) and is computationally efficient. The global switching cost of Algorithm 1 is , while the regret can be bounded by with high probability.
The above corollary is derived by directly plugging and in Theorem 4.1. Note that our Corollary 4.3 matches the results in Abbasi-Yadkori et al. (2011), and our Algorithm 1 can be applied under the more general case with model misspecification. Therefore, our results can be seen as strict generalization of Abbasi-Yadkori et al. (2011).
5 Proof sketch
Due to the space constraint, we sketch the proof in this section while more details are deferred to the Appendix. We begin with the proof overview of Theorem 4.1.
5.1 Upper bounds
Upper bound of switching cost. Let be the episodes where the algorithm updates the policy (N is the global switching cost), and we also define .
According to the update rule (line 7 of Algorithm 1), every time the policy is updated, at least one doubles, which implies that for all . This further implies
[TABLE]
Since the left hand side can be upper bounded by (details in Lemma D.2) and the right hand side is just (from definition), the global switching cost (i.e. ) is bounded by .
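The counting argument above can be written out as a short derivation. This is a sketch of the standard determinant-doubling argument rather than the paper's exact statement: the notation $k_1 < \dots < k_N$ for the update episodes, $\Sigma_h^{k}$ for the layer-$h$ empirical covariance at the start of episode $k$, the initialization $\Sigma_h^{1} = \lambda I$, and the normalization $\|\phi_h(s,a)\|_2 \le 1$ are assumptions made here, and the constants may differ from those in Lemma D.2.

```latex
\begin{align*}
2^{N} \prod_{h=1}^{H} \det\bigl(\Sigma_h^{1}\bigr)
  &\le \prod_{h=1}^{H} \det\bigl(\Sigma_h^{k_N}\bigr)
  && \text{each update doubles at least one factor, and no factor ever decreases} \\
  &\le \prod_{h=1}^{H} \Bigl(\lambda + \frac{K}{d_h}\Bigr)^{d_h}
  && \text{AM-GM on the eigenvalues, using } \operatorname{tr}\bigl(\Sigma_h^{k}\bigr) \le \lambda d_h + K .
\end{align*}
Since $\det(\Sigma_h^{1}) = \lambda^{d_h}$, taking $\log_2$ of both sides gives
\[
N \le \sum_{h=1}^{H} d_h \log_2\Bigl(1 + \frac{K}{\lambda d_h}\Bigr),
\]
which is logarithmic in $K$ for any fixed $\lambda > 0$.
```

The dependence on the feature dimensions is linear because each layer contributes one factor of the form $(\lambda + K/d_h)^{d_h}$ to the product.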
Below we give a proof overview of the regret bound.
Upper bound of regret. We denote , where is the solution of Definition 3.1 at the -th episode. Similarly, . In addition, let denote the last policy update before episode , for all .
Based on concentration inequalities of self-normalized processes, we can show that with high probability, the “best feasible” approximant parameter (Definition A.3) is a feasible solution of Definition 3.1. Therefore, the is always a nearly optimistic estimate of (summarized in Lemma A.5) and we only need to bound
[TABLE]
Meanwhile, the pointwise Bellman error can be bounded as (this result is stated in Lemma A.6)
[TABLE]
where .
As a result, applying regret decomposition across different layers and bounding the martingale difference by the Azuma-Hoeffding inequality (Lemma D.1), we have
[TABLE]
Due to our update rule based on , we have
[TABLE]
where the second inequality holds because of Lemma D.3 and our update rule. The third inequality is from elliptical potential lemma (Lemma D.4).
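The elliptical potential lemma used in the last step can be checked numerically. Lemma D.4 is not reproduced in this section, so the sketch below assumes its standard form: for features with norm at most one and a regularizer of at least one, the sum of squared feature norms measured in the inverse running covariance is at most twice the log-determinant ratio. The variable names and random features are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, lam = 500, 5, 1.0
phis = rng.normal(size=(K, d))
phis /= np.maximum(1.0, np.linalg.norm(phis, axis=1, keepdims=True))  # norms at most 1

Sigma = lam * np.eye(d)
potential = 0.0
for phi in phis:
    # Squared norm of the new feature in Sigma^{-1}, measured before the update.
    potential += phi @ np.linalg.solve(Sigma, phi)
    Sigma += np.outer(phi, phi)

# log(det(Sigma_K) / det(Sigma_0)) with Sigma_0 = lam * I.
log_det_ratio = np.linalg.slogdet(Sigma)[1] - d * np.log(lam)
bound = 2.0 * log_det_ratio
```

Because the log-determinant ratio itself grows only logarithmically in the number of episodes, the accumulated bonus terms stay small, which is what drives the square-root dependence of the regret on the number of episodes.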
Finally, the regret bound results from plugging (6) into (5).
5.2 Lower bound
In this part, we sketch the proof of Theorem 4.2.
We construct a hard MDP case with zero inherent Bellman error (), which has deterministic transition kernel. Therefore, deploying some deterministic policy will lead to a deterministic trajectory, like pulling an “arm” in the multi-armed bandits (MAB) setting. We further show that the number of such “arms” is at least . Together with the lower bounds of switching cost in multi-armed bandits (Qiao et al., 2022), we can derive the lower bound under the low inherent Bellman error setting.
6 Extension to generalized linear function approximation
In this section, we consider low adaptive reinforcement learning with generalized linear function approximation (Wang et al., 2019). We show that the same “doubling schedule” for updating exploration policy (line 7 of Algorithm 1) can be leveraged under this setting, which enables the design of provably efficient algorithms. We begin with the introduction of generalized linear function approximation.
6.1 Problem setup
Different from the low inherent Bellman error setting which characterizes using linear functions, we use a function class of generalized linear models (GLMs) to model . We denote the dimension of feature map by and define .
**Definition 6.1 (GLM (Wang et al., 2019)).**
For a known feature map and a known link function , the class of generalized linear models is .
Similar to Wang et al. (2019), we make the following standard assumption which is without loss of generality.
**Assumption 6.2.**
*The link function is either monotonically increasing or decreasing. Furthermore, there exist absolute constants and such that and , for all .*
This assumption is naturally satisfied by the identity map and also includes other non-linear maps such as the logistic map $f(z) = 1/(1 + e^{-z})$.
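To see why the logistic map fits this kind of assumption, the following Python sketch evaluates its first and second derivatives on a bounded interval. The interval $|z| \le 1$ and the names `kappa`, `K_up` and `M` for the derivative bounds are assumptions made for illustration, following the usual statement of such regularity conditions for generalized linear models.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-1.0, 1.0, 2001)   # assumed domain |z| <= 1
s = logistic(z)
f1 = s * (1.0 - s)                 # first derivative f'(z)
f2 = f1 * (1.0 - 2.0 * s)          # second derivative f''(z)

kappa = f1.min()                   # lower bound on |f'| over the grid
K_up = f1.max()                    # upper bound on |f'| over the grid
M = np.abs(f2).max()               # upper bound on |f''| over the grid
```

On this interval the first derivative stays bounded away from zero and the second derivative stays bounded, so the logistic map is monotone with well-controlled curvature; the identity map corresponds to a first derivative of one and a second derivative of zero.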
To characterize completeness under this function class, Wang et al. (2019) assumes the function class is closed with respect to the Bellman operator (defined in Section 2). Similarly, we make the same optimistic closure assumption below. Note that for a fixed constant (which will be set to depend polynomially on and ), the enlarged function class is defined as
[TABLE]
Then the optimistic closure assumption is stated below.
**Assumption 6.3.**
For all and , we have .
According to Proposition 1 of Wang et al. (2019), this assumption strictly generalizes the standard linear MDP setting by allowing link functions with more expressivity.
6.2 Low switching algorithm
We present our Algorithm 2 below. Intuitively speaking, the algorithmic idea is to apply doubling schedule to Algorithm 1 of Wang et al. (2019). Similar to Algorithm 1, we only update the exploration policy when the “information gain” with respect to some layer has doubled (line 7). When the policy is updated, the LSVI step calculates an estimate of (the parameter w.r.t. the real function) iteratively from the -th layer to the first layer through minimizing (7). Then the optimistic value function is constructed by adding a bonus term to the empirical estimate (line 11). Finally, the greedy policy is deployed for collecting data (line 12, 18).
6.3 Main results of Algorithm 2
In this part, we state the main results about Algorithm 2. We begin with the upper bounds for regret and switching cost. Recall that we still assume almost surely, while represents the dimension of the feature map.
**Theorem 6.4 (Main results).**
The global switching cost of Algorithm 2 is bounded by . In addition, with probability , the regret of Algorithm 2 over episodes is bounded by
[TABLE]
The proof of Theorem 6.4 is deferred to Appendix C due to space limits. Below we discuss several interesting aspects of Theorem 6.4.
Near-optimal switching cost. Our algorithm achieves a switching cost that depends logarithmically on , which improves the switching cost of Algorithm 1 in Wang et al. (2019). We also prove the following information-theoretic limit which says that the switching cost of Algorithm 2 is optimal up to logarithmic factors.
**Theorem 6.5 (Lower bound for no-regret learning).**
For any algorithm with sub-linear regret bound, the global switching cost is at least .
Theorem 6.5 is adapted from the lower bound for global switching cost under linear MDP (Gao et al., 2021), and we leave the proof to Appendix C.
Generalization over previous results. The closest result to our Algorithm 2 is the fully adaptive Algorithm 1 of Wang et al. (2019), which achieves the same regret bound. In comparison, our Algorithm 2 additionally attains a near-optimal global switching cost, which saves computation and accelerates the learning process.
When applying our Algorithm 2 to the linear MDP case, our Theorem 6.4 will imply a regret bound of () and a global switching cost of , which recovers the results in Gao et al. (2021); Wang et al. (2021). (The identity link function corresponds to and . In addition, due to rescaling of reward functions, there will be an additional factor in the regret bound of Theorem 6.4.) Therefore, our result can be considered a generalization of these two results since GLMs allow more general function classes.
7 Conclusion and future work
This paper studied the well-motivated problem of online reinforcement learning with low switching cost. Under linear Bellman-complete MDP with low inherent Bellman error, we designed an algorithm (Algorithm 1) with near optimal regret bound of and global switching cost bound of . In addition, we proved a (nearly) matching global switching cost lower bound for any algorithm with sub-linear regret. We also leveraged the same “doubling trick” under the generalized linear function approximation setting and designed a sample-efficient algorithm (Algorithm 2) with near-optimal switching cost.
Although more general than linear MDP, the two settings we consider are not the most general ones. The low Bellman eluder dimension setting (Jin et al., 2021) and MDPs with differentiable function approximation (Zhang et al., 2022a) can be considered generalizations of the two settings in this paper, respectively. Therefore, our results can be viewed as an intermediate step towards low-switching reinforcement learning under more general MDP settings. It will be interesting to find out whether low switching cost RL is possible under such settings (e.g., low Bellman eluder dimension (Jin et al., 2021), differentiable function class (Zhang et al., 2022a; Yin et al., 2023)), and we leave these as future work.
Acknowledgments
The research is partially supported by NSF Award #2007117.
Appendix A Proof of Theorem 4.1
In this section, we prove our main theorem. We first restate Theorem 4.1 below, and then prove the bounds for switching cost and regret in Section A.1 and Section A.2, respectively.
**Theorem A.1 (Restate Theorem 4.1).**
The global switching cost of Algorithm 1 is bounded by . In addition, with probability , the regret of Algorithm 1 over episodes is bounded by
[TABLE]
A.1 Proof of switching cost bound
Proof of switching cost bound.
Let be the episodes where the algorithm updates the policy, and we also define .
According to the update rule (line 7 of Algorithm 1), for all , there exists some such that
[TABLE]
In addition, for all , we have
[TABLE]
Combining these two results, we have for all ,
[TABLE]
Therefore, it holds that
[TABLE]
where the first inequality is because of Lemma D.2 and our choice that . The second inequality is due to recursive application of (8). The last equation holds since we have for all .
Solving (9), we have , and therefore the proof is complete. ∎
A.2 Proof of regret bound
We first state some technical lemmas from Zanette et al. [2020]. We begin with the following bound on failure probability.
Lemma A.2 (Lemma 2 of Zanette et al. [2020]).
With probability at least , for all , , ,
[TABLE]
where .
Next, we define the “best” feasible parameters, which approximate the values well and will be shown to be a feasible solution of the optimization problem (Definition 3.1). We then state the accuracy bound of .
Definition A.3 (Best feasible approximant, Definition 4 of Zanette et al. [2020]).
We recursively define the best approximant parameter for as:
[TABLE]
with ties broken arbitrarily and .
Lemma A.4 (Accuracy Bound of , Lemma 6 of Zanette et al. [2020]).
It holds that for all :
[TABLE]
For notational simplicity, for which is the solution of Definition 3.1, we denote . Besides, represents where is the solution at the -th episode. Similarly, and . In addition, let denote the last policy update before episode , for all .
Lemma A.5 (Optimism, Lemma 7 of Zanette et al. [2020]).
Under the high probability case in Lemma A.2, if we choose , then , for all is a feasible solution of the optimization problem (Definition 3.1). Therefore, for all , the optimistic value function satisfies
[TABLE]
In addition to optimism, we also have the following upper bound on the Bellman error.
Lemma A.6 (Bound on the Bellman error, Lemma 1 of Zanette et al. [2020]).
Under the high probability case in Lemma A.2, it holds that for all ,
[TABLE]
Now we are ready to present the regret analysis of Algorithm 1.
Proof of regret bound.
We condition on the high-probability event of Lemma A.2.
First of all, the regret over episodes can be decomposed as
[TABLE]
where the last inequality results from Lemma A.5.
Note that due to our choice of , it holds that for all ,
[TABLE]
where the inequality holds because of Lemma A.6.
Plugging (16) into (15), we have with probability ,
[TABLE]
where the second inequality follows from (16). The last inequality holds with high probability by the Azuma-Hoeffding inequality (Lemma D.1) and the fact that for any .
Finally, it holds that
[TABLE]
where the second inequality holds by the Cauchy-Schwarz inequality and the fact that is non-decreasing in . The third inequality results from Lemma D.3 and the fact that . The fourth inequality follows from the elliptical potential lemma (Lemma D.4). The fifth inequality is derived from the definition of (from Lemma A.5). The last inequality comes from direct calculation.
The regret analysis is complete. ∎
Appendix B Proof of Theorem 4.2
In this section, we prove our lower bound on the switching cost.
Theorem B.1 (Restatement of Theorem 4.2).
Assume that the inherent Bellman error and for all . Then for any algorithm with a sub-linear regret bound, the global switching cost is at least .
We first briefly discuss our assumptions. We assume zero inherent Bellman error (i.e. ) because sub-linear regret bounds can be derived only if ; otherwise, the regret bound is always linear in . Since our goal is a lower bound on the switching cost of algorithms with sub-linear regret, this is the relevant regime. Also, the assumption on for all is without loss of generality.
Proof of Theorem B.1.
We first construct an MDP with two states, the initial state and the absorbing state .
At the absorbing state , the only available action is , while at the initial state , the available actions at layer are . Then we define the -dimensional feature map for the -th layer:
[TABLE]
where for (), the -th element is while all other elements are [math].
We now define the transition kernel and reward function as , , , for all . Besides, , for all and , where ’s are unknown non-zero values. Note that such an MDP has zero inherent Bellman error () since the function class includes all possible Q-value functions.
Therefore, for any deterministic policy, the only possible case is that the agent takes action and stays at for the first steps, then at step the agent takes action () and transitions to with reward , after which the agent stays at and receives no further reward. For this trajectory, the total reward is . Also, for any deterministic policy, the trajectory is fixed, so deploying the policy is like pulling an “arm” in the multi-armed bandit setting. Note that the total number of such “arms” with non-zero unknown reward is at least due to our assumption that . Even if the transition kernel is known to the agent, this MDP is still as difficult as a multi-armed bandit problem with arms. Together with Lemma B.2 below, the proof is complete. ∎
Lemma B.2 (Lemma H.4 of Qiao et al. [2022]).
For any algorithm with a sub-linear regret bound on the -armed bandit problem, the switching cost is at least .
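The reduction from deterministic policies to bandit arms in the proof of Theorem B.1 can be made concrete with a small enumeration. This is only a sketch under assumptions that are not taken from the paper: every layer before the last offers one "stay" action plus the same number of "exit" actions, and the last layer offers exit actions only, so that every deterministic policy leaves the initial state exactly once. The names `STAY`, `exit_arm`, and `enumerate_arms` are hypothetical.

```python
from itertools import product

STAY = 0  # hypothetical label for the action that keeps the agent in the initial state


def exit_arm(policy):
    """Return (layer, action) at which a deterministic policy leaves the initial state."""
    for layer, action in enumerate(policy, start=1):
        if action != STAY:
            return (layer, action)
    raise ValueError("policy never leaves the initial state")


def enumerate_arms(horizon, exits_per_layer):
    """List all deterministic policies on the initial state and the arms they induce."""
    # Assumption: layers 1..H-1 offer STAY plus the exit actions; layer H offers exits only.
    with_stay = range(0, exits_per_layer + 1)
    exits_only = range(1, exits_per_layer + 1)
    choices = [with_stay] * (horizon - 1) + [exits_only]
    policies = list(product(*choices))
    arms = {exit_arm(policy) for policy in policies}
    return policies, arms
```

For example, `enumerate_arms(3, 2)` lists 18 deterministic policies that collapse to only 6 distinct (layer, action) pairs. Each pair yields a fixed trajectory with its own unknown reward, which is why the problem behaves like a multi-armed bandit whose number of arms is the total number of exit actions across layers.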
Appendix C Proof for Section 6
In this section, we prove the theorems regarding our Algorithm 2 under the generalized linear function approximation setting. We begin with the upper bounds for switching cost and regret.
C.1 Proof of upper bounds
Theorem C.1 (Restatement of Theorem 6.4).
The global switching cost of Algorithm 2 is bounded by . In addition, with probability , the regret of Algorithm 2 over episodes is bounded by
[TABLE]
Proof of switching cost bound.
Since the feature map in Algorithm 2 satisfies that for all , , we have . Therefore, the conclusion of Lemma D.2 still holds, with for all . In addition, because the policy update rule (line 7 of Algorithm 2) is identical to that of Algorithm 1, the upper bound on the switching cost follows from the same proof as in Section A.1, with all replaced by . ∎
Before we prove the upper bound of regret, we state some technical lemmas from Wang et al. [2019].
Lemma C.2 (Corollary 3 of Wang et al. [2019]).
We denote the estimated Q value function of layer at the -th episode by . Suppose there exists a function such that for all ,
[TABLE]
(where is the Bellman operator) and the policy is the greedy policy with respect to , then with probability at least ,
[TABLE]
Lemma C.2 is a standard regret decomposition which will be used to bound the regret of Algorithm 2. Below we give a valid choice of the confidence bound . Note that we define to be the last policy update before episode , for all . Therefore, for all .
Lemma C.3 (Adapted from Lemma 6 of Wang et al. [2019]).
With probability , it holds that for all ,
[TABLE]
where is defined in Algorithm 2.
Optimism then follows directly.
Lemma C.4 (Corollary 5 of Wang et al. [2019]).
Under the high probability case in Lemma C.3, for all , .
Combining optimism (Lemma C.4) with Lemma C.3, we have that in Algorithm 2 satisfies condition (19) with . Below we bound the sum of the bonuses.
Lemma C.5.
Assume that , then it holds that
[TABLE]
Proof of Lemma C.5.
[TABLE]
where the first inequality holds by the Cauchy-Schwarz inequality. The second inequality results from Lemma D.3 and the fact that . The third inequality follows from the elliptical potential lemma (Lemma D.4). ∎
Now we are ready to present the proof of the regret upper bound.
Proof of regret upper bound.
The final regret upper bound is derived by combining Lemma C.2, Lemma C.5 and the definition that . ∎
C.2 Proof of lower bound
Finally, we present the proof of the lower bound.
Theorem C.6 (Restatement of Theorem 6.5).
For any algorithm with a sub-linear regret bound, the global switching cost is at least .
Proof of Theorem C.6.
Since the linear MDP is a special case of generalized linear function approximation, the lower bound on the global switching cost in Gao et al. [2021] holds here. ∎
Appendix D Assisting technical lemmas
Lemma D.1 (Azuma-Hoeffding inequality).
Let be a martingale difference sequence such that for some . Then with probability at least , it holds that:
[TABLE]
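For reference, one standard two-sided form of the Azuma-Hoeffding inequality is sketched below. The symbols n, A, and \delta are illustrative assumptions (a martingale difference sequence X_1, ..., X_n with |X_i| \le A), and the constants in the display of Lemma D.1 may be stated differently.

```latex
% Assumption: X_1, ..., X_n is a martingale difference sequence with |X_i| <= A almost surely.
\Pr\left( \left| \sum_{i=1}^{n} X_i \right| \ge t \right)
  \le 2 \exp\!\left( -\frac{t^{2}}{2 n A^{2}} \right)
\quad \Longrightarrow \quad
\left| \sum_{i=1}^{n} X_i \right| \le A \sqrt{2 n \log(2/\delta)}
  \ \text{ with probability at least } 1 - \delta .
```

The second statement is obtained by setting the right-hand side of the tail bound equal to \delta and solving for t.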
Lemma D.2 (Lemma C.1 of Wang et al. [2021]).
Let be as defined in Algorithm 1. Then for all and , we have .
Lemma D.3 (Lemma 12 of Abbasi-Yadkori et al. [2011]).
Suppose are two positive definite matrices satisfying that , then for any , we have
[TABLE]
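A quick numerical sanity check may help when reading this lemma. The sketch below assumes the lemma is used in its determinant-ratio form for inverse norms: if A dominates B in the positive semi-definite order, then x^T B^{-1} x <= (det A / det B) x^T A^{-1} x. It works in two dimensions only, and the helper names (`det2`, `inv2`, `quad`, `check_once`) are hypothetical.

```python
import random


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = det2(m)
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def quad(m, x):
    """Quadratic form x^T m x."""
    return sum(x[i] * m[i][j] * x[j] for i in range(2) for j in range(2))


def check_once(rng):
    """Check x^T B^{-1} x <= (det A / det B) * x^T A^{-1} x for one random draw."""
    v = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    # B = I + v v^T is positive definite; A = B + w w^T dominates B.
    B = [[(1.0 if i == j else 0.0) + v[i] * v[j] for j in range(2)] for i in range(2)]
    A = [[B[i][j] + w[i] * w[j] for j in range(2)] for i in range(2)]
    stale = quad(inv2(B), x)                            # norm under the older matrix
    fresh = det2(A) / det2(B) * quad(inv2(A), x)        # rescaled norm under the newer matrix
    return stale <= fresh + 1e-12
```

In a low-switching analysis, B would play the role of the Gram matrix at the last policy update and A the current one; if the determinant ratio stays below 2 between updates, the stale bonus is at most a factor of sqrt(2) larger than the fresh one.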
Lemma D.4 (Elliptical Potential Lemma, Lemma 26 of Agarwal et al. [2020]).
Consider a sequence of positive semi-definite matrices with and define . Then
[TABLE]
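The elliptical potential bound can also be checked numerically. The sketch below uses the common rank-one special case, which is an assumption rather than the exact statement of Lemma D.4: with Lambda_0 = I and features of norm at most 1, the sum of phi_k^T Lambda_{k-1}^{-1} phi_k over K rounds is at most 2 log det(Lambda_K), which in turn is at most 2 d log(1 + K/d). The function names are hypothetical.

```python
import math
import random


def random_unit_vectors(count, dim, seed=0):
    """Draw `count` random unit vectors in R^dim (illustrative features)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def elliptical_potential(features):
    """Return (sum of phi^T Lambda_{k-1}^{-1} phi, log det Lambda_K) with Lambda_0 = I."""
    d = len(features[0])
    inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    potential = 0.0
    logdet = 0.0
    for phi in features:
        u = [sum(inv[i][j] * phi[j] for j in range(d)) for i in range(d)]
        q = sum(phi[i] * u[i] for i in range(d))  # phi^T Lambda^{-1} phi, in [0, 1]
        potential += q
        # Matrix determinant lemma: each round multiplies det(Lambda) by (1 + q).
        logdet += math.log(1.0 + q)
        # Sherman-Morrison rank-one update of the inverse.
        for i in range(d):
            for j in range(d):
                inv[i][j] -= u[i] * u[j] / (1.0 + q)
    return potential, logdet
```

The first inequality holds term by term because q <= 2 log(1 + q) for q in [0, 1]; the second follows from the AM-GM inequality applied to the eigenvalues of Lambda_K.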
References
- Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Afsar et al. [2021] M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. arXiv preprint arXiv:2101.06286, 2021.
- Agarwal et al. [2020] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. FLAMBE: Structural complexity and representation learning of low rank MDPs. Advances in Neural Information Processing Systems, 33:20095–20107, 2020.
- Bai et al. [2019] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient Q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
- Gao et al. [2021] Minbo Gao, Tianle Xie, Simon S Du, and Lin F Yang. A provably efficient algorithm for linear Markov decision process with low switching cost. arXiv preprint arXiv:2101.00494, 2021.
- Gao et al. [2019] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019.
- Huang et al. [2022] Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, and Tie-Yan Liu. Towards deployment-efficient reinforcement learning: Lower bound and optimality. In International Conference on Learning Representations, 2022.
- Jiang et al. [2017] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning-Volume 70, pages 1704–1713, 2017.
