Provably Efficient Reinforcement Learning via Surprise Bound
Hanlin Zhu, Ruosong Wang, Jason D. Lee

TL;DR
This paper introduces a provably efficient reinforcement learning algorithm that leverages a surprise bound to achieve near-optimal regret with fewer ERM problems, applicable to general value function approximation.
Contribution
The paper presents a new RL algorithm with theoretical guarantees that is computationally efficient and works with general value function classes satisfying Bellman-completeness.
Findings
Achieves an $\tilde{O}(\text{poly}(\iota H)\sqrt{T})$ regret bound.
Requires only $O(H \log K)$ ERM problems, significantly fewer than previous methods.
Effective in both linear and high-dimensional sparse linear settings.
Hanlin Zhu, Department of Electrical Engineering and Computer Sciences, UC Berkeley.
Ruosong Wang, Paul G. Allen School of Computer Science & Engineering, University of Washington.
Jason D. Lee, Electrical and Computer Engineering, Princeton University.
Abstract
Value function approximation is important in modern reinforcement learning (RL) problems, especially when the state space is (infinitely) large. Despite the importance and wide applicability of value function approximation, its theoretical understanding is still not as sophisticated as its empirical success, especially in the context of general function approximation. In this paper, we propose a provably efficient RL algorithm (both computationally and statistically) with general value function approximations. We show that if the value functions can be approximated by a function class $\mathcal{F}$ which satisfies the Bellman-completeness assumption, our algorithm achieves an $\tilde{O}(\text{poly}(\iota H)\sqrt{T})$ regret bound, where $\iota$ is the product of the surprise bound and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes and $T = HK$ is the total number of steps the agent interacts with the environment. Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting. Moreover, our algorithm only needs to solve $O(H \log K)$ empirical risk minimization (ERM) problems, which is far more efficient than previous algorithms that need to solve ERM problems $\Omega(HK)$ times.
1 Introduction
Modern Reinforcement Learning (RL) problems are often challenging due to the huge state spaces, and in practice, value function approximation schemes are usually employed to tackle this issue. Empirically, combining various reinforcement learning algorithms with function approximation schemes has led to tremendous success on various tasks (Mnih et al., 2013, 2015; Silver et al., 2017). However, despite the great empirical success, our theoretical understanding of RL with function approximation is still not as sophisticated as its empirical counterpart. Until recently, most existing theoretical work in RL has been focusing on the tabular setting or the linear setting (Azar et al., 2017; Jin et al., 2018; Yang and Wang, 2019; Wang et al., 2019; Du et al., 2019b, a; Agarwal et al., 2020; Wang et al., 2020a; Du et al., 2020; Jin et al., 2020; Zanette et al., 2020; Li et al., 2020), while in practice, complex function approximators like neural networks are usually employed. Over the years, understanding conditions on the function class that permit sample-efficient RL has evolved into an important open research problem in machine learning theory.
Existing provably efficient RL algorithms that can handle general function approximation (Jiang et al., 2017; Sun et al., 2019; Ayoub et al., 2020; Jin et al., 2021; Du et al., 2021) usually require solving computationally intractable optimization problems and are therefore computationally inefficient. Recently, Wang et al. (2020b) proposed a provably efficient RL algorithm with general function approximation for function classes with bounded eluder dimension. The algorithm by Wang et al. (2020b) is based on Least Squares Value Iteration (LSVI) and the principle of “optimism in the face of uncertainty”. There are two shortcomings in the work of Wang et al. (2020b). First, in order to calculate the exploration bonus, their algorithm applies sensitivity sampling (Langberg and Schulman, 2010; Feldman and Langberg, 2011; Feldman et al., 2013) to reduce the size of the replay buffer. Using a replay buffer with bounded complexity to calculate the exploration bonus is crucial for the correctness of their algorithm. However, such a step is complicated in nature and could be hard to implement in practice. Therefore, to make the algorithm practical, it is much more desirable to use simpler dimensionality reduction techniques (like uniform sampling) without sacrificing the theoretical guarantee. Second, as mentioned in Foster et al. (2018), showing examples with a small eluder dimension beyond linearly parameterized functions is challenging. In addition, taking the worst case over all histories, as in the definition of the eluder dimension, is usually overly pessimistic in practice. In contextual bandits, it is known that provable efficiency can be established by assuming distributional conditions on the problem. For example, Foster et al. (2018) establish a regret bound for an optimism-based contextual bandit algorithm by assuming a bounded surprise bound. It is natural to ask whether similar conditions can be used to establish the provable efficiency of RL algorithms.
Recently, Foster et al. (2020) established instance-dependent regret bounds for contextual bandits and reinforcement learning problems by assuming a bounded disagreement coefficient, which is a distribution-dependent assumption. Foster et al. (2020) show that the disagreement coefficient is always upper bounded by the eluder dimension of the function class. The RL algorithm in Foster et al. (2020), which is also based on Least Squares Value Iteration (LSVI) and the principle of “optimism in the face of uncertainty”, has two drawbacks. First, their algorithm achieves provable guarantees only in the block MDP setting which might not be realistic in practice. Second, when calculating the exploration bonus, their algorithm uses the star hull to reduce the complexity of the replay buffer, which is also complicated in nature and therefore difficult to implement in practice.
In this paper, we develop a novel provably efficient RL algorithm with general function approximation. Similar to previous algorithms (Wang et al., 2020b; Foster et al., 2020), our algorithm is an optimistic version of LSVI. Compared to previous ones, our algorithm has the following advantages:
- The regret bound of our algorithm is based on a variant of the surprise bound proposed in Foster et al. (2018), which is a distribution-dependent quantity and could therefore be smaller than the eluder dimension, which considers the worst case over all histories. Moreover, our theory does not rely on the block MDP assumption. Furthermore, the surprise bound can be upper bounded in the tabular setting, the linear setting and the high-dimensional sparse linear setting, which implies that our algorithm achieves a reasonable regret bound in all three settings.
- The dimensionality reduction technique for reducing the complexity of the replay buffer is based on uniform sampling. This is much simpler than the sensitivity sampling framework in Wang et al. (2020b) and the method based on the star hull in Foster et al. (2020).
- Our algorithm requires solving only $O(H \log K)$ empirical risk minimization (ERM) problems, while previous algorithms (Wang et al., 2020b; Foster et al., 2020) require solving $\Omega(HK)$ ERM problems.
1.1 Related work
Tabular reinforcement learning.
Tabular RL is well studied in the context of sample complexity and regret bound in numerous literature (Kearns and Singh, 2002; Kakade, 2003; Strehl et al., 2006, 2009; Jaksch et al., 2010; Azar et al., 2013; Lattimore and Hutter, 2014; Dann and Brunskill, 2015; Agrawal and Jia, 2017; Azar et al., 2017; Jin et al., 2018; Dann et al., 2019; Zanette and Brunskill, 2019; Zhang et al., 2020; Wang et al., 2020a; Yang et al., 2021). In particular, for episodic MDP without further assumptions, the best regret bound is for both model-based (Azar et al., 2017) and model-free (Zhang et al., 2020) algorithms, which matches the lower bound proved by Jin et al. (2018). Recently, Yang et al. (2021) propose an RL algorithm with a regret bound of assuming the existence of a positive sub-optimality gap. However, all algorithms mentioned above cannot be applied to RL problems with huge or infinite state spaces due to the polynomial dependence on in the regret bound. Therefore, in this paper, we assume the value function lies in a function class with bounded complexity and design a provably efficient algorithm whose regret bound depends polynomially on the complexity of the function class instead of the size of the state space.
Bandits.
There is also rich literature studying stochastic (contextual) bandits, which can be viewed as a special case of MDP without state transitions (Auer, 2002; Dani et al., 2008; Li et al., 2010; Rusmevichientong and Tsitsiklis, 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Foster et al., 2018, 2020; Li et al., 2019). In particular, Foster et al. (2018) study contextual bandit problems with general value function approximation, and prove their algorithms could achieve a regret bound depending polynomially on the surprise bound and the implicit exploration coefficient (IEC). In this paper, we study RL with general value function approximation, and prove that the regret bound of our algorithm also depends on the (slightly modified) surprise bound as well as the log-covering numbers. However, we note that the RL setting is much more complicated than the contextual bandits setting since there is no state transition in bandit problems.
Reinforcement learning with function approximation.
In the setting of linear function approximation, there has been great interest recently in the theoretical analysis of the sample complexity of RL algorithms (Yang and Wang, 2019, 2020; Jin et al., 2020; Cai et al., 2020; Du et al., 2019b, 2020; Wang et al., 2019; Zanette et al., 2020; Zhou et al., 2021). Compared to linear function approximation, however, many current provably efficient algorithms for general value function approximation are relatively impractical. For example, the algorithms in Jiang et al. (2017); Sun et al. (2019); Dong et al. (2020) achieve regret bounds in terms of the witness rank or the Bellman rank, but they are not computationally efficient. Foster et al. (2020) devise the RegRL algorithm, which is both computationally and statistically efficient. However, it requires the block MDP assumption, which greatly alleviates the difficulty of an (infinitely) large state space and might not be realistic in practice. Ayoub et al. (2020) propose a model-based algorithm and Wang et al. (2020b) propose a model-free algorithm for general value function approximation, and the regret bounds of both algorithms depend on the eluder dimension. Kong et al. (2021) propose an algorithm for general value function approximation that is efficient both computationally and statistically, and whose regret bound also depends on the eluder dimension. However, the eluder dimension considers the worst case over all histories and is thus often overly pessimistic. Instead, the regret bound of our algorithm depends polynomially on the surprise bound, which is a distribution-dependent quantity and thus could be smaller than the eluder dimension in practical scenarios.
2 Preliminaries
In this paper, we study an episodic Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, H, P, r, \mu)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the finite action space, $H$ is the planning horizon, $P$ is the transition kernel which maps a state-action pair to a distribution over the state space, $r$ is the reward function and $\mu$ is the initial state distribution. (Footnote 1: Our analysis can be naturally extended to the time-inhomogeneous setting where the reward function and the transition kernel are different for each step $h \in [H]$.)
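A (stochastic) policy
$$\pi = \{\pi_h : \mathcal{S} \to \Delta(\mathcal{A})\}_{h \in [H]}$$
maps any state to a distribution over the action space at each step $h \in [H]$, where we use $[N]$ to denote the set $\{1, 2, \ldots, N\}$ for any positive integer $N$. A trajectory
$$\tau = \{(s_h, a_h)\}_{h \in [H]}$$
is induced by a policy $\pi$ if $s_1 \sim \mu$, $a_h \sim \pi_h(s_h)$ and $s_{h+1} \sim P(s_h, a_h)$. Furthermore, a policy $\pi$ is deterministic if for each step $h \in [H]$, $\pi_h$ maps a state to only one action.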
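For any policy $\pi$, the expected cumulative reward starting from state $s$ at step $h$ is defined as the value function
$$V_h^{\pi}(s) = \mathbb{E}^{\pi}\left[\sum_{h'=h}^{H} r(s_{h'}, a_{h'}) \,\middle|\, s_h = s\right],$$
where we use the superscript $\pi$ to denote that the trajectory is induced by $\pi$. Similarly, the expected cumulative reward starting from state-action pair $(s, a)$ at step $h$ is defined as the $Q$-function
$$Q_h^{\pi}(s, a) = \mathbb{E}^{\pi}\left[\sum_{h'=h}^{H} r(s_{h'}, a_{h'}) \,\middle|\, s_h = s, a_h = a\right].$$
Let $\pi^*$ denote the optimal policy which maximizes $V_1^{\pi}$. Also, let $V_h^* = V_h^{\pi^*}$ and $Q_h^* = Q_h^{\pi^*}$.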
The agent interacts with the environment for $K$ episodes. At the beginning of each episode $k \in [K]$, the agent specifies a policy $\pi^k$ based on previous trajectories and interacts with the environment using $\pi^k$ for $H$ steps. We assume the agent knows the number of episodes $K$, and we define $T = KH$ to be the total number of steps that the agent interacts with the environment. The regret of an algorithm after $K$ episodes is defined as
$$\mathrm{Reg}(K) = \sum_{k=1}^{K} \left( V_1^*(s_1^k) - V_1^{\pi^k}(s_1^k) \right),$$
where $s_1^k$ is the initial state of episode $k$. The regret compares the accumulated rewards between the agent’s policy and the optimal policy. The goal of the agent is to minimize the regret. In this paper, we consider the typical regime that $H$ is fixed while $K$ grows to infinity.
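To make the regret definition concrete, the following minimal sketch computes it exactly for a small tabular MDP by backward induction. The array layout (`P` of shape `(S, A, S)`, `r` of shape `(S, A)`, policies of shape `(H, S, A)`) and all function names are illustrative assumptions rather than notation from the paper. The regret is taken to be the sum, over episodes, of the gap between the optimal value and the value of the played policy at that episode's initial state.

```python
import numpy as np

def evaluate_policy(P, r, pi, H):
    """Return V of shape (H + 1, S); V[h][s] is the value of pi from step h (0-indexed)."""
    V = np.zeros((H + 1, r.shape[0]))
    for h in range(H - 1, -1, -1):
        Q = r + P @ V[h + 1]              # Q[s, a] = r(s, a) + E[V_{h+1}(s')]
        V[h] = (pi[h] * Q).sum(axis=1)    # average over the policy's action distribution
    return V

def optimal_values(P, r, H):
    """Finite-horizon value iteration; returns V* of shape (H + 1, S)."""
    V = np.zeros((H + 1, r.shape[0]))
    for h in range(H - 1, -1, -1):
        V[h] = (r + P @ V[h + 1]).max(axis=1)
    return V

def regret(P, r, H, policies, initial_states):
    """Sum over episodes of V*_1(s_1) - V^{pi_k}_1(s_1)."""
    V_star = optimal_values(P, r, H)
    return sum(V_star[0, s1] - evaluate_policy(P, r, pi, H)[0, s1]
               for pi, s1 in zip(policies, initial_states))
```

In practice the value functions are unknown to the agent, so this quantity is an analysis tool rather than something the algorithm computes.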
Width function and norms.
For notational convenience, we define the width function for any function class and several norms for any function . The width function is defined as
[TABLE]
. For any dataset and , define -norm
[TABLE]
-norm
[TABLE]
and infinity norm
[TABLE]
respectively. In addition, define for any .
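The displayed definitions are not reproduced in this text, so the following sketch assumes common forms: the width at a point is the largest gap between any two functions of the class at that point, the dataset norm is the square root of the sum of squared function values over a multiset of points, and the infinity norm is the largest absolute value. Functions are represented as arrays of values over a finite set of state-action pairs, and all names are hypothetical.

```python
import numpy as np

def width(F_values):
    """F_values has shape (num_functions, num_points).
    Returns, for each point, max over pairs f, f' of |f(point) - f'(point)|."""
    return F_values.max(axis=0) - F_values.min(axis=0)

def dataset_norm(f_values, indices):
    """sqrt(sum of f(z)^2 over z in a multiset Z), with Z given as point indices."""
    vals = f_values[np.asarray(indices, dtype=int)]
    return float(np.sqrt(np.sum(vals ** 2)))

def sup_norm(f_values):
    """Largest absolute value of f over all points."""
    return float(np.max(np.abs(f_values)))
```

Repeated indices in `indices` are counted with multiplicity, which matches the treatment of the replay buffer as a multiset.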
Additional notations for algorithms.
For any finite multiset , let denote the uniform distribution over and denote the number of distinct elements in . For any , let denote the integer part of and define if is not an integer and otherwise . We use the standard notations to hide constants and use to suppress log factors. Also, we use to denote that there exists a constant s.t. , and use if .
3 Algorithm
In this section, we first introduce the assumptions for the algorithm and then present our main algorithm (Algorithm 1). The theoretical guarantee of our algorithm is presented in Section 4.
3.1 Assumptions
Assume our algorithm (Algorithm 1) receives a function class $\mathcal{F}$ as part of the input. Since the complexity of $\mathcal{F}$ determines the efficiency of the algorithm, it is natural and necessary to require bounded complexities of the function class under appropriate measures. We make the following assumptions on the function class $\mathcal{F}$.
**Assumption 1 (Bellman-completeness).**
For any function , there exists a function , s.t.
[TABLE]
Assumption 1 indicates the closedness under Bellman equations. This is a general assumption that summarizes many previous assumptions in special settings and is commonly adopted in previous literature for general value function approximation (Wang et al., 2020b; Foster et al., 2020; Kong et al., 2021). For tabular RL, can be chosen as the set of all functions mapping from to . In the linear MDP setting (Bradtke and Barto, 1996; Jin et al., 2020; Yang and Wang, 2019, 2020; Wang et al., 2019) where the transition kernel and the reward function are both linear in a feature map , can be the set of all linear functions with respect to . In sparse high-dimensional linear MDP settings where the transition kernel and the reward function are both -sparse linear functions in , can be the set of all -sparse linear functions with respect to . Furthermore, Assumption 1 approximately holds in practice as long as is rich enough (e.g., deep neural networks) and we show in Section 5 that our algorithm is robust to model misspecification.
**Assumption 2 (Bounded covering number).**
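Given any $\varepsilon > 0$, there exist covering sets $\mathcal{C}(\mathcal{F}, \varepsilon) \subseteq \mathcal{F}$ and $\mathcal{C}(\mathcal{S} \times \mathcal{A}, \varepsilon) \subseteq \mathcal{S} \times \mathcal{A}$ with bounded size $\mathcal{N}(\mathcal{F}, \varepsilon)$ and $\mathcal{N}(\mathcal{S} \times \mathcal{A}, \varepsilon)$ respectively, where
- $\forall f \in \mathcal{F}$, $\exists f' \in \mathcal{C}(\mathcal{F}, \varepsilon)$, s.t. $\|f - f'\|_\infty \le \varepsilon$.
- $\forall (s, a) \in \mathcal{S} \times \mathcal{A}$, $\exists (s', a') \in \mathcal{C}(\mathcal{S} \times \mathcal{A}, \varepsilon)$, s.t. $\max_{f \in \mathcal{F}} |f(s, a) - f(s', a')| \le \varepsilon$.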
Assumption 2 requires bounded covering numbers for both and , and the regret bound of our algorithm depends only logarithmically on the covering numbers (Theorem 1). In the tabular RL setting, and . In -dimensional linear MDP settings, and . In -sparse high-dimensional linear MDP settings, . If we further assume that is -sparse for all , then .
Surprise bound.
Another important complexity measure in this paper is surprise bound, which was first introduced in Foster et al. (2018) to characterize the complexity of the function class in the contextual bandit setting.
**Definition 1 (Surprise bound).**
The surprise bound is the smallest positive constant s.t.
[TABLE]
for all and any policy , where is the distribution of when the policy is .
Intuitively, the surprise bound is small if all pairs of functions with a small expected squared error with respect to any policy do not incur a much larger squared error on any state-action pair. The following proposition gives upper bounds on the surprise bound for the linear and sparse linear settings (see Appendix C for the proof).
**Proposition 3.1.**
In the (sparse) linear MDP setting with a fixed feature map , consider the function class for some .
- •
If and , then
[TABLE]
- •
If and , then
[TABLE]
where is the minimum restricted eigenvalue for -sparse predictors (Raskutti et al., 2010).
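As a numerical illustration of why a well-conditioned feature covariance keeps the surprise bound small in the linear case, the sketch below checks an elementary Cauchy-Schwarz bound. It assumes the surprise ratio of a function pair is the largest squared difference over points divided by the expected squared difference under a single fixed distribution `mu`, since the displayed definition is not reproduced in this text. The bound `max ||phi||^2 / lambda_min` is the elementary one and is not claimed to match the constants of Proposition 3.1; all names are hypothetical.

```python
import numpy as np

def surprise_ratio(features, mu, delta):
    """Ratio sup_z (phi(z)^T delta)^2 / E_{z ~ mu}[(phi(z)^T delta)^2]
    for a pair of linear functions whose parameters differ by delta."""
    diff_sq = (features @ delta) ** 2
    return float(diff_sq.max() / (mu @ diff_sq))

def linear_surprise_upper_bound(features, mu):
    """Elementary bound: (phi^T delta)^2 <= ||phi||^2 ||delta||^2 and
    delta^T Sigma delta >= lambda_min(Sigma) ||delta||^2."""
    Sigma = features.T @ (mu[:, None] * features)   # feature covariance under mu
    lam_min = np.linalg.eigvalsh(Sigma)[0]          # eigenvalues come in ascending order
    max_sq_norm = np.linalg.norm(features, axis=1).max() ** 2
    return float(max_sq_norm / lam_min)
```

When the smallest eigenvalue approaches zero the bound blows up, which mirrors the role of the eigenvalue lower bound assumed in the linear and sparse linear settings.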
3.2 Algorithm
In this section, we present our main algorithm (Algorithm 1) and discuss in detail several important components of our algorithm.
3.2.1 Doubling epoch schedule
Our algorithm consists of epochs where each epoch starts at the beginning of episode and consists of episodes. Thus, the total number of episodes and . At the beginning of epoch , the algorithm fixes a policy and the agent executes for all episodes . The epochs can be divided into two phases.
- •
Phase 1: Warm-up epochs. For the first epochs, the agent plays a uniformly random policy. These warm-up epochs are designed to encourage exploration at the initial episodes.
- •
Phase 2: Optimistic LSVI. Starting from epoch , we use an optimistic version of Least Squares Value Iteration (LSVI) similar to Jin et al. (2020); Wang et al. (2019, 2020b); Foster et al. (2020). At the beginning of each epoch , we maintain all previous trajectories as a replay buffer, and find the best fit with respect to the replay buffer in the sense of mean squared error (MSE), i.e.,
[TABLE]
where is the replay buffer (see definition in Algorithm 1). To avoid overfitting and encourage exploration, we design a bonus function which we will discuss later in Section 3.2.2, and approximate the optimal function by
[TABLE]
Our design of the bonus function ensures that the resulting estimate is an optimistic estimator of the optimal $Q$-function with high probability (Lemma 6). Finally, for each episode in the epoch, the agent plays the greedy policy with respect to this estimate and collects the trajectory of that episode.
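The optimistic LSVI step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the general ERM step is replaced by ridge regression over a linear class with a hypothetical feature map `phi`, the bonus is passed in as an arbitrary callable, and the estimate is clipped at `H`, which presumes per-step rewards in `[0, 1]`.

```python
import numpy as np

def optimistic_lsvi(buffer, phi, actions, H, bonus, reg=1e-6):
    """buffer[h] is a list of transitions (s, a, r, s_next) seen at step h (0-indexed).
    Returns a list Q of callables, Q[h](s, a) = min(fit + bonus, H)."""
    d = len(phi(*buffer[0][0][:2]))
    q_next = lambda s, a: 0.0                      # value after the last step is zero
    Q = [None] * H
    for h in range(H - 1, -1, -1):
        X = np.array([phi(s, a) for s, a, _, _ in buffer[h]], dtype=float)
        # Regression targets: reward plus optimistic value of the next state.
        y = np.array([r + max(q_next(s2, b) for b in actions)
                      for _, _, r, s2 in buffer[h]], dtype=float)
        theta = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

        def q_h(s, a, theta=theta, h=h):
            return float(min(phi(s, a) @ theta + bonus(h, s, a), H))

        Q[h] = q_h
        q_next = q_h
    return Q

def greedy_action(Q_h, s, actions):
    """Action played by the greedy policy with respect to Q_h."""
    return max(actions, key=lambda a: Q_h(s, a))
```

Binding `theta` and `h` as default arguments avoids the late-binding closure pitfall, so each `Q[h]` keeps its own fitted parameters.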
The advantages of the doubling epoch schedule are twofold:
- Computationally efficient. Our algorithm performs heavy computation only at the beginning of each epoch (computing the best fit by empirical risk minimization and the bonus function by the width function as in Section 3.2.2, which can often be done efficiently by appropriate optimization methods or by assuming access to appropriate regression oracles (Wang et al., 2020b; Foster et al., 2018)), and there are only $O(\log K)$ epochs. Our algorithm is therefore much more computationally efficient than previous methods (Wang et al., 2020b; Foster et al., 2020), which require solving many more equivalent optimization problems.
  Recently, Kong et al. (2021) propose an online sub-sampling technique which improves the computational complexity of Wang et al. (2020b). However, our algorithm is still much more computationally efficient than that of Kong et al. (2021). The algorithm of Kong et al. (2021) adopts sensitivity sampling, which requires computing the sensitivity of each state-action pair. Since the calculation of a sensitivity requires multiple calls to a regression oracle (see Section 4.4 in Kong et al. (2021)), and this must be done for every state-action pair in the dataset, their algorithm needs many regression oracle calls to calculate sensitivities and subsample the dataset. In our algorithm, by contrast, we use uniform sampling to avoid the complex and time-consuming sensitivity calculation and thus do not need any oracle to perform the subsampling procedure.
- Stabilizing adjacent trajectories. The doubling epoch schedule together with the warm-up epochs stabilizes the adjacent trajectories by ensuring that at the beginning of each epoch, at least half of the historical trajectories in the replay buffer are induced by the same policy. This property enables us to adopt uniform sampling (Algorithm 2) to reduce the complexity of the replay buffer.
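The two properties above can be checked on a concrete doubling schedule. The sketch assumes that epoch `i` (1-indexed) lasts `2**(i-1)` episodes, which is one natural reading of "doubling" and may differ from the exact indexing in Algorithm 1. Under this assumption `N` epochs cover `2**N - 1` episodes, so the number of epochs grows logarithmically with the number of episodes, and the most recent epoch always accounts for more than half of the history.

```python
def doubling_epochs(num_epochs):
    """Return (start_episode, length) for epochs 1..num_epochs, with length 2**(i-1)."""
    epochs, start = [], 1
    for i in range(1, num_epochs + 1):
        length = 2 ** (i - 1)
        epochs.append((start, length))
        start += length                  # next epoch begins right after this one
    return epochs

def last_epoch_fraction(epochs, i):
    """Fraction of the history, at the start of epoch i (0-indexed, i >= 1),
    that was collected during the immediately preceding epoch."""
    history = epochs[i][0] - 1
    return epochs[i - 1][1] / history
```

Since the policy is fixed within an epoch, a fraction above one half means that most of the replay buffer was generated by a single policy.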
3.2.2 Uniform sampling
An important technical novelty of our algorithm is the design of the bonus function via uniform sampling. To ensure optimism of our estimator , we can choose as the upper bound of the difference between and . If we are able to obtain a confidence region which contains both and , it suffices to define the bonus function as the width function of .
A naive way to choose the confidence region is with a carefully selected . However, since the confidence region depends on the whole replay buffer with size at most , the confidence region and thus the bonus function would suffer extremely high complexity. This implies that needs to be set extremely large to ensure the accuracy of the confidence region. To obtain a bonus function with low complexity, we reduce the complexity of the replay buffer by uniform sampling, which is formally stated in Algorithm 2.
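Algorithm 2 is not reproduced in this text, so the following is only an assumed form of uniform subsampling: every element of the replay buffer is kept independently with the same probability `p`, and kept elements are reweighted by `1/p` so that weighted squared norms on the reduced dataset are unbiased estimates of the norms on the full dataset. The function names and the reweighting scheme are assumptions made for illustration.

```python
import random
from collections import Counter

def uniform_subsample(dataset, p, rng):
    """Keep each element independently with probability p.
    Returns a mapping element -> total weight, each kept copy weighing 1/p."""
    weights = Counter()
    for z in dataset:
        if rng.random() < p:
            weights[z] += 1.0 / p
    return weights

def weighted_sq_norm(f, weights):
    """Weighted squared norm of f on the reduced dataset."""
    return sum(w * f(z) ** 2 for z, w in weights.items())
```

Unlike sensitivity sampling, the keep probability does not depend on the element, so no oracle call is needed to decide which elements survive.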
Comparison to previous methods.
The algorithms in Wang et al. (2020b); Foster et al. (2020) also suffer from the high complexity of the bonus function and address the issue by sensitivity sampling and the star hull, respectively. However, sensitivity sampling requires estimating the sensitivity of each state-action pair, which is time-consuming, and the star hull is complicated in nature and thus hard to implement in practice. In contrast, our uniform sampling is conceptually simple and easy to implement. Note that there is only a single parameter to be determined in Algorithm 2. When the surprise bound is known in advance, we can directly calculate the value of this parameter. When the surprise bound is unknown, we can perform a grid search over it in log-space. Specifically, we can set a small value as the lower bound of the surprise bound and a large value as the upper bound, and run Algorithm 1 for each candidate value on a logarithmically spaced grid between them. Then we can pick the policy with the best performance under the different choices.
Theorem 1 shows that the regret of our main algorithm (Algorithm 1) is in dependence. We also emphasize that the above grid-search procedure won’t result in higher total regret, since one can first try each possible for times, and then exploit the best for the remaining steps. The resulting total regret is still .
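The log-space grid search can be sketched as below. The geometric factor of 2, the helper names, and the `evaluate` callback (standing in for "run Algorithm 1 with this candidate and measure the performance of the resulting policy") are illustrative assumptions.

```python
def log_grid(lower, upper, base=2.0):
    """Candidates lower, lower*base, lower*base**2, ... below upper, followed by upper."""
    values, v = [], lower
    while v < upper:
        values.append(v)
        v *= base
    values.append(upper)
    return values

def pick_best(candidates, evaluate):
    """Return the candidate whose evaluation score is largest."""
    return max(candidates, key=evaluate)
```

The grid has about `log(upper / lower)` points, so the search multiplies the cost of running the algorithm by only a logarithmic factor.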
Design of the bonus function via uniform sampling.
Now we are able to design a bonus function with low complexity as in Algorithm 3 via uniform sampling. After obtaining the reduced dataset , we round each data in and the reference function to their nearest neighbors in covering sets. The confidence region and the bonus function is then defined by the rounded reference function and the rounded dataset. Note that in Algorithm 3, the rounding operation does not need to be performed explicitly since all the data are stored in computers with bounded precision, and thus all the data will be implicitly rounded. For the choice of , we can use the same grid-search method of since is also determined by .
Efficient computation of the bonus function.
The computation of the bonus function is equivalent to an optimization problem of the following form:
[TABLE]
This problem can be solved efficiently by either assuming access to an optimization oracle, or assuming access to only a regression oracle (which is a milder assumption than optimization oracles) as mentioned in Section 4.4 of Kong et al. (2021).
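For intuition, the width computation has a closed form in the linear case. The sketch assumes the confidence region constrains the parameter difference `delta` by `delta^T Lambda delta <= beta`, with `Lambda` a positive definite Gram matrix of the features in the (reduced) dataset; this constraint shape is an assumption, since the displayed optimization problem is not reproduced in this text. By the Cauchy-Schwarz inequality the maximum of `|phi^T delta|` over that ellipsoid is `sqrt(beta * phi^T Lambda^{-1} phi)`.

```python
import numpy as np

def linear_width(phi, Lambda, beta):
    """Maximum of |phi^T delta| subject to delta^T Lambda delta <= beta."""
    return float(np.sqrt(beta * (phi @ np.linalg.solve(Lambda, phi))))

def width_maximizer(phi, Lambda, beta):
    """A parameter difference delta that attains the maximum above."""
    direction = np.linalg.solve(Lambda, phi)        # Lambda^{-1} phi
    return np.sqrt(beta / (phi @ direction)) * direction
```

For general function classes no such closed form exists, which is why the text falls back on optimization or regression oracles.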
4 Theoretical results
In this section, we formally present our main theorem of the regret bound and defer the proof to Appendix B.
**Theorem 1 (Main theorem).**
Under Assumptions 1 and 2, let where the number of total steps is sufficiently large. With probability at least , the regret of Algorithm 1 is at most
[TABLE]
where
Proof sketch.
In this proof sketch, we ignore the rounding operation in Algorithm 3 for convenience. The proof can be decomposed into three main steps.
- •
Step 1: Bounding the complexity of the bonus function. First, we show that our bonus function has low complexity (Proposition A.2). Note that the bonus function is defined as the width function of the confidence region
[TABLE]
Since the reduced dataset has bounded size (Lemma 1) and bounded number of distinct elements (Lemma 3), our bonus function which is defined by also has low complexity. Now it remains to show that the bonus function defined over the reduced dataset is (almost) the same as the bonus function defined over the original dataset . It is equivalent to show that the confidence region remains (almost) unchanged after uniform sampling. This can be proved by showing that for any function pairs , the -norm of approximates well the -norm of (Lemma 2). For a fixed function pair , is an unbiased estimator of and its variance can be controlled, since the trajectories in the replay buffer are stabilized by the doubling epoch and thus has low complexity after uniform sampling. Then we can apply the Bernstein inequality to a fixed function pair to show that is close to with high probability. Applying a union bound over all function pairs in the covering set of , we can obtain the desired result.
- •
Step 2: Optimism of the estimated -function. The next step is to show that the estimated -function is an optimistic version of the true -function of the optimal policy (Lemma 6). To achieve this, we need to show that the best fit is close to . If and are independent, a standard concentration argument concludes the result. However, and are subtly dependent since they are both determined by the previous dataset. To address the difficulty, we first apply the standard concentration result on a fixed (Lemma 4), and then apply a union bound over all in a covering set (Lemma 5) to obtain the result. This method is similar to Wang et al. (2020b).
- •
Step 3: Regret decomposition. Finally, we decompose the regret by the summation of the bonus functions (Lemma 7). Then, we use similar arguments as in Foster et al. (2018) to bound each bonus term by the surprise bound separately since the bonus function is defined as the (approximate) width function of the confidence region.
∎
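For reference, the concentration step in Step 1 invokes the Bernstein inequality. One standard form, stated here from general probability theory rather than copied from the paper's appendix, is the following. Let $X_1, \ldots, X_n$ be independent zero-mean random variables with $|X_i| \le M$ almost surely. Then for every $t > 0$,

$$\Pr\left[\left|\sum_{i=1}^{n} X_i\right| \ge t\right] \le 2\exp\left(-\frac{t^2/2}{\sum_{i=1}^{n} \mathbb{E}[X_i^2] + Mt/3}\right).$$

The variance term in the denominator is what makes a controlled variance of the subsampled squared differences useful: a small variance yields a tighter deviation bound than a bound that uses only the range $M$.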
**Remark 1.**
Recently, Foster et al. (2021) propose a high-level algorithm, E2D. When applying the E2D algorithm to our setting, one can show that it also achieves a similar $\tilde{O}(\sqrt{T})$ regret bound (other parameters omitted). However, we want to emphasize that the E2D algorithm is too high-level to implement in practice. Its implementation requires an online estimation oracle (see Algorithm 1 in Foster et al. (2021)), which is a very strong assumption in RL settings. In contrast, our algorithm only requires an ERM oracle and a regression oracle, which are mild and common assumptions in machine learning problems.
While our algorithm works for general value function class, it also achieves reasonable regret in special cases.
Tabular settings.
In the tabular RL setting, it holds that and . When and for all , for a (not too) small positive value , , which implies that the regret bound is . This is a reasonable regret bound since it is optimal in terms of , the most important term in the regret bound, and has polynomial dependency in other parameters.
Linear settings.
When is a -dimensional linear function class, we have . When
[TABLE]
is lower bounded (of order ) and thus by Proposition 3.1, the regret bound is , which is optimal in -dependency and matches the result of Wang et al. (2020b) in -dependency.
Sparse linear settings.
Furthermore, when is an -sparse high-dimensional linear function class where typically , we have . When
[TABLE]
is lower bounded (of order ) and thus is by Proposition 3.1, the regret bound is . If we further assume that is -sparse for all , we have and thus obtain an regret bound. However, directly applying the result in linear settings of Wang et al. (2020b) can only obtain a linear regret when . This shows the superiority of our algorithm since we can provide theoretical guarantee for more general function classes, and thus it is an important step toward studying general value function approximation beyond the tabular and linear settings.
We also emphasize a subtle difference between linear and sparse linear settings. In linear settings, when is lower bounded, we typically expect it to be of order since we assume the 2-norm . While for sparse linear settings, when is lower bounded, we typically expect it to be of order since we assume the infinity norm in this setting.
5 Model Misspecification
Our main theorem (Theorem 1) requires the Bellman-completeness assumption (Assumption 1). Although the Bellman-completeness assumption is fairly common in theoretical analysis, especially in the presence of general value function approximation, the ground truth model together with the function class might slightly violate this assumption in real-world scenarios. This phenomenon is known as model misspecification (Jin et al., 2020; Wang et al., 2020b).
In this section, we show that as long as the violation of the Bellman-completeness assumption is small, the regret of our algorithm is still bounded. To state the result formally, we first introduce the following assumption, which can be viewed as a model misspecification version of the Bellman-completeness assumption.
**Assumption 3 (Model misspecification).**
There exists a constant satisfying that for any function , there exists a function , s.t.
[TABLE]
Under Assumption 3, one can directly apply Algorithm 1 to the model misspecification setting with only a different choice of the parameter in Algorithm 3. Specifically, for some constant we set
[TABLE]
Note that when Assumption 1 holds, it is equivalent to Assumption 3 with , and thus the parameter is exactly the same as the one in our original algorithm. The following theorem provides theoretical guarantees of our algorithm for model misspecification, and the proof is attached in Appendix D, which is very similar to the proof of Theorem 1.
**Theorem 2 (Theoretical guarantee for model misspecification).**
Under Assumptions 2 and 3, let and the number of total steps . With probability at least , the regret of Algorithm 1 (where the parameter is defined as in (1)) is at most
[TABLE]
where
6 Conclusion
In this paper, we propose a provably efficient RL algorithm (both computationally and statistically) with general value function approximation. The regret bound of our algorithm depends on the surprise bound, which is a distribution-dependent quantity and could therefore be smaller than the eluder dimension considered in previous work. Our algorithm achieves reasonable regret bounds when instantiated with special function classes.
As a future direction, it would be interesting to see if it is possible to establish the provable efficiency of RL algorithms using other distribution-dependent complexity measures. For example, it would be interesting to study whether it is possible to design a provably efficient RL algorithm by assuming a bounded disagreement coefficient (as in Foster et al. (2020)) but without the block MDP assumption.
Appendix A Analysis of the bonus function
In this section, we analyze our bonus function, and the main proposition is presented in Proposition A.2.
A.1 Analysis of Algorithm 2
Note that the notation in Algorithm 3 and Algorithm 2 are different. In this subsection, all the notation refer to in Algorithm 2, and therefore, . Also, let throughout this subsection.
We assume that the input dataset of Algorithm 2 is where more than half of the trajectories are induced by the same policy and the number of trajectories
[TABLE]
which is satisfied if and is chosen as in Theorem 1.
The first lemma gives an upper bound on the size of the dataset produced by uniform sampling.
Lemma 1.
With probability at least , .
Proof.
We define random variable
[TABLE]
Since and , we can obtain
[TABLE]
by Markov's inequality. ∎
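The argument above is an application of Markov's inequality to the size of a randomly subsampled dataset. As a rough numerical illustration, the following sketch keeps each of 200 items independently with a hypothetical probability `probs[i]` (these probabilities are made up for the demo and are not the ones used by Algorithm 2) and compares the empirical frequency of the event that the subsample exceeds `c` times its expected size with the Markov bound `1/c`.

```python
import random

def subsample_size(probs, rng):
    """Number of items kept when item i is kept independently with probability probs[i]."""
    return sum(1 for p in probs if rng.random() < p)

def markov_check(probs, c, trials=2000, seed=0):
    """Empirical frequency of {size >= c * E[size]} together with the Markov bound 1/c."""
    rng = random.Random(seed)
    expected = sum(probs)  # E[size] by linearity of expectation
    threshold = c * expected
    exceed = sum(subsample_size(probs, rng) >= threshold for _ in range(trials))
    return exceed / trials, 1.0 / c

# Hypothetical sampling probabilities, chosen only for illustration.
probs = [min(1.0, 5.0 / (i + 1)) for i in range(200)]
freq, bound = markov_check(probs, c=4.0)
```

Markov's inequality only uses the expectation of the size, so it is typically loose; its advantage in the proof is that no independence or variance information is needed.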
The next lemma proves that, after uniform sampling, the norm of the difference of any pair of functions is approximately preserved with high probability.
Lemma 2.
With probability at least , for any ,
[TABLE]
Proof.
When , , the result directly holds. So we only consider the case when , which means
[TABLE]
We separately consider the cases when and .
For any function pair where , conditioned on the event in Lemma 1 which holds with probability at least , we can obtain that Also, by the fact that and , we can conclude that
[TABLE]
In the remaining part of the proof, we consider the case that .
We first fix any pair of distinct functions . Assume the first trajectories are all induced by the same policy . Also, for any , let
[TABLE]
Therefore,
[TABLE]
Note that
[TABLE]
Also, by Definition 1,
[TABLE]
Therefore, by Hoeffding’s inequality,
[TABLE]
Setting , we can obtain
[TABLE]
Let denote the event that
[TABLE]
then .
Now, we condition on for the following analysis. For each , define
[TABLE]
Obviously, and . Also,
[TABLE]
Moreover,
[TABLE]
Then, by Azuma-Bernstein’s Inequality,
[TABLE]
Since the above inequality holds conditioned on , if we do not condition on ,
[TABLE]
By union bound, the inequality above implies that with probability at least , for any ,
[TABLE]
Denote the event above and the event in Lemma 1 by , where
[TABLE]
Now we condition on where . For any function pair where , there exists , s.t.
[TABLE]
Therefore,
[TABLE]
by . Then we can obtain that
[TABLE]
By similar methods, we can also obtain that
[TABLE]
∎
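To make the statement of Lemma 2 more concrete, here is a minimal sketch of norm preservation under subsampling. It assumes a simplified scheme in which every data point is kept with a common probability `p` and reweighted by `1/p`; this scheme, the synthetic differences `diffs`, and all numbers are assumptions for illustration and may differ from the exact procedure in Algorithm 2.

```python
import random

def sq_norm(diffs, weights=None):
    """Weighted sum of squared differences, i.e. a squared dataset norm of f1 - f2."""
    if weights is None:
        weights = [1.0] * len(diffs)
    return sum(w * d * d for w, d in zip(weights, diffs))

def subsample(diffs, p, rng):
    """Keep each entry independently with probability p and reweight survivors by 1/p."""
    kept = [d for d in diffs if rng.random() < p]
    return kept, [1.0 / p] * len(kept)

rng = random.Random(0)
# Hypothetical values of f1(z) - f2(z) on 20000 data points, bounded in [-1, 1].
diffs = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
full = sq_norm(diffs)
kept, weights = subsample(diffs, p=0.25, rng=rng)
approx = sq_norm(kept, weights)
ratio = approx / full  # close to 1 when the norm is approximately preserved
```

The reweighting makes the subsampled squared norm an unbiased estimate of the full one, and boundedness of the differences is what allows a concentration argument to control the deviation.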
We also bound the number of distinct elements in .
Lemma 3.
With probability at least ,
Proof.
First, note that
[TABLE]
since for any , there must exist s.t. is an integer.
When , which means and
[TABLE]
we have
[TABLE]
When , we have . Now, for each , define
[TABLE]
Then the number of distinct elements in is upper bounded by . Since ,
[TABLE]
By Chernoff bound,
[TABLE]
∎
A.2 Analysis of Algorithm 3
In this subsection, all the notation refer to in Algorithm 3. In other words, we replace all the in Section A.1 by . Also, we still assume that the input dataset of Algorithm 3 is where more than half of the trajectories are induced by the same policy and the number of trajectories satisfies
[TABLE]
which is satisfied if and is chosen as in Theorem 1.
Combining the three lemmas in Section A.1 with a union bound, we can obtain the following proposition.
Proposition A.1.
Let denote the dataset returned by Algorithm 2. With probability at least , , the number of distinct elements in does not exceed
[TABLE]
and for any ,
[TABLE]
By Proposition A.1, we can deduce the following proposition.
Proposition A.2.
For Algorithm 3, the following holds.
1. With probability at least ,
[TABLE]
where and .
2. There exists a function set s.t. and
[TABLE]
for some absolute constant when is sufficiently large.
Proof.
For the first part, we condition on the event defined in Proposition A.1. We only need to prove that , where is defined in Algorithm 3. For any , we have
[TABLE]
Therefore,
[TABLE]
This means for any , we have , which implies , i.e., . Similarly,
[TABLE]
So for any , we have , which implies , i.e., .
For the second part, since function is uniquely defined by , we only need to analyze the maximal number of different possible function classes . When or the number of distinct elements in is larger than
[TABLE]
and thus . Otherwise, is determined by and . Since , the number of different does not exceed . Moreover, since there are at most
[TABLE]
distinct elements in , where and each element belongs to , the number of different is upper bounded by
[TABLE]
∎
Appendix B Analysis of the main algorithm
Now we start to prove the regret bound of Algorithm 1. The following lemma provides a bound on the estimation of a single backup.
Lemma 4 (Single step optimization error).
Consider a fixed epoch . We define
[TABLE]
as in Algorithm 1. Also, for any function , we define
[TABLE]
and
[TABLE]
Then, for any function and , there exists an event where , s.t. conditioned on , for any with , we have
[TABLE]
for some constant .
Proof.
For any , we define
[TABLE]
and now we consider a fixed . For any , define
[TABLE]
Also, for any , define as the filtration induced by
[TABLE]
Then we have and . Applying Lemma 10 of Kirschner and Krause [2018] by setting , we can obtain that with probability at least ,
[TABLE]
Applying a union bound of over all , we can further obtain that with probability at least ,
[TABLE]
holds for all .
Let denote the above event, and for the rest of the proof, we condition on .
Now, for any , there exists a function , s.t. . Therefore,
[TABLE]
For any with , we can obtain that
[TABLE]
Furthermore, for any ,
[TABLE]
If we let , since , we have
[TABLE]
which implies
[TABLE]
for some constant . ∎
Lemma 5 (Confidence region).
In Algorithm 1, for , define confidence region
[TABLE]
Then with probability at least , for all ,
[TABLE]
given
[TABLE]
for some constant . Here, is given in Proposition A.2.
Proof.
By Proposition A.2, note that
[TABLE]
is a -cover of
[TABLE]
i.e., there exists , s.t. . Therefore,
[TABLE]
is a -cover of with .
Now, for each , let denote the event defined in Lemma 4. By union bound, In the rest of the proof, we condition on the event .
Since , and there exists s.t. , by Lemma 4, we have
[TABLE]
for some constant . Applying a union bound over all , we have that with probability at least ,
[TABLE]
∎
The above lemma proves that the confidence region contains with high probability, which implies that all the estimated $Q$-functions are optimistic with high probability as well. We formally state the conclusion in the next lemma.
Lemma 6 (Optimistic $Q$-function).
With probability at least ,
[TABLE]
for all and .
Proof.
Let be the confidence region as defined in Lemma 5. Let denote the event that
[TABLE]
By Lemma 5, . Let denote the event that
[TABLE]
By Proposition A.2 and union bound over all , . We condition on in the rest of the proof, which holds with failure probability at most .
By the definition of width function,
[TABLE]
Since , we have
[TABLE]
Therefore, for all
[TABLE]
Next, we start to prove by induction on . When , the inequality directly holds since . Now for any , assume . This also implies . Therefore, for any ,
[TABLE]
which completes the proof. ∎
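The mechanism behind this optimism argument can be sketched on a toy example. The code below assumes a finite confidence region of functions stored as dictionaries, a width function equal to the largest disagreement within the region, and an estimate formed by adding the width as a bonus and clipping at the horizon. The names `width`, `optimistic_value`, and all numbers are hypothetical and only illustrate the idea; they are not the bonus construction of Algorithm 3.

```python
def width(region, z):
    """Width of a finite confidence region at z: largest disagreement between two members."""
    values = [f[z] for f in region]
    return max(values) - min(values)

def optimistic_value(f_hat, region, z, horizon):
    """Estimate plus width bonus, clipped at the horizon H."""
    return min(f_hat[z] + width(region, z), horizon)

# Toy setup (all numbers hypothetical): functions are dicts over three state-action pairs.
H = 1.0
f_star = {"z1": 0.2, "z2": 0.7, "z3": 0.5}  # target function, bounded by H
region = [
    f_star,
    {"z1": 0.3, "z2": 0.6, "z3": 0.5},
    {"z1": 0.1, "z2": 0.9, "z3": 0.4},
]
f_hat = region[1]  # any member of the region may serve as the estimate
optimistic = {z: optimistic_value(f_hat, region, z, H) for z in f_star}
```

Whenever the target function lies in the region and is bounded by the horizon, the estimate plus width dominates it pointwise, which is the step the induction in the proof relies on.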
Now, we can decompose the regret and bound it by the summation of bonus functions.
Lemma 7 (Regret decomposition).
With probability at least ,
[TABLE]
Proof.
For any step , epoch and episode in epoch , define
[TABLE]
and define as the filtration induced by
[TABLE]
Then and . By Azuma-Hoeffding inequality, with probability at least ,
[TABLE]
In the rest of the proof, we condition on both this event and the event defined in Lemma 6, which also holds with probability at least .
Let denote the uniformly random policy adopted in the first epochs. By Lemma 6,
[TABLE]
For each and corresponding , we have
[TABLE]
Therefore,
[TABLE]
∎
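The concentration step in this proof uses the Azuma-Hoeffding inequality for a bounded martingale difference sequence. As a sanity check of the form of that bound, the sketch below simulates a hypothetical martingale whose increments depend on the past but remain mean-zero and bounded by `c`, and counts how often the sum leaves the radius `c * sqrt(2 n log(2/delta))`. The process and all parameters are assumptions chosen for the demo, not quantities from the paper.

```python
import math
import random

def azuma_radius(n, c, delta):
    """Radius t with P(|S_n| >= t) <= delta for n martingale differences bounded by c."""
    return c * math.sqrt(2.0 * n * math.log(2.0 / delta))

def simulate_sum(n, c, rng):
    """Sum of n increments in [-c, c] whose scale depends on the history, mean zero given the past."""
    total = 0.0
    for _ in range(n):
        scale = c if total >= 0 else c / 2.0  # history-dependent scale
        total += scale * rng.choice((-1.0, 1.0))
    return total

rng = random.Random(0)
n, c, delta = 1000, 1.0, 0.05
radius = azuma_radius(n, c, delta)
runs = 500
violations = sum(abs(simulate_sum(n, c, rng)) > radius for _ in range(runs))
violation_rate = violations / runs
```

The radius scales as the square root of the number of terms, which is the source of the square-root dependence on the total number of steps in the martingale part of the regret.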
To prove the main theorem, we also need the next lemma.
Lemma 8.
With probability at least , for all and any ,
[TABLE]
Proof.
We first fix any . Define dataset
[TABLE]
Now we fix any pair of distinct functions . Also, for any episode , let
[TABLE]
Therefore,
[TABLE]
Note that
[TABLE]
Also, by Definition 1,
[TABLE]
Therefore, by Hoeffding’s inequality,
[TABLE]
Since
[TABLE]
by setting , we can obtain that
[TABLE]
By a union bound over all such function pairs , this implies that with probability at least , for any ,
[TABLE]
Now we condition on the event above in the following part of the proof.
To simplify the notation, we denote
[TABLE]
For any pair of functions , there exists , s.t. and . When , we can directly obtain that
[TABLE]
So we only consider the case when . Then, we have
[TABLE]
Therefore,
[TABLE]
which means
[TABLE]
Finally, we complete the proof by directly applying a union bound over all .
∎
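The proof above combines Hoeffding's inequality for a fixed pair of functions with a union bound over all pairs. The following sketch shows how the resulting deviation radius depends on the number of functions, assuming i.i.d. samples with values in an interval of length `value_range` and a finite class of `num_functions` elements. The helper names and numbers are hypothetical, and the constants are those of the textbook Hoeffding bound rather than the ones in the lemma.

```python
import math

def hoeffding_radius(n, value_range, delta):
    """Radius t with P(|empirical mean - mean| >= t) <= delta for n i.i.d. bounded samples."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def uniform_radius(n, value_range, delta, num_functions):
    """Radius valid simultaneously for all unordered pairs of a finite class (union bound)."""
    num_pairs = num_functions * (num_functions - 1) // 2
    return hoeffding_radius(n, value_range, delta / num_pairs)

# Hypothetical numbers for illustration only.
single = hoeffding_radius(n=10000, value_range=1.0, delta=0.05)
small_class = uniform_radius(n=10000, value_range=1.0, delta=0.05, num_functions=10)
large_class = uniform_radius(n=10000, value_range=1.0, delta=0.05, num_functions=1000)
```

Because the class size enters only through a logarithm, going from 10 to 1000 functions enlarges the radius by a modest factor, which is why log-covering numbers appear in the final bound.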
Now we are ready to prove the main theorem.
Proof of Theorem 1.
We condition on the events defined in Lemmas 5, 6, 7 and 8. We also condition on the event in Proposition A.2 after applying a union bound over all . With probability at least , all the above events hold.
By Lemma 7, we have
[TABLE]
For any , we define
[TABLE]
where
[TABLE]
as defined in Algorithm 1. Let
[TABLE]
By Proposition A.2, . Then, for any episode ,
[TABLE]
Therefore,
[TABLE]
which implies
[TABLE]
Then, we can obtain that
[TABLE]
∎
Appendix C Proof of Proposition 3.1
In this section, we provide the proof of Proposition 3.1.
Proof of Proposition 3.1.
For linear settings, let , then by Definition 1,
[TABLE]
For sparse high-dimensional linear settings, let , then by Definition 1,
[TABLE]
∎
Appendix D Proof of Theorem 2
In this section, we provide the proof of Theorem 2 for model misspecification. First, we slightly modify Lemma 4 and reprove it in the model misspecification case.
Lemma 9 (Single step optimization error for misspecification).
Assume that our function class satisfies Assumption 3. Consider a fixed epoch . We define
[TABLE]
as in Algorithm 1. Also, for any function , we define
[TABLE]
and
[TABLE]
Then, for any function and , there exists an event where , s.t. conditioned on , for any with , we have
[TABLE]
for some constant .
Proof.
For any , we define
[TABLE]
and now we consider a fixed . Note that under Assumption 3, it does not necessarily hold that , but it can be ensured that
[TABLE]
For any , define
[TABLE]
By the same method as in Lemma 4, we can prove that with probability at least ,
[TABLE]
Let denote the above event, and for the rest of the proof, we condition on .
Similarly, by the same method as in Lemma 4, for any , we have
[TABLE]
For any with , we can obtain that
[TABLE]
Furthermore, again by the same method as in Lemma 4, we can obtain that for any ,
[TABLE]
If we let , we have
[TABLE]
Now let , then
[TABLE]
Therefore,
[TABLE]
which implies
[TABLE]
for some constant . ∎
Using the above lemma, we can obtain the following lemma similar to Lemma 5.
Lemma 10 (Confidence region for misspecification).
Assume that our function class satisfies Assumption 3. In Algorithm 1, for , define confidence region
[TABLE]
Then with probability at least , for all ,
[TABLE]
given
[TABLE]
for some constant . Here, is given in Proposition A.2.
Proof.
The proof is almost identical to that of Lemma 5. ∎
Proof of Theorem 2.
By Lemma 10, Lemma 6, Lemma 7, Lemma 8, the proof is almost the same as the proof of Theorem 1. ∎
References
- Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, volume 11, pages 2312–2320, 2011.
- Agarwal et al. [2020] Alekh Agarwal, Sham Kakade, and Lin F Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR, 2020.
- Agrawal and Jia [2017] Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. arXiv preprint arXiv:1705.07041, 2017.
- Auer [2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
- Azar et al. [2013] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
- Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- Bradtke and Barto [1996] Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
