Reward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling Laws
Kush Bhatia, Wenshuo Guo, Jacob Steinhardt

TL;DR
This paper develops a theoretical framework for reward learning using nonparametric models, deriving optimal experiment design strategies and scaling laws, with applications to Gaussian process bandit optimization.
Contribution
It introduces a nonparametric reward learning framework with optimal query design and provides new risk bounds and scaling laws, improving upon previous results.
Findings
Derived non-asymptotic excess risk bounds for nonparametric reward estimation.
Optimized query design based on risk bounds for improved sample efficiency.
Unified framework encompassing Gaussian process bandit optimization with improved guarantees.
Kush Bhatia (Stanford University), Wenshuo Guo (UC Berkeley), Jacob Steinhardt (UC Berkeley)
Abstract
Specifying reward functions for complex tasks like object manipulation or driving is challenging to do by hand. Reward learning seeks to address this by learning a reward model using human feedback on selected query policies. This shifts the burden of reward specification to the optimal design of the queries. We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem. Our framework models rewards and policies as nonparametric functions belonging to subsets of Reproducing Kernel Hilbert Spaces (RKHSs). The learner receives (noisy) oracle access to a true reward and must output a policy that performs well under the true reward. For this setting, we first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression. We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set and obtain a finite sample statistical rate, which depends primarily on the eigenvalue spectrum of a certain linear operator on the RKHSs. Despite the generality of these results, our bounds are stronger than previous bounds developed for more specialized problems. We specifically show that the well-studied problem of Gaussian process (GP) bandit optimization is a special case of our framework, and that our bounds either improve or are competitive with known regret guarantees for the Matérn kernel.
1 Introduction
Specifying the reward function accurately for a desired objective, or reward engineering, is challenging to perform by hand, as the consequences of even small errors can be drastic (Hadfield-Menell et al., 2017). To address this, reward learning seeks to learn a predictive model of the reward function from data, which is obtained from carefully selected queries to human annotators. The learned reward model is then used as the optimization objective for policy learning. Reward learning has achieved significant empirical success in domains such as text summarization (Stiennon et al., 2020; Böhm et al., 2019), robot locomotion (Daniel et al., 2014), predicting driving styles (Kuderer et al., 2015), and Atari game playing (Christiano et al., 2017).
Despite their success, reward learning methods still lack theoretical grounding. Moreover, their behavior can be brittle even on simple tasks, due to the difficulty of choosing appropriate queries and due to feedback loops from adaptive querying (Freire et al., 2020). Indeed, an ablation study in Christiano et al. (2017) suggests that random queries can outperform or be competitive with adaptive query procedures. To address these issues, we provide a theoretical framework for analyzing reward learning, framing it as a doubly nonparametric experimental design problem. This framework helps elucidate the role of query selection (Chaloner and Verdinelli, 1995) and also enables us to derive scaling laws—how the sizes of the policy and reward models affect the query complexity—for reward learning (Kaplan et al., 2020).
Proposed framework.
In our framework, we suppose we are given a reward class and a policy class. Our goal is to find a policy that performs well according to an unknown true reward. To do this, we query a set of policies, observing noisy estimates of their true reward, and use this information to choose the eventual policy.
To be compatible with modern nonparametric learning methods (e.g., neural nets), we view the reward and policy classes as subsets of Reproducing Kernel Hilbert Spaces (RKHSs). A salient feature of our proposed framework is that the learner therefore optimizes a nonparametric reward function over a nonparametric space of policies, making the task “doubly” nonparametric. In contrast, previous work considers a nonparametric function class or reward class, but typically not both. For instance, nonparametric zeroth-order or bandit optimization (Srinivas et al., 2010; Mockus, 2012; Wang et al., 2018) considers a nonparametric function on a finite-dimensional input space. Conversely, nonparametric supervised learning (Wahba, 1990; Hofmann et al., 2008) minimizes a known loss function over a nonparametric input space.
The doubly nonparametric nature of our task poses new challenges. The (possibly) infinite-dimensional RKHS requires the learner to select which subspace to explore given a finite number of queries. Furthermore, the unknown reward function makes it challenging for the learner to reason about the information gained from the selected query policies. We address these challenges by deriving a risk upper bound for a family of plug-in estimators based on ridge regression, and then optimizing this bound to solve the optimal design task. Our results show that the quality of the output policy depends on how well the query set is aligned with the eigenfunctions of the policy space.
In addition to the optimal design problem, our framework allows us to study scaling laws with respect to the reward (or policy) class by varying the rate of decay of their corresponding eigenspectrum. This decay rate determines the effective dimensionality of an RKHS (Zhang, 2002), and provides a natural proxy for varying the size of the reward or policy class. Qualitatively, our main results show that the excess risk asymptotically vanishes as long as the policy class grows at a slower rate relative to the reward class.
Sharpness of analysis. Our risk bounds apply to reward and policy classes of arbitrary or even infinite dimensionality. Despite this generality, we show they provide stronger guarantees than previous bounds for the specialized settings of compact policy sets and kernel multi-armed bandits.
In Section 4.3, we look at a special case of our problem when the policy set is a compact subspace and thus has finite rank. For these instances, we show that our learning algorithm obtains a better excess risk versus a rate of obtained by the adaptive GP-UCB algorithm (Srinivas et al., 2010), where is a power law decay rate.
In Section 5, we specialize our general results to the well-studied problem of Gaussian process bandit optimization (Williams and Rasmussen, 2006), also known as kernel multi-armed bandit (MAB). Specifically, for the class of Matérn kernels with parameter in dimensions, we show that our algorithm achieves a regret bound of which is strictly better than those achieved by the GP-UCB and GP-Thompson Sampling (GP-TS) (Chowdhury and Gopalan, 2017) algorithms and comparable with π-GP-UCB (Janz et al., 2020) and supKernelUCB (Valko et al., 2013; Vakili et al., 2021); see Table 1 for details. GP-UCB and GP-TS only yield sub-linear regret bounds when the smoothness of the kernel is large enough relative to the dimension; thus in high dimensions, these bounds essentially become vacuous. The π-GP-UCB algorithm was designed specifically to overcome this issue. Our proposed algorithm achieves sublinear regret for all .
Our Contributions. We propose doubly-nonparametric bandits as a framework for theoretically studying the reward learning problem. Within this framework, we obtain finite sample risk bounds for a ridge regression based plug-in estimator and derive scaling laws for reward learning. From a technical standpoint, we study the optimal design problem for our estimator to select informative query points by showing that the excess risk depends only on the spectral properties of a certain operator of the two RKHSs and the empirical covariance matrix. As a corollary of our risk bounds, we provide sharper regret bounds for a class of kernel MAB problems compared to several existing algorithms, showing that the doubly-nonparametric lens of reward learning is fruitful even for “singly-nonparametric” tasks. To obtain these bounds, our reduction carefully constructs two different RKHSs to embed the input space and reward function into a policy and reward class.
2 Framework: Doubly nonparametric Bandits
Our framework considers non-parametric policy learning with non-parametric reward models. We let denote an arbitrary policy and denote an arbitrary reward function, where and are Reproducing Kernel Hilbert Spaces. For technical reasons, we assume the corresponding kernel functions and both satisfy the Hilbert-Schmidt condition (see Appendix A for details).
We let denote the reward obtained by selecting policy under reward function and consider the case where the evaluation functional is linear in both and . In other words, where is a known linear mapping from the policy space to the reward space. Since and may be infinite-dimensional, linearity is only a weak restriction–e.g. the map is linear in for any RKHS.
To incorporate problem structure, we let denote the true reward function and assume that for some known set such that . We further assume that policies are restricted to lie in some which is a subset of the unit ball in (for instance, might incorporate physical constraints on implementable policies). Thus, given the true reward , the optimal policy (for a compact ) is . This proposed framework, which allows for infinite-dimensional policy as well as reward classes, allows us to study how both the policy and reward space affect the difficulty of learning.
Query access to reward . The true reward function is unknown to the learner but is accessible via queries to an oracle (e.g. a human expert), which provide noisy zeroth-order (or bandit) evaluations of the reward . When queried with a policy , the oracle provides a response
[TABLE]
with denoting the variance of the response. There are two possible query models: passive queries (Atkinson, 1996; Sebastiani and Wynn, 2000), where the learner selects all queries at the same time, and active queries (Bubeck et al., 2011; Lattimore and Szepesvári, 2020), where the learner is allowed to select queries sequentially. Our focus in this work will be on the passive query model, but in many cases we will outperform existing active query algorithms.
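As a rough illustration of the passive query model, the sketch below simulates a noisy zeroth-order oracle in a finite-dimensional stand-in for the two RKHSs. It assumes the bilinear evaluation described above, with the reward of a policy computed as the inner product of the reward vector with the mapped policy, and Gaussian noise. The names `make_oracle`, `f_star`, `M`, and `noise_std` are hypothetical and do not come from the paper.

```python
import numpy as np

def make_oracle(f_star, M, noise_std, rng):
    """Return a callable giving noisy evaluations of <f_star, M @ pi>.

    Assumption: finite-dimensional vectors stand in for RKHS elements,
    and the noise is Gaussian with standard deviation noise_std.
    """
    def oracle(pi):
        return float(f_star @ (M @ pi)) + noise_std * rng.standard_normal()
    return oracle

rng = np.random.default_rng(0)
d = 5
M = np.eye(d)                      # hypothetical known linear map
f_star = rng.standard_normal(d)    # hypothetical true reward
f_star /= np.linalg.norm(f_star)
oracle = make_oracle(f_star, M, noise_std=0.1, rng=rng)

# Passive model: the whole query set is fixed before any response is seen.
queries = [np.eye(d)[j] for j in range(d)]
responses = np.array([oracle(pi) for pi in queries])
```

In an active model the next query could depend on earlier responses; here the list `queries` is built up front, which is the only distinction this sketch is meant to show.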
Problem statement. Given passive access to the oracle , the objective of the learner is to output a policy that has small excess risk , defined as
[TABLE]
We think of queries to the oracle as expensive, and are interested in achieving low excess risk with as few queries as possible. This notion of excess risk is also studied under the name simple regret in pure exploration bandit problems (Lattimore and Szepesvári, 2020).
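To make the excess risk concrete, here is a minimal sketch for the simplest case one could consider: a finite-dimensional problem in which the policy set is assumed to be the Euclidean unit ball. Under that assumption the best policy for a linear reward is the normalized vector obtained by applying the transpose of the map to the reward. The function names are hypothetical, and the Euclidean ball is a simplification of the RKHS ellipsoids used in the paper.

```python
import numpy as np

def best_policy_unit_ball(f, M):
    """Maximizer of <f, M @ pi> over the Euclidean unit ball (assumed policy set)."""
    g = M.T @ f
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g

def excess_risk(pi_hat, f_star, M):
    """Gap between the best achievable reward and the reward of pi_hat."""
    pi_opt = best_policy_unit_ball(f_star, M)
    return float(f_star @ (M @ pi_opt) - f_star @ (M @ pi_hat))

f_star = np.array([3.0, 4.0])   # hypothetical true reward
M = np.eye(2)                   # hypothetical map
risk_of_zero_policy = excess_risk(np.zeros(2), f_star, M)
risk_of_best_policy = excess_risk(best_policy_unit_ball(f_star, M), f_star, M)
```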
Representations in $\ell_2(\mathbb{N})$. By Mercer’s theorem, we can represent any RKHS as a subset of $\ell_2(\mathbb{N})$. Formally, the policy and the reward spaces are isomorphic to the ellipsoids
[TABLE]
for appropriately chosen eigenfunctions and corresponding eigenvalues of the two spaces (Wainwright, 2019). These are defined with respect to a base measure over the input domain; see Appendix A for details. With a slight abuse of notation, going forward, we will use the symbols for a policy and a reward to denote the corresponding coefficient sequences in the expansion above. (Footnote 1: While the eigenfunctions of the two spaces can be different, this representation can still be used by modifying the map appropriately. This is detailed in Appendix A.) With this, the inner products associated with the two spaces simplify to
[TABLE]
Also let and be diagonal matrices comprising the inverse of the eigenvalues of and . With this notation, if we view the map as an (infinite-dimensional) matrix, its Hermitian adjoint is equal to . (Footnote 2: Recall the Hermitian adjoint of satisfies .)
In order for the evaluation functional to be finite for all , the operator norm must be bounded (see Appendix A). We will see later that the decay of this operator’s singular values is closely related to the difficulty of learning in our setting.
3 Algorithm: Policy Learning via Reward Learning
Given the setup above, we now describe a meta-algorithm, policy learning via reward learning (Algorithm 1), for the non-parametric policy learning problem. The algorithm is a three-stage procedure: it (i) selects a subset of policies to query for reward feedback, (ii) uses the responses to learn a reward estimate, and (iii) optimizes this learnt estimate to output the final policy, that is, the policy in the policy set that maximizes the estimated reward. Such general plug-in procedures have been studied in the statistics (Van der Vaart, 2000) and the machine learning (Devroye et al., 2013) literature. We analyze the excess risk of this estimator for our doubly-nonparametric setup and use this risk bound to select our query set. We now discuss the two key design choices in our algorithm: the choice of the reward estimation procedure as well as the choice of query set.
Reward learning via ridge regression.
We estimate the reward via ridge regression in the RKHS (Friedman et al., 2001; Shawe-Taylor et al., 2004). Suppose that in the first step of the algorithm, we have already queried the oracle on a set of policies and collected the resulting query-response pairs. For a given regularization parameter, the ridge regression estimate of the reward function is
[TABLE]
The regularization parameter, which is usually set as a function of the number of queries, controls the bias-variance trade-off in estimating the reward: smaller values reduce bias while larger values help reduce variance.
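The ridge step can be sketched in a few lines under simplifying assumptions. The version below works with finite-dimensional vectors, treats the mapped policy as the regression feature, and penalizes the plain Euclidean norm of the reward vector. The paper's estimator regularizes in the RKHS norm and may normalize the loss differently, so this is an illustration of the idea rather than the exact estimator. The name `ridge_reward_estimate` is hypothetical.

```python
import numpy as np

def ridge_reward_estimate(queries, responses, M, lam):
    """Ridge estimate of the reward vector from query-response pairs.

    queries:   array of shape (n, d_pi), one queried policy per row
    responses: array of shape (n,), noisy rewards of those policies
    M:         array of shape (d_f, d_pi), the known linear map
    lam:       regularization strength (Euclidean penalty, an assumption)
    """
    X = queries @ M.T                      # row i is the feature M @ pi_i
    d_f = X.shape[1]
    A = X.T @ X + lam * np.eye(d_f)        # regularized covariance
    return np.linalg.solve(A, X.T @ responses)

# Noiseless check on a toy problem: querying each coordinate once
# recovers the reward up to the shrinkage factor 1 / (1 + lam).
d, lam = 4, 0.01
M = np.eye(d)
f_star = np.array([1.0, -2.0, 0.5, 0.0])
queries = np.eye(d)
responses = queries @ M.T @ f_star
f_hat = ridge_reward_estimate(queries, responses, M, lam)
```

The shrinkage visible in the toy check is the bias introduced by the penalty; with noisy responses, a larger `lam` would damp the noise at the cost of more shrinkage.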
Excess risk bound for fixed query set.
Observe that the plug-in estimator is implicitly a function of the query set. Ideally, we want to choose the set which minimizes the expected risk of the plug-in estimator. This requires us to solve the optimization problem
[TABLE]
However, solving the above precisely requires knowledge about the underlying reward function, and the combinatorial nature of the optimization problem makes it hard to find an exact solution. To address this, we first upper bound the excess risk of the plug-in policy in terms of the query set. The following theorem bounds the excess risk in terms of the spectrum of the policy and reward spaces, as well as the covariance matrix of the queried policies. (Footnote 3: Throughout the paper, for clarity, we use a single symbol for a universal constant whose value changes across lines. All our proofs in the appendices explicitly track this constant.)
**Theorem 1 (Excess risk of plug-in).**
For any query set consisting of policies and regularization parameter , the excess risk of the plug-in estimator is upper bounded as
[TABLE]
In addition, letting , the expected squared distance is equal to
[TABLE]
The proof follows a standard analysis of ridge regression and is deferred to Appendix B. Observe that in the above theorem, the query set participates in the excess risk only via the covariance of the queried policies. The risk bound is the sum of two terms: the first corresponding to the bias and the second corresponding to the variance. In both these terms, the covariance enters only through its interaction with the map; thus query sets which induce a larger correlation with the map will generally have lower excess risk. Choices of queries which are orthogonal to the right singular vectors of the map will have a constant excess risk, since for those directions the corresponding matrix product vanishes.
As shown later in the appendix, in the special case when the policy set consists of the entire unit ball , the excess risk bound can be improved by a quadratic factor
[TABLE]
Such an improvement in the excess risk when the underlying query set is the complete unit ball in a finite-dimensional space was also observed by Rusmevichientong and Tsitsiklis (2010). However, the gains in the specific linearly parameterized bandit setup that they considered were logarithmic in nature, as compared to our quadratic ones.
4 Query selection and statistical guarantees
We now show how to select the query set effectively and study the excess risk of the corresponding plug-in estimator obtained via this query set. We will start with the special case where the policy set is the unit ball in and the map is diagonal, and then generalize to arbitrary policy sets. In both cases, low excess risk can be achieved by repeatedly querying (approximations of) the projections of top eigenvectors of onto the space. For the special case when the map is diagonal, this reduces to querying the top eigenvectors of .
The excess risk will ultimately depend on the eigenspectrum of the operator , which is similar to the operator . Additionally, to interpret our results, we instantiate them for a power law spectrum with exponent , that is,
[TABLE]
where corresponds to the singular value of the corresponding operator.
4.1 Warm-up: policy set = unit ball, map = diagonal
In order to get some intuition, we study the special case where the policy set consists of the entire unit ball in the space and the map is diagonal with . Further, let us denote the operator .
For this special case, our sampling algorithm (Algorithm 2) simply selects the top eigenvectors of the space to query, for some value which depends on the decay exponent . To see why, observe that for a diagonal map , the right singular vectors of the operator are the same as the eigenvectors of the policy space . Therefore, the choice of policy in our algorithm is simply the scaled eigenfunction . Having selected these queries, the algorithm queries each one of the times and uses this as query set .
The intuition for this choice of query set is that since we are in the passive setup with no knowledge of , any policy can be an optimal policy. By querying the top ones out of these, we can obtain a good enough approximation to the performance of any policy in the unit ball. The particular choice of the parameter depends on the number of queries available. Since the oracle responses are noisy, to reduce variance in the responses along those directions, our algorithm performs multiple queries along the same direction.
If we further consider the special case when the policies and rewards correspond to the unit balls in the finite dimensional spaces and respectively, our choice of query set queries the directions , each for number of times. Intuitively, this strategy works well because without any prior over the unknown reward function, the optimal strategy in the passive setup is to explore all directions equally and this is precisely our set of chosen queries. This simple query strategy enjoys the following excess risk bound.
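The equal-exploration strategy described above can be sketched end to end in finite dimensions. The code below assumes coordinate directions play the role of the top eigenvectors, an identity map, a Euclidean unit-ball policy set, and Gaussian noise. The truncation level `k`, the reward vector, and all function names are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

def design_queries(d_pi, k, n):
    """Repeat each of the first k coordinate directions n // k times."""
    reps = n // k
    return np.repeat(np.eye(d_pi)[:k], reps, axis=0)   # shape (k * reps, d_pi)

def plug_in_policy(f_hat, M):
    """Best unit-norm policy for the estimated reward (Euclidean ball assumed)."""
    g = M.T @ f_hat
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g

rng = np.random.default_rng(1)
d, k, n, lam, sigma = 20, 5, 5000, 1e-3, 0.5
M = np.eye(d)
# Hypothetical true reward with most of its mass on the leading coordinates.
f_star = 1.0 / np.arange(1, d + 1) ** 2
f_star /= np.linalg.norm(f_star)

# Stage (i): fixed query set; stage (ii): ridge estimate; stage (iii): plug-in policy.
Q = design_queries(d, k, n)
X = Q @ M.T
y = X @ f_star + sigma * rng.standard_normal(Q.shape[0])
f_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
pi_hat = plug_in_policy(f_hat, M)
risk = np.linalg.norm(M.T @ f_star) - f_star @ (M @ pi_hat)
```

Directions beyond the first `k` are never queried, so their estimated coefficients stay at zero; the resulting risk reflects both the noise in the explored directions and the reward mass left in the unexplored ones, which is the trade-off the exploration parameter controls.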
**Proposition 1 (Risk bound when the policy set is the unit ball).**
For any and regularization parameter , consider the plug-in estimator obtained via the passive sampling algorithm which explores the first eigenfunctions of . The excess risk satisfies
[TABLE]
where the quantity and is some universal constant.
We defer the proof of the above proposition to Appendix B. The choice of the exploration parameter allows us to trade-off between the two terms inside the maximum. Typically, the second term will be maximized at . For the first term, the supremum depends on the choice of — for small values of , the sup is achieved at while for larger values, it is achieved at . In order to gain more intuition about this bound, we instantiate this for the power law decay.
**Corollary 1 (Risk bound for power-law decay).**
Suppose that eigenvalues of the policy space decay as , reward space as and the singular values of map as . This satisfies the power law assumption with exponent . The plug-in estimator with exploration parameter and regularization satisfies
[TABLE]
The proof of the corollary upper bounds the risk bound with the specific choices of and . The above bound shows that our algorithm can learn in the framework as long as or equivalently , with better rates for larger values of . Thus, for a fixed size of reward class , the learning rate improves as the policy class grows smaller ( increases) – this is intuitive since we are required to search over a smaller policy space. On the other hand, for a fixed policy class , our excess risk rate gets better as the reward class grows in size ( increases) – this is because a larger set of reward functions have similar optimal policies and hence learning gets easier.
4.2 General policy sets
We now describe our choice of query sets for general policy sets . Our strategy, described in Algorithm 2, differs from the above special case in that we need to take into account the interaction of the policy space with the map . Specifically, we show in Appendix B that the upper bound in Theorem 1 can be diagonalized for this general case via a transformation.
Let us denote the operator . Our transformation reveals that the relevant directions to query for this general case corresponds to the columns of where , then are the eigenvectors of the self-adjoint operator – and it is precisely a subset of these directions that our algorithm queries.
In order to be able to query these policies, we require the set to contain some policies which align well with them. We formally state this regularity assumption below.
**Assumption 1 (Regularity assumption on the policy set).**
For any eigenfunction of the operator , consider the policy . There exists a policy in policy set such that for some constant , we have .
The above assumption requires that for every choice of the policy in Algorithm 2, the policy set contains another policy which is collinear with it. This assumption can be relaxed in various ways (for instance via convexification) but we omit this as it is not needed for our results. Given this assumption, the following theorem, a generalization of Proposition 1, provides a bound on the excess risk for the plug-in estimate for general policy sets.
**Theorem 2 (Risk bound for general policy sets).**
For any , regularization parameter and set satisfying Assumption 1, let be the estimator output by Algorithm 1. The squared excess risk satisfies
[TABLE]
where the values correspond to the eigenvalues of the operator with .
We defer the proof of this theorem to Appendix B. The proof of this theorem goes via a transformation which diagonalizes the excess risk bound and reduces the problem to a similar setup as that of Proposition 1. Additionally, Assumption 1 allows us to generalize the results to arbitrary policy sets . Note that the above upper bounds the square of the excess risk. As discussed in Section 3, one can obtain a quadratic improvement in this rate if the set is the entire unit ball in . We specialize the above bound for the power law decay assumption in the following corollary.
**Corollary 2 (Risk bound for power-law decay).**
Suppose that the eigenspectrum of the operator satisfies the power law assumption with exponent , that is, . The plug-in estimator with parameter and regularization satisfies
[TABLE]
for some universal constant .
The above bound indicates that for the general case, learning is possible if the spectrum decay has parameter . To get such a spectrum decay with the operator defined in the above corollary, one sufficient condition is that the map does not flip the larger eigenvectors of towards the smaller eigenvectors of , that is, the map preserves the ordering of the eigenvectors of when transformed to the space . Such a misaligned scenario would require learning a very accurate representation of the reward to learn a good policy and will make learning harder. It is worth highlighting that while we discuss our bounds with such a power law assumption on the relevant eigenvalues, one can also obtain similar rates for singular values with exponential decay, by optimizing the value of to trade off the bias and variance terms.
4.3 Comparison with UCB-style adaptive algorithms
We next turn to evaluating the sharpness of Theorem 2. Existing frameworks for studying “singly”-nonparametric setups require the input domain to be compact. In our doubly-nonparametric setup, the input space is the policy set, which is often non-compact (e.g., the unit ball is not compact in infinite dimensions). We address this for singly-nonparametric algorithms by taking a finite-dimensional approximation.
Even though our proposed method is passive, it achieves better rates than well-known adaptive sampling algorithms. Specifically, in the power law setting of Section 4.1, the analysis of the GP-UCB algorithm (Srinivas et al., 2010) provides a rate of , which is strictly worse than the rate obtained by our analysis in Corollary 1. We refer the reader to Proposition 2 in Appendix D for an exact statement. The proof adapts the analysis from Srinivas et al. (2010), which hinges on a quantity called the information gain, which we bound for our setup. While we are comparing upper bounds for the two algorithms, we believe that our improved bound is due to a better algorithm and not an analysis gap. While we expect adaptive algorithms to perform better than passive ones in general (Lattimore and Hao, 2021), UCB-style algorithms require the construction of confidence intervals around input points, which crucially dictate the regret bounds of such algorithms. In the frequentist setup, the best known such bounds (Vakili et al., 2021) are known to yield suboptimal regret rates and it is an open question as to whether these can be improved.
5 Bounds for kernel multi-armed bandits
In the previous subsection, we saw that our passive sampling algorithm actually outperforms existing adaptive sampling algorithms for the reward learning task we care about. Here we take this a step further—we specialize our algorithm to the case of kernel MABs, and show that it outperforms standard algorithms for that setting and is competitive with a specialized algorithm for Matérn kernels.
We consider the task of maximizing an unknown function over its domain . In the kernel multi-armed bandit (MAB) setup, this unknown function belongs to an RKHS , equipped with a positive-definite kernel , such that . (Footnote 4: We require that the kernel be a Mercer kernel satisfying for all .) Let us further restrict our attention to the space of input points . The learner is allowed to access this function via a noisy zeroth-order oracle
[TABLE]
Going forward we will assume that . The above oracle is similar to the reward oracle , except that the query points belong to a finite dimensional space and is a non-linear function of the query point . The goal in MAB is to minimize the -step regret
[TABLE]
where is the datapoint queried in the round. There have been several algorithms proposed to solve this problem including general purpose UCB algorithms (Srinivas et al., 2010; Chowdhury and Gopalan, 2017), Thompson sampling approaches (Chowdhury and Gopalan, 2017), and special-purpose algorithms for specific kernels (Janz et al., 2020).
We next show that kernel MAB can be cast as a special case of our non-parametric policy learning framework. The resulting regret bounds, derived from an application of Theorem 3, are better than several general purpose algorithms (GP-UCB, IGP-UCB, GP-TS) and comparable to those specialized for the Matérn kernel (π-GP-UCB) and SupKernelUCB.
In order to reduce kernel MAB to our framework, we need to introduce three elements – the policy space , the reward space and the map . We would like spaces and such that (1) the resulting objective is linear in this space, (2) the resulting rewards and policies have unit norm in their respective space, and (3) we have a good understanding of the eigenvalues of the resulting operator. This last point ensures that we can employ our upper bounds from Section 4.
Before we define these, we let denote an ε-net of the input space under the norm and denote its size by . We define the kernel matrix on points selected in the cover as for all .
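The cover and its kernel matrix can be sketched as follows. This example makes several assumptions that the text above does not fix: the input space is taken to be the unit cube, the cover is a uniform grid measured in the sup-norm, and the kernel is the Matérn kernel with smoothness 3/2, whose closed form is standard. The helper names `grid_cover` and `matern32` are hypothetical.

```python
import itertools
import numpy as np

def grid_cover(d, eps):
    """Uniform grid on [0, 1]^d: every point of the cube is within eps
    (in sup-norm) of some grid point. The cube and the norm are assumptions."""
    m = int(np.ceil(1.0 / (2.0 * eps)))      # cells per axis
    centers = (np.arange(m) + 0.5) / m       # cell midpoints
    return np.array(list(itertools.product(centers, repeat=d)))

def matern32(X, Y, length_scale=1.0):
    """Matern kernel with smoothness 3/2: (1 + s) * exp(-s), s = sqrt(3) r / l."""
    r = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    s = np.sqrt(3.0) * r / length_scale
    return (1.0 + s) * np.exp(-s)

eps = 0.1
cover = grid_cover(d=2, eps=eps)
K = matern32(cover, cover)     # kernel matrix on the points of the cover
```

The grid size grows as a power of the dimension, which is why the covering number enters the regret bound discussed later in this section.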
Reward space . Given the RKHS as well as the elements of the cover , we view the reward function as a map from to , or equivalently as a vector in . More precisely, letting denote the vector of evaluations of a function , we define
[TABLE]
With this notation, we define the true reward .
Policy Space . Similarly to rewards, we will embed policies in . For any point , let denote the corresponding vector in obtained by evaluating the kernel over the cover. Then, the space
[TABLE]
The choice of the above norm ensures that
[TABLE]
Thus in particular, contains an orthonormal embedding of the set of vectors .
Map . Both the reward space and policy space can be associated with . Under this transformation, the evaluation for any corresponds to the standard inner product with
[TABLE]
This indicates that we should take the map to be the identity. Furthermore, as a simple application of Mercer’s theorem it follows that this map is a bounded linear operator.
We make an additional assumption on the kernel function , requiring it to be Lipschitz in its input arguments. This assumption is often satisfied, in particular for the Matérn kernel when .
**Assumption 2 (Lipschitz kernel).**
The Kernel associated with the Hilbert space is -Lipschitz with respect to the -norm for some :
[TABLE]
Furthermore, the kernel satisfies for all points .
Applying Theorem 2 under the above assumption, we obtain the following excess risk bound for the plug-in estimator evaluated on the unknown function .
**Theorem 3 (Excess risk for Kernel MAB).**
Suppose that the eigenvalues of a -Lipschitz kernel satisfy the power-law decay . Let be the output of Algorithm 1 using queries to the oracle . Then, for any value of and , the excess risk satisfies
[TABLE]
with probability at least .
For Matérn kernels, it is known that the eigenvalues decay with parameter (Janz et al., 2020). Substituting this along with a bound on the covering number , we obtain the following corollary.
**Corollary 3 (Regret bound for Matérn Kernel).**
Consider the family of Matérn kernels with parameter defined with the Euclidean norm over $\mathbb{R}^d$. The -step regret of our algorithm is
[TABLE]
We defer all the proofs as well as a detailed introduction to the family of Matérn kernels to Appendix C. Note that the above bound is for regret, which is an online notion, while our previous results are offline notions. We get from one to the other using a standard batch-to-online conversion bound based on an explore-then-commit strategy. Table 1 compares the above bound to the existing literature.
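The explore-then-commit conversion mentioned above can be illustrated with a short generic calculation. It assumes an excess risk bound of the form $C n^{-a}$ after $n$ exploration queries and a per-round regret of at most $B$ during exploration. The symbols $B$, $C$, and $a$ are placeholders introduced here for illustration, rounding $n$ to an integer is ignored, and the calculation does not reproduce the exponents of Corollary 3.

```latex
% Illustrative assumptions: n of the T rounds are spent exploring, each costing at most B;
% the committed point then has excess risk at most C n^{-a} in every remaining round.
\begin{align*}
R_T &\le B\,n + (T - n)\,C\,n^{-a} \;\le\; B\,n + C\,T\,n^{-a}, \\
n = T^{1/(1+a)} \;&\Longrightarrow\; R_T \le (B + C)\,T^{1/(1+a)} .
\end{align*}
```

Under these assumptions the regret is sublinear in $T$ for any $a > 0$, and a faster offline rate (larger $a$) translates directly into a smaller regret exponent.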
6 Experimental evaluation
We experimentally evaluate our algorithm via a simulation study. We use these experiments to establish the dimension-free nature of our results as well as to conjecture the optimality of our bounds.
Setup. In the simulation study, we work with dimensional RKHSs and . In order to simulate the nonparametric regime, we typically use values of that are smaller than, or at most a constant times, the dimension . We set the matrices , and the map . With this, the effective decay parameter . We further sampled the oracle noise . All plots were averaged over 10 runs.
Observations. Figure 1(a) shows the variation of excess risk as the number of queries is varied from to on a log-log plot. Our bounds in Corollary 1 for this setup predict that the excess risk should decay at a rate . Fitting a straight line through the plot, we found the observed risk to vary as . This plot suggests that our theoretical upper bounds might be tight in a minimax sense over choices of decay parameter . In Figure 1(b), we plot the excess risk as we vary the dimension from to for four different choices of sample size, again on a log-log scale. Increasing the number of queries decreases the excess risk consistently for all dimensions. The risk curves tend to asymptote at different error levels for different values of . This corroborates our theoretical findings that our proposed algorithm provides non-vacuous bounds for the doubly-nonparametric setup when .
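The line fit described above can be sketched as an ordinary least-squares regression of log-risk on log-queries. The snippet below is only an illustration: the function name `loglog_slope` and the synthetic data (a risk curve of the form 3 n^(-0.25)) are assumptions, not the values used in the paper's experiments.

```python
import math

def loglog_slope(ns, risks):
    """Least-squares slope and intercept of log(risk) regressed on log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(r) for r in risks]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Synthetic, noiseless data (hypothetical): risk = 3 * n^(-0.25).
ns = [100, 200, 400, 800, 1600]
risks = [3.0 * n ** -0.25 for n in ns]
slope, intercept = loglog_slope(ns, risks)
```

The fitted slope estimates the exponent of the decay rate, and the exponential of the intercept estimates the leading constant.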
7 Discussion
In this work, we proposed a new theoretical framework, Doubly Nonparametric Bandits, for studying the reward learning problem. We derived non-asymptotic bounds on the excess risk of a ridge regression based plug-in estimator and showed how the well studied GP bandit optimization problem can be cast as a special case of our rich framework. Our current analysis relies on a regularity assumption on the policy space ; can we obtain bounds on the excess risk in the absence of this assumption?
Going forward, it would be interesting to study the closed loop dynamics between the reward and the policy learning algorithm when the learner actively queries for feedback.
Acknowledgements
We are grateful to Erik Jones for providing feedback on an early draft of the work. We would like to thank members of the Steinhardt group and InterACT lab for helpful discussions. KB was supported in part by a grant from Long-Term Future Fund (LTFF).
Appendix A Technical details for proposed framework
A.1 RKHS assumption
The Hilbert spaces and are Reproducing Kernel Hilbert Spaces defined by kernel functions respectively defined over a compact instance space . Further, the kernels and satisfy the Hilbert-Schmidt condition
[TABLE]
for some distribution over space . Mercer’s theorem (Mercer, 1909) implies that such kernel functions have an associated set of eigenfunctions (with corresponding eigenvalues) that form an orthonormal basis for . We restate a version of this theorem below (Wainwright, 2019).
Theorem 4** (Mercer’s theorem).**
Suppose that the space is compact and the positive semi-definite kernel satisfies the Hilbert-Schmidt condition (18). Then there exists a sequence of eigenfunctions that form an orthonormal basis of and non-negative eigenvalues such that
[TABLE]
Furthermore, the kernel function has the expansion
[TABLE]
where the convergence of the sequence holds absolutely and uniformly.
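For reference, the three displayed statements (the Hilbert-Schmidt condition, the eigen-equation, and the kernel expansion) take the following standard form. The notation here is generic and is an assumption: kernel $K$, distribution $P$ over the instance space $\mathcal{X}$, and eigenpairs $(\mu_j, \phi_j)$; the paper's own symbols may differ.

```latex
\int_{\mathcal{X}\times\mathcal{X}} K^2(x, z)\, dP(x)\, dP(z) < \infty,
\qquad
\int_{\mathcal{X}} K(x, z)\, \phi_j(z)\, dP(z) = \mu_j\, \phi_j(x) \quad \text{for all } j \ge 1,
\qquad
K(x, z) = \sum_{j=1}^{\infty} \mu_j\, \phi_j(x)\, \phi_j(z).
```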
A.2 Conditions for reward boundedness
For learning to be feasible in the proposed framework, we would require that the evaluation functional is bounded for any policy . Using the fact that and , we have
[TABLE]
Thus one sufficient condition for the reward functional to be bounded is to ensure that the operator norm is finite. In the special case when the map is diagonal with , the above condition simplifies to
[TABLE]
A.3 Regularity assumptions on map
We assume that the map is a compact bounded operator from the policy space to the reward space . By Schauder’s theorem, the adjoint is also a compact operator. Thus, the map is a compact self-adjoint operator. This allows us to use the spectral theorem for compact self-adjoint operators which guarantees the existence of eigenvalues and eigenfunctions for the operator and a corresponding singular value decomposition for the map (Kreyszig, 1978).
A.4 Non-aligned RKHSs
As mentioned in Section 2, if the eigenvectors of the spaces and are not aligned, one can consider the following simple transformation which resolves this. Let and represent the eigenvectors.
[TABLE]
The above transformation implies that and .
Appendix B Proof of main results
In this section we provide the proofs for the main results of this work. Appendix D to follow contains the proofs for the other results.
B.1 Proof of Theorem 1
We begin by proving the result for the special case when the policy set consists of the entire unit ball and then generalize the analysis to arbitrary policy sets.
Case 1: is unit ball in .
For this special case, observe that the optimal policy and the plug-in policy for any reward estimate can be written as
[TABLE]
where the operator is the adjoint of the map . To prove a bound on the excess risk using the plug-in estimate, we use the following lemma which bounds this error in terms of deviation of the estimated and true rewards.
Lemma 1**.**
Consider any vectors and with finite non-zero norm under some inner product . Then, we have
[TABLE]
The proof of the above lemma is presented in Section B.1.1. Taking the above as given, we can upper bound the excess risk
[TABLE]
Case 2: Arbitrary set .
For this case, consider the excess risk of plug-in estimator obtained by maximizing reward estimate
[TABLE]
where the final inequality follows from the fact that maximizes over the set .
Thus, we see that for both the cases above, we can upper bound the excess risk of the plug-in estimator in terms of the norm . Next, we evaluate this for the ridge regression based reward estimator for any set of queries with covariance matrix . For any regularization parameter , we have,
[TABLE]
where and equality follows by substituting the value of . Let us denote by matrix . Therefore, the error in reward estimation
[TABLE]
where the final distribution follows from our assumption on the noise variables . Using this above distributional form, we have
[TABLE]
The final bound for the general policy set follows from using the above bound with an application of Jensen’s inequality. In order to convert the above bound to a high probability bound, we require an infinite dimensional analog of the Hanson-Wright concentration inequality. Using Theorem 2.6 from Chen and Yang (2021) along with equation (B.1), we obtain
[TABLE]
where the covariance matrix . ∎
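The plug-in pipeline analyzed in this proof can be illustrated in a heavily simplified setting. The sketch below assumes a finite-dimensional reward vector, an identity map between policy and reward space, the unit ball as policy set, and basis-vector queries repeated `m` times each; all function and variable names are hypothetical and this is not the paper's Algorithm 1.

```python
import math
import random

def ridge_plugin_basis_queries(theta, m, lam, sigma, rng):
    """Query each basis direction m times, fit ridge regression, return the plug-in policy.

    Illustrative assumptions: finite-dimensional reward `theta`, identity map,
    unit-ball policy set, Gaussian oracle noise with standard deviation `sigma`.
    """
    theta_hat = []
    for theta_j in theta:
        # Oracle response to basis policy e_j: <theta, e_j> plus Gaussian noise.
        ys = [theta_j + rng.gauss(0.0, sigma) for _ in range(m)]
        # With basis queries the design covariance is m * I, so the ridge
        # solution (X^T X + lam I)^{-1} X^T y decouples coordinate-wise.
        theta_hat.append(sum(ys) / (m + lam))
    norm = math.sqrt(sum(t * t for t in theta_hat))
    # On the unit ball, the maximizer of <theta_hat, pi> is theta_hat / ||theta_hat||.
    policy = [t / norm for t in theta_hat]
    return theta_hat, policy

def excess_risk(theta, policy):
    """Best reward on the unit ball, ||theta||, minus the reward of `policy`."""
    best = math.sqrt(sum(t * t for t in theta))
    return best - sum(t * p for t, p in zip(theta, policy))

rng = random.Random(0)
theta = [1.0 / (j + 1) for j in range(20)]  # hypothetical decaying coefficients
_, policy_small = ridge_plugin_basis_queries(theta, m=2, lam=0.1, sigma=1.0, rng=rng)
_, policy_large = ridge_plugin_basis_queries(theta, m=2000, lam=0.1, sigma=1.0, rng=rng)
risk_small = excess_risk(theta, policy_small)
risk_large = excess_risk(theta, policy_large)
```

In this toy setting the excess risk is driven entirely by the error of the reward estimate, which is the quantity the proof controls; more queries per direction shrink the noise in each coordinate.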
B.1.1 Proof of Lemma 1
Let the vector for some difference vector . Using this, we have
[TABLE]
where follows from using the inequality . This establishes the result. ∎
B.2 Proof of Proposition 1
Let us denote the map and the covariance matrix . From the upper bound obtained in Theorem 1, we have,
[TABLE]
where inequality follows from using the fact that and inequality uses the diagonal structure of the map as well as the fact that each policy has unit -norm.
Recall that the choice of querying strategy queries each scaled eigenfunction of the policy space times. Therefore the entry of the covariance matrix is given by
[TABLE]
Plugging the above value of into equation (B.2), we obtain
[TABLE]
This concludes the proof of the proposition. ∎
B.3 Proof of Corollary 1
We now derive explicit finite sample rates for the case when the spectrum of the map satisfies a power law decay for some parameter . In the notation used in Proposition 1, we have the quantity
[TABLE]
Our proof strategy will be to instantiate the bias and variance terms for this setting of and finally select a setting for the exploration parameter and regularization parameter .
Bounding Bias.
The bias term in the proposition is a max over two terms
[TABLE]
We consider the two terms in the analysis here separately. For the first term,
[TABLE]
where the final inequality follows from using . For the second term, we have
[TABLE]
Bounding Variance.
Recall that the variance term (assuming ) is given by
[TABLE]
We again consider both terms of the maximum separately. For the first term,
[TABLE]
where the inequality follows from ignoring the term in the denominator. For the second variance term,
[TABLE]
Setting regularization parameter.
By setting , we have that the bias term is dominated by . Similarly, the above setting also implies that the variance term is dominated by . Combining these observations, we have that the expected error is upper bounded by
[TABLE]
Setting and then , we get that
[TABLE]
This completes the proof of the corollary. ∎
B.4 Proof of Theorem 2
In order to prove the general theorem, we exhibit a transformation which allows us to reduce the problem to that with the diagonal structure described in Proposition 1.
We will consider orthogonally diagonalizable matrices and which represent the eigenvectors and eigenvalues of the Hilbert spaces and . Consider the following set of transformations for any reward and policy .
[TABLE]
With this transformation, we can rewrite the objective function above
[TABLE]
where the inner product denotes the standard inner product. Observe that we have overloaded notation to denote by . Further, using these above transformations, we can rewrite the adjoint operator
[TABLE]
Recall from Theorem 1, the matrix
[TABLE]
where the covariance matrix . We have used the fact here that the matrices and are orthogonally diagonalizable and hence symmetric. Finally, we will denote the singular value decomposition of the compact map in the matrix form as
[TABLE]
The existence of such a decomposition is guaranteed by the regularity assumptions we consider on the map in Appendix A. We will now analyze the bias and the variance terms from the upper bound on from Theorem 1.
Bound on bias.
The squared bias term is given by
[TABLE]
where we have used the SVD decomposition for the matrix in the last step.
Bound on variance.
The variance term is given by
[TABLE]
Finally, by making a substitution for reward and policy in equations (B.4) and (B.4), we recover back the bias variance expressions used in the analysis for Proposition 1. What remains to be shown is that our particular choice of query policies correspond to basis vectors in this transformed space. For this, observe that the sampling policies
[TABLE]
is such that the transformed policies
[TABLE]
indeed correspond to the basis vector. This finishes the proof of the desired claim. ∎
B.5 Proof of Corollary 2
The proof of this corollary is similar to that of Corollary 1 in terms of bounding the bias and the variance. The final rate follows by an application of Jensen’s inequality to conclude
[TABLE]
The final rate that we get in this case is thus upper bounded by the square root of the rate observed in Corollary 1. This concludes the proof. ∎
Appendix C Gaussian process bandit optimization
In this section, we discuss in detail the application of our framework to the problem of frequentist Gaussian process bandit optimization, also known as Kernelized multi-armed bandits (MAB) problem. Recall the reduction of the Kernel MAB problem to our setup required us to define three elements.
Reward space .
Given the RKHS as well as the elements of the cover , we view the reward function as a map from to , or equivalently as a vector in . More precisely, letting denote the vector of evaluations of a function , we define
[TABLE]
where represents the standard inner product. With this notation, let us define the true reward .
Policy Space .
For the policy space in our setup, we let
[TABLE]
The choice of the above norm ensures that
[TABLE]
For the policy space , we have created an orthonormal embedding of the set of vectors . Observe that this policy set that we construct satisfies the regularity Assumption 1 because each vector is an eigenvector of the space .
Map .
By our assumption that the kernel is a Mercer’s kernel, we have that , that is, for all , the vector . Furthermore, both and are sub-spaces of and we can take the map .
With these definitions, we now explicitly establish a correspondence between our doubly nonparametric bandit problem and the Kernel MAB problem.
C.1 Connecting the problems
Given an RKHS with an associated Mercer’s kernel , the objective of the zeroth-order bandit optimization problem is
[TABLE]
with access to oracle
[TABLE]
Equivalently, the objective in our reward learning framework is
[TABLE]
where the corresponding spaces and inner products are defined in the previous section. The oracle required in our setup responds with
[TABLE]
for any policy such that . Our first lemma below states that obtaining such an oracle is indeed feasible if we are able to restrict our queries to include only points for which the vector .
Lemma 2**.**
Given access to oracle for a function , the corresponding oracle can be implemented when the query set consists of .
Proof.
For any query point , the oracle needs to compute the value . Thus, these two oracles on the provided query set are exactly identical. ∎
Lemma 3**.**
For any satisfying , we have that .
Proof.
Observe that an alternate way to define the RKHS norm is given by
[TABLE]
The fact that is computed on establishes the desired claim. ∎
Finally, we turn to establishing a relation between the solutions obtained from solving the relaxed problem (P2) as compared to solving the original problem (P1). We denote the corresponding maximizers for both problems
[TABLE]
The following lemma now relates both these maximizers together.
Lemma 4**.**
For an RKHS with kernel satisfying Assumption 2 with constant and any function , let and be the maximizers as defined in equation (57), we have
[TABLE]
Proof.
Denote by the projection of the point onto the set . Then, we have
[TABLE]
This completes the proof of the lemma. ∎
The above lemma shows that solving Problem P2 is equivalent to solving Problem P1 up to an additive factor of when we are working with an -cover over the domain space.
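The generic argument behind this kind of statement is that an L-Lipschitz function loses at most L times ε when its maximization is restricted to an ε-cover. The following one-dimensional sketch checks this numerically; the objective `f`, the grid cover, and the constant `lipschitz` are hypothetical choices, and the sketch does not reproduce the lemma's exact constants.

```python
import math

def grid_cover_max(f, lo, hi, num_points):
    """Maximize f over an evenly spaced grid on [lo, hi].

    Returns (argmax on grid, max on grid, eps), where eps = step / 2 is the
    covering radius: every point of [lo, hi] is within eps of a grid point.
    """
    step = (hi - lo) / (num_points - 1)
    grid = [lo + i * step for i in range(num_points)]
    x_best = max(grid, key=f)
    return x_best, f(x_best), step / 2.0

def f(x):
    # Hypothetical objective; |f'(x)| = |3 cos(3x)| <= 3, so f is 3-Lipschitz.
    return math.sin(3.0 * x)

lipschitz = 3.0
x_grid, grid_max, eps = grid_cover_max(f, 0.0, 1.0, 11)
true_max = 1.0  # attained at x = pi / 6, which lies in [0, 1]
gap = true_max - grid_max
```

Here `gap` is the price of optimizing over the cover instead of the full domain, and it stays below `lipschitz * eps`.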
C.2 Analysis for bandit optimization
Recall from the previous section that the quantity which determines the rate of decay is the ratio of eigenvalues
[TABLE]
where is the eigenvalue of the kernel matrix . Let us denote by the uniform distribution over the input space and let us suppose that the cover is formed using random samples from this distribution. Let us denote by the eigenvalues and by the corresponding eigenvectors of the Mercer kernel . For every point , let us denote by
[TABLE]
the corresponding featurization of the point . Then, for , we have
[TABLE]
Observe that the kernel matrix and the (scaled) sample covariance matrix are built from the same feature vectors, multiplied in the two possible orders, and thus share the same nonzero eigenvalues. The following lemma, adapted from Koltchinskii and Lounici (2017, Theorem 9), relates the eigenvalues of the sample covariance matrix to those of the underlying kernel .
Lemma 5**.**
For any and size of the cover satisfying , we have,
[TABLE]
with probability at least .
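The observation that the kernel matrix and the sample covariance matrix have the same nonzero eigenvalues can be checked numerically. The sketch below uses a small random finite-dimensional featurization (an assumption made purely for illustration; the Mercer featurization in the text is infinite-dimensional) and compares the top eigenvalue of both matrices via power iteration.

```python
import math
import random

def matmul_vec(mat, vec):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]

def top_eigenvalue(mat, iters=1000):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    vec = [1.0] * len(mat)
    for _ in range(iters):
        w = matmul_vec(mat, vec)
        norm = math.sqrt(sum(x * x for x in w))
        vec = [x / norm for x in w]
    # Rayleigh quotient of the (unit-norm) iterate.
    return sum(a * b for a, b in zip(vec, matmul_vec(mat, vec)))

rng = random.Random(0)
n_points, n_features = 8, 3
# Hypothetical feature vectors phi(x_1), ..., phi(x_N).
features = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_points)]

# Kernel (Gram) matrix: K[i][j] = <phi(x_i), phi(x_j)>, size N x N.
gram = [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]
# Unnormalized sample covariance: sum_i phi(x_i) phi(x_i)^T, size D x D.
cov = [[sum(f[a] * f[b] for f in features) for b in range(n_features)]
       for a in range(n_features)]

lam_gram = top_eigenvalue(gram)
lam_cov = top_eigenvalue(cov)
trace_gram = sum(gram[i][i] for i in range(n_points))
trace_cov = sum(cov[a][a] for a in range(n_features))
```

The two matrices have different sizes, so the larger one carries additional zero eigenvalues; the nonzero parts of the spectra coincide, which is also why the traces agree.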
The following corollary of Lemma 5 provides us with a way to control the deviation of the eigenvalues from the corresponding in a multiplicative manner.
Corollary 4**.**
For any value of decay parameter and , we have, for all , the eigenvalues
[TABLE]
with high probability.
Proof.
Let us understand the condition and see what restrictions it puts on the value of the covering number. Let us suppose that the true eigenvalues and we set the value of . Therefore, the sum
[TABLE]
Thus, if we set , then for any and , the above condition on the covering number will be satisfied and we get the desired bound on the deviation of the empirical eigenvalues from population eigenvalues. ∎
The above corollary is essential to our argument because we often have a good understanding of the decay of the eigenvalues of the kernel associated with the RKHS and this allows us to relate the set of empirical eigenvalues to these.
We now present a proof of Theorem 3, restated below, which upper bounds the excess risk for this setup. We will then use a batch to online conversion bound to convert this to a regret bound and specialize to the Matérn kernel later.
Theorem 5** (Restated Theorem 3).**
Suppose that the eigenvalues of a -Lipschitz kernel with respect to a distribution over satisfy the power-law decay . Let be the output of Algorithm 1 using queries to the oracle . Then, for any value of and , the excess risk
[TABLE]
with high probability.
Proof.
Our strategy, as before, will be to explore directions and assume . Recall from Theorem 2 that, for symmetric matrices, the excess error of the plug-in estimator can be upper bounded as
[TABLE]
Bounding Bias.
We will split the analysis into two cases.
Case 1: . For this case, we have that and therefore
[TABLE]
with the above holding with high probability from an application of Corollary 4 for any .
Case 2: . For this case, we have . The bias can then be upper bounded as
[TABLE]
where the final inequality follows from using .
Bounding variance.
As we did in the section above, let us split the analysis into two cases.
Case 1: . For this case, the variance term simplifies to
[TABLE]
Case 2: . For the second case, we can upper bound the variance term
[TABLE]
where the last inequality follows from ignoring the second term in the denominator.
Setting regularization parameter.
From the analysis in the above paragraphs, we have
[TABLE]
For regularization parameter and , we have
[TABLE]
Excess risk bound.
To obtain the final excess risk bound, we set
[TABLE]
where inequality follows from our particular choice of . Combining the above bound with Lemma 4 completes the proof. ∎
The following corollary instantiates the above theorem for the case when the input space is the unit ball, that is, .
Corollary 5**.**
Let the input space and the kernel satisfy Assumption 2. Then, for any , we have
[TABLE]
Proof.
From the bound in Theorem 3, we have,
[TABLE]
where inequality follows from substituting , follows from the fact that , and follows from using the assumption that .
Finally, setting , we get
[TABLE]
This establishes the desired claim. ∎
C.3 Regret bound for Matérn Kernel
In this section, we specialize the bound from Theorem 3 for the special class of Matérn kernels. Recall that the Matérn kernel is a distance-based kernel with . Denote by , the exact form for the kernel is given by
[TABLE]
with parameters and and where is the modified Bessel function of the second kind. Going forward, let us fix the scale parameter without loss of generality.
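For half-integer smoothness the Bessel-function form reduces to well-known closed-form expressions, which makes the kernel easy to evaluate without special functions. The sketch below implements the cases 1/2, 3/2 and 5/2 with unit output scale; the function names and the `length_scale` parameterization are assumptions, and the paper's normalization of the kernel may differ.

```python
import math

def matern(r, nu, length_scale=1.0):
    """Matérn kernel value at distance r >= 0 for nu in {0.5, 1.5, 2.5}."""
    s = r / length_scale
    if nu == 0.5:
        return math.exp(-s)
    if nu == 1.5:
        a = math.sqrt(3.0) * s
        return (1.0 + a) * math.exp(-a)
    if nu == 2.5:
        a = math.sqrt(5.0) * s
        return (1.0 + a + a * a / 3.0) * math.exp(-a)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented in this sketch")

def matern_kernel(x, y, nu, length_scale=1.0):
    """Kernel between two points (equal-length sequences) using Euclidean distance."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return matern(r, nu, length_scale)
```

For example, `matern_kernel([0.0, 0.0], [3.0, 4.0], nu=1.5)` evaluates the kernel at distance 5. A general smoothness parameter would require the modified Bessel function of the second kind, which the Python standard library does not provide.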
The following lemma then bounds the Lipschitz constant for this class of kernels when the distance function is the norm.
Lemma 6** (Lipschitz Matérn Kernel).**
Consider the Matérn kernel with parameter . The Lipschitz constant of this kernel is bounded by
[TABLE]
Proof.
The approach will be to show that the kernel function is a Lipschitz function of the distance and then cover the ball in the dimensional space appropriately. We now look at the derivative of the function with respect to .
[TABLE]
where follows from the identity .
For any , we have the inequality
[TABLE]
Instantiating the above with and , we have
[TABLE]
The Lipschitz constant for this case can now be obtained by taking a sup over . ∎
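The idea of bounding the derivative of the kernel with respect to the distance can be sanity-checked numerically in a special case. The sketch below treats only smoothness 3/2 with unit length scale (an assumption chosen because the derivative has the simple closed form -3 r exp(-sqrt(3) r), whose largest magnitude is sqrt(3)/e); it does not reproduce the general bound of the lemma.

```python
import math

def matern32(r):
    """Matérn kernel with smoothness 3/2 and unit length scale, as a function of distance."""
    a = math.sqrt(3.0) * r
    return (1.0 + a) * math.exp(-a)

def max_slope(func, r_max, num_steps):
    """Largest absolute finite-difference slope of func on a uniform grid over [0, r_max]."""
    h = r_max / num_steps
    return max(abs(func((i + 1) * h) - func(i * h)) / h for i in range(num_steps))

estimated = max_slope(matern32, r_max=5.0, num_steps=5000)
closed_form = math.sqrt(3.0) / math.e  # sup over r of |d/dr matern32(r)|, attained at r = 1/sqrt(3)
```

Because the distance itself is 1-Lipschitz in either argument, a bound on this slope carries over to the kernel as a function of its inputs.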
While our upper bound was in terms of sample complexity, in order to compete with the cumulative regret formulation, we adopt an explore-then-commit strategy. The following lemma relates the sample complexity bound to a cumulative regret bound.
Lemma 7** (Batch to online conversion).**
Suppose an algorithm has sample complexity in the passive learning setup. Then the explore-then-commit strategy based on this learning algorithm has regret .
Proof.
For some parameter , let the explore-then-commit algorithm explore for steps and then commit to the strategy obtained after this exploration for the remaining time steps. The cumulative regret for such an algorithm is
[TABLE]
Setting finishes the proof. ∎
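The trade-off in this proof can be made concrete with a small numerical sketch. It assumes (hypothetically) that the excess risk after n exploration queries is at most C n^(-a) and that each exploration step costs at most B in regret, so the cumulative regret is bounded by B n + (T - n) C n^(-a). The names `etc_regret_bound` and `best_exploration_length` and all constants are illustrative, not taken from the paper.

```python
def etc_regret_bound(T, n, a, C=1.0, B=1.0):
    """Upper bound on explore-then-commit regret: explore n steps, commit for T - n."""
    return B * n + (T - n) * C * n ** (-a)

def best_exploration_length(T, a, C=1.0, B=1.0):
    """Exploration length in {1, ..., T} minimizing the regret bound (brute force)."""
    return min(range(1, T + 1), key=lambda n: etc_regret_bound(T, n, a, C, B))

T, a = 10_000, 0.5
n_star = best_exploration_length(T, a)
# Rule of thumb balancing the two terms: n of order T^(1 / (1 + a)).
n_rule = round(T ** (1.0 / (1.0 + a)))
regret_star = etc_regret_bound(T, n_star, a)
regret_rule = etc_regret_bound(T, n_rule, a)
```

Under these assumptions, choosing n of order T^(1/(1+a)) balances the exploration and commitment terms and gives regret of order T^(1/(1+a)), matching the brute-force optimum up to a constant factor.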
We now proceed to prove Corollary 3 which instantiates the bound in Theorem 3 for the class of Matérn kernels.
Corollary 6** (Restated Corollary 3).**
Consider the family of Matérn kernels with parameter defined with the Euclidean norm over $\mathbb{R}^d$. The -step regret of the explore-then-commit algorithm is
[TABLE]
with high probability.
Proof.
First, observe that excess risk bound in Corollary 5 can be converted to a corresponding -step regret bound by an application of Lemma 7 such that
[TABLE]
For the class of Matérn kernels, the decay parameter (Janz et al., 2020, Theorem 9). Using this with the above regret bound, we get,
[TABLE]
This completes the proof. ∎
Appendix D Adaptive sampling via GP-UCB
In this section, we prove an upper bound on the expected risk of the Gaussian process upper confidence bound (GP-UCB) algorithm of Srinivas et al. (2010). In order to adapt their algorithm for our setup, consider the function
[TABLE]
We have used to denote policies in this setup to be consistent with the notation in Srinivas et al. (2010). Observe that the domain defined above is not compact – a necessary condition for the algorithm to work. One workaround is to truncate the unit ball after a finite number of dimensions and bound this truncation error. The excess risk incurred by this truncation can be made arbitrarily small. Going forward, we ignore this truncation. The regret for the UCB algorithm is shown to be upper bounded by where is the information gain with
[TABLE]
where we have assumed without loss of generality that the noise variance . For our setup, the kernel function . We additionally require that the reward function belongs to the RKHS spanned by the set . Denote by and suppose that its eigenvalues satisfy a power law decay with . The following lemma upper bounds the information gain for this setup in terms of the power law parameter .
Lemma 8** (Information Gain.).**
The information gain for the above setup is bounded as
[TABLE]
Proof.
The quantity of interest here is the information gain
[TABLE]
where the matrix and we have assumed that the noise variance is . From the setup described above, we have that the eigenvalues of decay as . It is easy to see that
[TABLE]
is a monotonic sub-modular function. Thus, the value of can be upper bounded by times the value of the greedy maximization algorithm. The greedy maximization algorithm is equivalent to picking
[TABLE]
It is easy to see that at each time , the unit vector will be an eigenvector of the matrix . Given this observation, we can finally upper bound the value of the info gain
[TABLE]
Solving the above optimization problem, the optimal choice of the variables
[TABLE]
Setting ensures that there are active directions. Substituting the above values of in the expression for , we get
[TABLE]
where follows from setting . This establishes the required claim. ∎
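The greedy allocation used in this proof can be sketched in the diagonal case, where every query is an eigen-direction and the information gain with unit noise variance reduces to half the sum of log(1 + mu_j n_j) over directions, with n_j the number of times direction j is queried. The eigenvalue sequence `j ** -2.0`, the truncation to 30 directions, and the function name are hypothetical choices for illustration.

```python
import math

def greedy_information_gain(eigenvalues, T):
    """Greedily allocate T queries to eigen-directions to maximize
    0.5 * sum_j log(1 + mu_j * n_j), assuming unit noise variance."""
    counts = [0] * len(eigenvalues)
    for _ in range(T):
        def marginal(j):
            # Increase in log(1 + mu_j * n_j) from one more query of direction j.
            mu = eigenvalues[j]
            return math.log1p(mu * (counts[j] + 1)) - math.log1p(mu * counts[j])
        j_best = max(range(len(eigenvalues)), key=marginal)
        counts[j_best] += 1
    gain = 0.5 * sum(math.log1p(mu * n) for mu, n in zip(eigenvalues, counts))
    return gain, counts

# Hypothetical power-law spectrum truncated to 30 directions.
eigenvalues = [j ** -2.0 for j in range(1, 31)]
gain_50, counts_50 = greedy_information_gain(eigenvalues, 50)
gain_200, counts_200 = greedy_information_gain(eigenvalues, 200)
```

The resulting counts concentrate on the leading directions and leave directions with small eigenvalues unqueried, which mirrors the notion of a limited number of active directions in the proof.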
We are now ready to state our sample complexity bound for GP-UCB for this subclass of problems.
Proposition 2** (Sample complexity for GP-UCB).**
Suppose that the policy space , reward space and the map satisfy the power law decay assumption with exponent . The estimator output by the GP-UCB algorithm satisfies
[TABLE]
The proof of the sample complexity bound in Proposition 2 now follows from the regret bound of along with the upper bound on the information gain from Lemma 8.
[TABLE]
More recently, Cai and Scarlett (2021) extended the analysis of Valko et al. (2013) to show that the SupKernelUCB algorithm achieves a regret bound . Using this modified bound, one can improve the above analysis to obtain excess risk
[TABLE]
which is still worse than the bounds obtained by our proposed ridge regression estimator.
Appendix E Further details on experimental evaluation
In the simulation study, we work with dimensional RKHSs and . In order to simulate the nonparametric regime, we typically use values of that are smaller than, or at most a constant times, the dimension . We set the matrices , and the map . This is allowed since the policy space is smaller than the reward space. With this, the effective decay parameter . We sampled the true reward uniformly at random from the unit ball in . We further sampled the oracle noise . All plots were averaged over 10 runs.
References
- Atkinson (1996): A. C. Atkinson. The usefulness of optimum experimental designs. Journal of the Royal Statistical Society: Series B (Methodological), 58, 1996.
- Böhm et al. (2019): F. Böhm, Y. Gao, C. M. Meyer, O. Shapira, I. Dagan, and I. Gurevych. Better rewards yield better summaries: Learning to summarise without references. arXiv preprint arXiv:1909.01214, 2019.
- Bubeck et al. (2011): S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(5), 2011.
- Cai and Scarlett (2021): X. Cai and J. Scarlett. On lower bounds for standard and robust Gaussian process bandit optimization. In International Conference on Machine Learning. PMLR, 2021.
- Chaloner and Verdinelli (1995): K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.
- Chen and Yang (2021): X. Chen and Y. Yang. Hanson–Wright inequality in Hilbert spaces with application to k-means clustering for non-Euclidean data. Bernoulli, 27(1):586–614, 2021.
- Chowdhury and Gopalan (2017): S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, 2017.
- Christiano et al. (2017): P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741, 2017.
