Second Order Bounds for Contextual Bandits with Function Approximation

Aldo Pacchiano

arXiv:2409.16197·cs.LG·March 18, 2025



TL;DR

This paper introduces new algorithms for contextual bandits with function approximation that achieve regret bounds scaling with the sum of measurement variances, not just the time horizon, even when variances are unknown.

Contribution

It develops the first algorithms with second order regret bounds based on measurement variances in the setting of contextual bandits with function approximation.

Findings

01. Regret bounds scale with the square root of the sum of the measurement variances rather than the square root of the time horizon.

02. The algorithms do not need to know the variances in advance.

03. The results generalize second order bounds to complex function classes.
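
As a rough illustration of the first finding, the display below contrasts the first-order rate quoted in the abstract with the general shape of a second-order rate. The symbols $d_{\mathrm{eluder}}$ (eluder dimension), $\mathcal{F}$ (function class), $T$ (time horizon) and $\sigma_t^2$ (noise variance at round $t$) are notation assumed here for illustration. Constants, logarithmic factors and the exact dependence on $d_{\mathrm{eluder}}$ in the paper's theorems are not reproduced.

```latex
\[
\underbrace{\sqrt{d_{\mathrm{eluder}} \,\log|\mathcal{F}|\; T}}_{\text{first-order rate}}
\quad\text{versus}\quad
\text{a rate governed by}\ \sqrt{\sum_{t=1}^{T} \sigma_t^{2}} .
\]
% Special case: if \sigma_t = \sigma for every t, then
\[
\sqrt{\sum_{t=1}^{T} \sigma_t^{2}} \;=\; \sqrt{\sigma^{2} T} \;=\; \sigma \sqrt{T},
\]
% so the variance-dependent quantity is a factor \sigma smaller than \sqrt{T}.
```

For example, with $\sigma = 0.1$ the quantity $\sigma\sqrt{T}$ is one tenth of $\sqrt{T}$ at every horizon.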

Abstract

Many works have developed no-regret algorithms for contextual bandits with function approximation, where the mean reward function over context-action pairs belongs to a function class. Although there are many approaches to this problem, one that has gained in importance is the use of algorithms based on the optimism principle, such as optimistic least squares. It can be shown that the regret of this algorithm scales as the square root of the product of the eluder dimension (a statistical measure of the complexity of the function class), the logarithm of the function class size, and the time horizon. Unfortunately, even if the variance of the measurement noise of the rewards changes over time and is very small, the regret of the optimistic least squares algorithm still scales with the square root of the time horizon. In this work we are the first to develop algorithms that satisfy regret bounds of…
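
The optimistic least squares baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration over a small finite function class, not the paper's algorithm: the function name `optimistic_least_squares`, the helper `make_indicator`, the fixed confidence radius `beta`, and the noise-free toy rewards are all assumptions made for the example. In the theory, the radius would be tied to the logarithm of the function class size and a confidence level.

```python
def optimistic_least_squares(function_class, contexts, reward_fn, n_actions, beta):
    """Illustrative optimistic least squares over a finite function class.

    function_class: list of callables f(context, action) -> predicted mean reward
    contexts:       sequence of contexts, one per round
    reward_fn:      callable (context, action) -> observed reward (hypothetical environment)
    beta:           confidence radius on the cumulative squared loss (assumed fixed here)
    """
    history = []  # (context, action, reward) triples observed so far
    chosen = []
    for x in contexts:
        # Cumulative squared loss of every candidate on the data so far.
        losses = [sum((f(c, a) - r) ** 2 for c, a, r in history) for f in function_class]
        best = min(losses)
        # Confidence set: candidates whose loss is within beta of the least squares fit.
        plausible = [f for f, loss in zip(function_class, losses) if loss <= best + beta]
        # Optimism: play the action with the largest prediction among plausible candidates.
        action = max(range(n_actions), key=lambda a: max(f(x, a) for f in plausible))
        history.append((x, action, reward_fn(x, action)))
        chosen.append(action)
    return chosen


def make_indicator(best_action):
    """Hypothetical candidate: predicts reward 1 for one action and 0 otherwise."""
    return lambda context, action: 1.0 if action == best_action else 0.0


# Toy run: two candidates, the second one is the true (noise-free) mean reward.
candidates = [make_indicator(0), make_indicator(1)]
chosen = optimistic_least_squares(candidates, [0] * 5, candidates[1], n_actions=2, beta=0.5)
```

In this toy run the first round is a tie between the two actions, and after one uninformative pull the wrong candidate leaves the confidence set. With noisy rewards, a fixed `beta` of this kind would not adapt to small variances, which is the limitation the paper targets.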

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. Handling variance that changes over time is practical and may yield a tighter regret bound in some cases.
2. The analysis of the variance and the truncated loss is novel.

Weaknesses

1. The organization of the paper is hard to follow. The theorems and corollaries are hard to follow because there is no intuitive explanation of the terms (e.g., $\mathcal{G}_t^\prime(\tau_i)$, $\tilde{\mathcal{E}_t^\prime}$ ). It is unclear whether the results refer to the $\sigma_t=\sigma$ case or not. I suggest the authors move the $\sigma_t =\sigma$ case to the appendix so that the results are easier to see.
2. Typos: In line 163 $\mathcal{X} \times \mathcal{X} .. $ should be $\mathcal{X} \times \mathc

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The authors present a study of contextual bandits with function approximation by developing algorithms with variance-dependent regret bounds, marking the first instance of such bounds in this setting.
- These results improve upon existing bounds by requiring only a realizability assumption on the mean reward function, avoiding stronger, often impractical assumptions on measurement noise.
- The paper proposes two algorithms with distinct performance guarantees. Algorithm 1 is designed for case

Weaknesses

However, several aspects of the paper’s presentation and empirical support could be improved:
- The authors did not include empirical experiments or numerical results to validate the theoretical bounds. This leaves the practical effectiveness of the algorithms untested.
- The theoretical content is densely packed, with lemmas presented consecutively without sufficient exposition, making it challenging to follow the logical flow of arguments.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The paper studies a variance-dependent regret bound for contextual bandits, which I think is an interesting problem.
2. The paper is generally well-written, and the proof appears to be correct.
3. Their regret bound for contextual bandits with unchanged unknown variance matches the results for linear bandits. I think this is a valuable addition to the literature and may have further theoretical impact.

Weaknesses

1. Given the variance-dependent regret bound for linear bandits in the literature, it is not surprising that one can obtain a variance-dependent regret bound under general function approximation. Moreover, there is still a $\sqrt{d}$ factor gap in the general case where variance changes are allowed, compared to the linear setting. I would suggest that the authors highlight the obstacle to removing this gap in the main content.
2. The approach in this paper adapts [Zhao et al., 2023] to the gene


Taxonomy

Topics: Advanced Bandit Algorithms Research