Offline Contextual Bandits with Overparameterized Models
David Brandfonbrener, William F. Whitney, Rajesh Ranganath, Joan Bruna

TL;DR
This paper asks whether overparameterized models generalize well in offline contextual bandits. Value-based methods do, because their objectives are action-stable; policy-based methods do not, because their objectives are unstable.
Contribution
The paper identifies action-stability as a key factor explaining the differing generalization behaviors of value-based and policy-based algorithms in overparameterized offline contextual bandits.
Findings
Value-based algorithms benefit from overparameterization and generalize well.
Policy-based algorithms do not exhibit the same generalization benefits.
Experiments show significant performance differences between value-based and policy-based algorithms, consistent with the theoretical analysis.
Abstract
Recent results in supervised learning suggest that while overparameterized models have the capacity to overfit, they in fact generalize quite well. We ask whether the same phenomenon occurs for offline contextual bandits. Our results are mixed. Value-based algorithms benefit from the same generalization behavior as overparameterized supervised learning, but policy-based algorithms do not. We show that this discrepancy is due to the action-stability of their objectives. An objective is action-stable if there exists a prediction (action-value vector or action distribution) which is optimal no matter which action is observed. While value-based objectives are action-stable, policy-based objectives are unstable. We formally prove upper bounds on the regret of overparameterized value-based learning and lower bounds on the regret for policy-based algorithms. In our experiments with…
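The action-stability definition in the abstract can be illustrated with a minimal sketch at a single context. It assumes a squared-error loss on the observed action for the value-based objective and an importance-weighted reward for the policy-based objective; the reward values, the uniform behavior policy, and all function names are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical single context with three actions and noiseless, positive rewards.
rewards = np.array([1.0, 0.5, 0.2])
behavior_probs = np.full(3, 1.0 / 3.0)  # assumed uniform logging policy


def value_loss(q, action):
    """Squared error on the observed action only (to be minimized)."""
    return (q[action] - rewards[action]) ** 2


def policy_objective(pi, action):
    """Importance-weighted reward for one logged sample (to be maximized)."""
    return rewards[action] * pi[action] / behavior_probs[action]


def best_policy_for(action):
    """Index of the one-hot policy that maximizes the per-sample objective.

    The objective is linear in pi, so its maximum over the probability
    simplex is attained at a vertex; checking one-hot policies suffices.
    """
    vertices = np.eye(len(rewards))
    scores = [policy_objective(v, action) for v in vertices]
    return int(np.argmax(scores))


# Value-based: the single prediction q = rewards attains the minimum loss
# (zero) whichever action happens to be observed.
value_losses = [value_loss(rewards, a) for a in range(3)]

# Policy-based: the maximizing policy changes with the observed action.
policy_optima = [best_policy_for(a) for a in range(3)]

print(value_losses)
print(policy_optima)
```

In this toy setting one action-value vector is optimal for every possible observation, while the per-sample optimum of the policy objective puts all probability on whichever action was logged, so no single action distribution is optimal across observations.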
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics
Methods: Q-Learning
