Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning
Subhojyoti Mukherjee, Josiah P. Hanna, Qiaomin Xie, Robert Nowak

TL;DR
This paper introduces a pretraining method for decision transformers that lets the transformer outperform its demonstrator on multi-task structured bandit problems by exploiting the structure shared across tasks, without requiring privileged information such as access to the optimal arms.
Contribution
The paper proposes a pretraining approach under which a transformer learns near-optimal policies in-context, whereas prior methods either need privileged information or cannot outperform the demonstrator.
Findings
Transformer pretraining outperforms demonstrators on unseen tasks
Method generalizes across various structured bandit problems
Achieves rapid reward prediction for effective exploration
Abstract
We study learning to learn for the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and an algorithm should exploit the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure from data collected by a demonstrator on a set of training task instances. Our objective is to devise a training procedure such that the transformer will learn to outperform the demonstrator's learning algorithm on unseen test task instances. Prior work on pretraining decision transformers either requires privileged information like access to optimal arms or cannot outperform the demonstrator. Going beyond these approaches, we introduce a pre-training approach that trains a transformer network…
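The setting in the abstract can be made concrete with a small sketch of cumulative regret in a structured bandit. This is only an illustration under stated assumptions, not the paper's method: the linear reward structure, the arm features in `FEATURES`, the task parameter `theta`, and the uniform-random demonstrator are all hypothetical choices made for the example.

```python
import random


def arm_means(features, theta):
    """Mean reward of each arm under an assumed linear structure <x_a, theta>."""
    return [sum(x * t for x, t in zip(feat, theta)) for feat in features]


def cumulative_regret(means, actions):
    """Sum over rounds of (best mean reward - mean reward of the chosen arm)."""
    best = max(means)
    return sum(best - means[a] for a in actions)


def uniform_demonstrator(num_arms, horizon, rng):
    """Hypothetical demonstrator: picks an arm uniformly at random each round."""
    return [rng.randrange(num_arms) for _ in range(horizon)]


# Shared structure (assumption): every task uses the same arm features,
# and tasks differ only in the hidden parameter theta.
FEATURES = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]

rng = random.Random(0)
theta = (0.25, 0.75)  # hidden parameter of one illustrative task
means = arm_means(FEATURES, theta)

horizon = 20
demo_actions = uniform_demonstrator(len(FEATURES), horizon, rng)
demo_regret = cumulative_regret(means, demo_actions)

# An oracle that always plays the best arm incurs zero regret.
best_arm = means.index(max(means))
oracle_regret = cumulative_regret(means, [best_arm] * horizon)
```

Here the demonstrator's regret grows with the horizon because it never uses the shared features, while an algorithm that predicts rewards well from those features could approach the oracle's zero regret. That gap is what "outperforming the demonstrator" refers to.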
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Bandit Algorithms Research · Smart Grid Energy Management
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Dense Connections · Convolution · Residual Connection · Layer Normalization · Dense Prediction Transformer
