Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning

Subhojyoti Mukherjee; Josiah P. Hanna; Qiaomin Xie; Robert Nowak

arXiv:2406.05064·cs.LG·October 24, 2025

TL;DR

This paper introduces a novel pretraining method for decision transformers that learns to outperform demonstrators in multi-task structured bandit problems by exploiting shared task structures without requiring privileged information.

Contribution

It proposes a new pretraining approach for transformers that learns near-optimal policies in-context, surpassing prior methods that need privileged information or cannot outperform demonstrators.

Findings

1. Transformer pretraining outperforms demonstrators on unseen tasks
2. Method generalizes across various structured bandit problems
3. Achieves rapid reward prediction for effective exploration

Abstract

We study learning to learn for the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and an algorithm should exploit the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure from data collected by a demonstrator on a set of training task instances. Our objective is to devise a training procedure such that the transformer will learn to outperform the demonstrator's learning algorithm on unseen test task instances. Prior work on pretraining decision transformers either requires privileged information like access to optimal arms or cannot outperform the demonstrator. Going beyond these approaches, we introduce a pre-training approach that trains a transformer network…
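As a point of reference for the objective named in the abstract, cumulative regret can be sketched as the running sum of gaps between the best arm's mean reward and the mean reward of the arm actually pulled. The snippet below is a minimal illustration of that standard definition (pseudo-regret, computed from mean rewards), not the paper's method; the function name, the three-arm example, and the pull sequence are all hypothetical.

```python
def cumulative_regret(mean_rewards, chosen_arms):
    """Running cumulative pseudo-regret for a sequence of arm pulls.

    mean_rewards: expected reward of each arm (hypothetical values).
    chosen_arms:  index of the arm pulled at each round.
    """
    best = max(mean_rewards)  # mean reward of the optimal arm
    total = 0.0
    curve = []
    for arm in chosen_arms:
        total += best - mean_rewards[arm]  # per-round gap to the optimal arm
        curve.append(total)
    return curve


# Hypothetical three-armed task: arm 2 is optimal.
mean_rewards = [0.2, 0.5, 0.9]
# A learner that explores arms 0 and 1 once, then commits to arm 2.
pulls = [0, 1, 2, 2]
curve = cumulative_regret(mean_rewards, pulls)
print(curve)
```

Once the learner commits to the optimal arm the curve stops growing, which is the behavior a low-regret algorithm aims to reach quickly on an unseen test task.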

Taxonomy

Topics: Data Stream Mining Techniques · Advanced Bandit Algorithms Research · Smart Grid Energy Management

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Dense Connections · Convolution · Residual Connection · Layer Normalization · Dense Prediction Transformer