Prompt Tuning Decision Transformers with Structured and Scalable Bandits
Finn Rietz, Oleg Smirnov, Sara Karimi, Lele Cao

TL;DR
This paper introduces a bandit-based prompt tuning method for Decision Transformers in offline RL, improving task generalization and scalability by learning optimal prompts at inference time with theoretical guarantees.
Contribution
It proposes a structured bandit architecture for prompt construction that uses the pre-trained PDT as the bandit's feature extractor, and it provides theoretical regret bounds alongside empirical performance improvements.
Findings
Achieves linear rather than combinatorial scaling with prompt size
Enhances performance across diverse tasks and environments
Outperforms existing prompt tuning baselines
Abstract
Prompt tuning has emerged as a key technique for adapting large pre-trained Decision Transformers (DTs) in offline Reinforcement Learning (RL), particularly in multi-task and few-shot settings. The Prompting Decision Transformer (PDT) enables task generalization via trajectory prompts sampled uniformly from expert demonstrations -- without accounting for prompt informativeness. In this work, we propose a bandit-based prompt-tuning method that learns to construct optimal trajectory prompts from demonstration data at inference time. We devise a structured bandit architecture operating in the trajectory prompt space, achieving linear rather than combinatorial scaling with prompt size. Additionally, we show that the pre-trained PDT itself can serve as a powerful feature extractor for the bandit, enabling efficient reward modeling across various environments. We theoretically establish…
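The linear-versus-combinatorial scaling claim can be illustrated with a small sketch. This is not the authors' architecture: it assumes one independent epsilon-greedy bandit per prompt slot with tabular reward estimates, whereas the abstract describes a reward model built on PDT features. The names `SlotBandit`, `StructuredPromptBandit`, and `toy_return` are hypothetical, and `toy_return` is a stand-in for rolling out the PDT with the chosen prompt. With N candidate segments and J prompt slots, the structured version keeps J * N estimates instead of one per each of the N^J possible prompts.

```python
import random


class SlotBandit:
    """Epsilon-greedy bandit over candidate segments for ONE prompt slot.

    Assumption: tabular value estimates; the paper instead models rewards
    using features from the pre-trained PDT.
    """

    def __init__(self, n_arms, epsilon, rng):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = rng
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select(self):
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental mean of the rewards observed for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


class StructuredPromptBandit:
    """One SlotBandit per prompt slot (hypothetical simplification)."""

    def __init__(self, n_segments, n_slots, epsilon=0.1, seed=0):
        rng = random.Random(seed)
        self.slots = [SlotBandit(n_segments, epsilon, rng) for _ in range(n_slots)]

    def select_prompt(self):
        # A prompt is one segment index per slot.
        return [bandit.select() for bandit in self.slots]

    def update(self, prompt, reward):
        # Every slot receives the same episode-level reward.
        for bandit, arm in zip(self.slots, prompt):
            bandit.update(arm, reward)

    def n_estimates(self):
        # Grows as n_slots * n_segments, not n_segments ** n_slots.
        return sum(bandit.n_arms for bandit in self.slots)


def toy_return(prompt, best_prompt):
    """Hypothetical reward: fraction of slots holding the 'best' segment."""
    matches = sum(1.0 for p, b in zip(prompt, best_prompt) if p == b)
    return matches / len(best_prompt)


if __name__ == "__main__":
    best = [3, 1, 4]  # made-up target prompt for the toy reward
    bandit = StructuredPromptBandit(n_segments=5, n_slots=3)
    for _ in range(500):
        prompt = bandit.select_prompt()
        bandit.update(prompt, toy_return(prompt, best))
    print("estimates kept:", bandit.n_estimates(), "vs joint arms:", 5 ** 3)
```

Sharing one episode-level reward across all slots keeps the sketch short, but it makes each slot's estimate noisy because the reward also depends on the other slots' choices. A feature-based reward model, as the abstract describes, is one way to address that.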
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Bandit Algorithms Research
Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam
