Prompt-Tuning Bandits: Enabling Few-Shot Generalization for Efficient Multi-Task Offline RL
Finn Rietz, Oleg Smirnov, Sara Karimi, Lele Cao

TL;DR
This paper introduces a bandit-based prompt-tuning framework for offline multi-task reinforcement learning, improving task generalization and sample efficiency without fine-tuning large models.
Contribution
It proposes an inference-time bandit approach to optimize trajectory prompts, addressing limitations of uniform prompt sampling in multi-task offline RL.
Findings
Enhanced task performance with bandit prompt-tuning
Improved sample efficiency and scalability
Better exploration of prompt space compared to baselines
Abstract
Prompting has emerged as the dominant paradigm for adapting large, pre-trained transformer-based models to downstream tasks. The Prompting Decision Transformer (PDT) enables large-scale, multi-task offline Reinforcement Learning (RL) pre-training by leveraging stochastic trajectory prompts to identify the target task. However, these prompts are sampled uniformly from expert demonstrations, overlooking a critical limitation: not all prompts are equally informative for differentiating between tasks. This limits generalization and adaptation, especially in low-data or open-world settings where sample efficiency is crucial. To address this issue, we propose a lightweight, inference-time, bandit-based prompt-tuning framework. The bandit explores and optimizes trajectory prompt selection to enhance task performance, while avoiding costly fine-tuning of the transformer backbone. Our…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Bandit Algorithms Research · Data Stream Mining Techniques · Mobile Crowdsensing and Crowdsourcing
MethodsAttention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding
