Calibration-Gated LLM Pseudo-Observations for Online Contextual Bandits
Maksim Pershin, Ivan Golovanov, Pavel Baltabaev, Natalia Trankova

TL;DR
This paper improves online contextual bandit algorithms by having a large language model generate pseudo-observations for unplayed arms, weighted by a calibration-gated scheme, to reduce regret during cold-start.
Contribution
It proposes a novel augmentation of Disjoint LinUCB with LLM pseudo-observations and a calibration-gated decay schedule to mitigate cold-start regret.
Findings
LLM pseudo-observations reduce regret by 19% on MIND with task-specific prompts.
Prompt design affects performance more than the decay schedule or the gating parameters do.
Calibration gating's effectiveness varies with prediction error levels, affecting bias-variance trade-offs.
Abstract
Contextual bandit algorithms suffer from high regret during cold-start, when the learner has insufficient data to distinguish good arms from bad. We propose augmenting Disjoint LinUCB with LLM pseudo-observations: after each round, a large language model predicts counterfactual rewards for the unplayed arms, and these predictions are injected into the learner as weighted pseudo-observations. The injection weight is controlled by a calibration-gated decay schedule that tracks the LLM's prediction accuracy on played arms via an exponential moving average; high calibration error suppresses the LLM's influence, while accurate predictions receive higher weight during the critical early rounds. We evaluate on two contextual bandit environments, UCI Mushroom (2-arm, asymmetric rewards) and MIND-small (5-arm news recommendation), and find that when equipped with a task-specific prompt, LLM…
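The mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact weighting formula, so the product of a time decay and an exponential error penalty used in `CalibrationGate.weight` is an assumption, as are all class names, parameter names (`w0`, `decay`, `ema_beta`, `sharpness`), default values, and the `llm_predict` callable, which stands in for whatever LLM call produces a reward prediction.

```python
import numpy as np


class DisjointLinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha
        self.A = [ridge * np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x):
        # Score each arm by its estimated reward plus an exploration bonus.
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward, weight=1.0):
        # weight=1.0 is a real observation; weight<1.0 is a pseudo-observation.
        self.A[arm] += weight * np.outer(x, x)
        self.b[arm] += weight * reward * x


class CalibrationGate:
    """Hypothetical gate: EMA of LLM error on played arms times a time decay."""

    def __init__(self, w0=1.0, decay=0.01, ema_beta=0.9, sharpness=5.0):
        self.w0, self.decay = w0, decay
        self.ema_beta, self.sharpness = ema_beta, sharpness
        self.err = 0.0  # EMA of absolute prediction error
        self.t = 0      # number of rounds observed

    def observe(self, predicted, actual):
        gap = abs(predicted - actual)
        self.err = self.ema_beta * self.err + (1 - self.ema_beta) * gap
        self.t += 1

    def weight(self):
        # Assumed form: the weight shrinks over time and with calibration error.
        time_term = np.exp(-self.decay * self.t)
        gate_term = np.exp(-self.sharpness * self.err)
        return self.w0 * time_term * gate_term


def play_round(bandit, gate, x, reward_fn, llm_predict):
    arm = bandit.select(x)
    reward = reward_fn(arm)
    bandit.update(arm, x, reward)              # real observation, full weight
    gate.observe(llm_predict(x, arm), reward)  # calibrate on the played arm
    w = gate.weight()
    for other in range(len(bandit.A)):
        if other != arm:                       # pseudo-observations for unplayed arms
            bandit.update(other, x, llm_predict(x, other), weight=w)
    return arm, reward
```

Scaling both the outer-product term and the reward term by the same weight treats a pseudo-observation as a fractional sample, so an inaccurate LLM contributes progressively less to each arm's estimate as the tracked error grows.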
