Offline Contextual Bandits in the Presence of New Actions
Ren Kishimoto, Tatsuhiro Shimizu, Kazuki Kawamura, Takanori Muroi, Yusuke Narita, Yuki Sasamoto, Kei Tateno, Takuma Udagawa, Yuta Saito

TL;DR
This paper introduces a novel off-policy learning method, PONA, that effectively incorporates new actions in dynamic decision-making environments by leveraging action features and a new estimator, LCPI.
Contribution
The paper proposes the PONA algorithm and LCPI estimator, enabling off-policy learning to incorporate new actions using action features, a capability lacking in existing methods.
Findings
PONA outperforms existing methods in selecting new actions.
LCPI effectively balances reward modeling and data collection conditions.
PONA maintains overall policy performance while integrating new actions.
Abstract
Automated decision-making algorithms drive applications such as recommendation systems and search engines. These algorithms often rely on off-policy contextual bandits or off-policy learning (OPL). Conventionally, OPL selects actions that maximize the expected reward from an existing action set. However, in many real-world scenarios, actions, such as news articles or video content, change continuously, and the action space evolves over time after data collection. We define actions introduced after deploying the logging policy as new actions and focus on OPL with new actions. Existing OPL methods identify optimal actions from the existing set effectively but cannot learn and select new actions because no relevant data are logged. To address this limitation, we propose a new OPL method that leverages action features. We first introduce the Local Combination PseudoInverse (LCPI) estimator…
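The limitation described in the abstract, that no data are logged for new actions, can be illustrated with a minimal sketch. It uses the standard inverse propensity scoring (IPS) estimator, not the paper's LCPI estimator, and every name and number in it (`ips_value`, `new_action`, the uniform logging policy, the synthetic rewards) is a hypothetical assumption made for illustration.

```python
import numpy as np

def ips_value(actions, rewards, logging_probs, target_probs):
    """Standard IPS estimate of a target policy's value from logged data.

    target_probs[i, a] is the target policy's probability of action a
    in the i-th logged context.
    """
    n = len(actions)
    weights = target_probs[np.arange(n), actions] / logging_probs
    return float(np.mean(weights * rewards))

rng = np.random.default_rng(0)
n, n_existing = 1000, 3
new_action = 3  # hypothetical action introduced after data collection

# Synthetic log (assumption): uniform logging policy over existing actions only.
actions = rng.integers(0, n_existing, size=n)
logging_probs = np.full(n, 1.0 / n_existing)
rewards = rng.binomial(1, 0.3, size=n).astype(float)

# Target policy that always picks existing action 0.
pi_existing = np.zeros((n, n_existing + 1))
pi_existing[:, 0] = 1.0

# Target policy that always picks the new action.
pi_new = np.zeros((n, n_existing + 1))
pi_new[:, new_action] = 1.0

value_existing = ips_value(actions, rewards, logging_probs, pi_existing)
value_new = ips_value(actions, rewards, logging_probs, pi_new)
```

Because the new action never appears in the log, every importance weight for `pi_new` is zero, so the IPS estimate carries no information about that action regardless of its true reward. This is the gap that the abstract says action features are used to close.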
