Contextual bandits with entropy-based human feedback
Raihan Seraj, Lili Meng, Tristan Sylvain

TL;DR
This paper introduces an entropy-based framework for human feedback in contextual bandits, improving performance with minimal feedback by selectively requesting expert input based on model uncertainty.
Contribution
It proposes a novel entropy-based feedback mechanism that dynamically balances exploration and exploitation in contextual bandits, adaptable to any stochastic policy.
Findings
Significant performance improvements with minimal human feedback
Robustness to suboptimal feedback quality
Model-agnostic and easily integrable approach
Abstract
In recent years, preference-based human feedback mechanisms have become essential for enhancing model performance across diverse applications, including conversational AI systems such as ChatGPT. However, existing approaches often neglect critical aspects, such as model uncertainty and the variability in feedback quality. To address these challenges, we introduce an entropy-based human feedback framework for contextual bandits, which dynamically balances exploration and exploitation by soliciting expert feedback only when model entropy exceeds a predefined threshold. Our method is model-agnostic and can be seamlessly integrated with any contextual bandit agent employing stochastic policies. Through comprehensive experiments, we show that our approach achieves significant performance improvements while requiring minimal human feedback, even under conditions of suboptimal feedback…
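The gating rule described in the abstract (ask for expert feedback only when the policy's entropy exceeds a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the threshold value, and the assumption that the expert directly supplies an action are all hypothetical choices made for illustration, and the entropy used here is plain Shannon entropy over a discrete action distribution.

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()          # normalize defensively
    nz = p[p > 0]            # skip zero entries, since 0 * log(0) is taken as 0
    return float(-(nz * np.log(nz)).sum())

def should_query_expert(probs, threshold):
    """Hypothetical gate: request feedback only if entropy exceeds the threshold."""
    return policy_entropy(probs) > threshold

def select_action(probs, threshold, expert_action, rng):
    """Return (action, queried).

    Assumption: when queried, the expert supplies an action directly.
    Otherwise the action is sampled from the agent's stochastic policy.
    """
    if should_query_expert(probs, threshold):
        return int(expert_action), True
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    return int(rng.choice(len(p), p=p)), False

# Usage: an uncertain (uniform) policy triggers a query, a confident one does not.
rng = np.random.default_rng(0)
uncertain = [0.25, 0.25, 0.25, 0.25]
confident = [0.97, 0.01, 0.01, 0.01]
print(select_action(uncertain, threshold=1.0, expert_action=2, rng=rng))
print(select_action(confident, threshold=1.0, expert_action=2, rng=rng))
```

Because the gate depends only on the action distribution, a sketch like this could sit in front of any agent with a stochastic policy, which matches the model-agnostic claim in the abstract. How the threshold is chosen and how imperfect expert feedback is handled are not covered here.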
Peer Reviews
Decision·Submitted to ICLR 2025
The setting explored in this paper seems novel to me, though I am not an expert in the bandit space. I wonder if the area of off-policy bandits is relevant here (see https://arxiv.org/abs/2010.12470), since by taking human feedback in some interactions, the bandit is somewhat getting rewards "off-policy". The combination of algorithms and settings explored seems quite thorough and, from what I can tell, some interesting insights can be gleaned about the role of "AR" feedback vs "RM".
1. No theoretical analysis is done in this setting, although such analysis is usual for bandit algorithms, from my limited experience.
2. There is something about the formulation I don't get. It seems the bandit algorithms will be incentivized to maximize entropy as much as possible, in order to get the benefit of as much human feedback as possible (at least for a sufficient amount of expertise from the human). In other words, the formulation does not really assign any cost to the act of getting hum
- The paper is well-written.
- The experiment is fairly thorough.
Issues related to the methodology:
1. The paper didn't cite a few influential works in this area: [1] acquires value annotations for labeling actions; [2] DAgger, which also directly gets experts to perform an action (similar to what this paper has proposed); and [3] APO, which actively selects which data to get trajectory-level preference labels for. I especially consider [1] and [3] relevant to this paper's context.
2. The methodology of simply selecting data points based on the policy's entrop
- The proposed method is simple, easy to implement, and presented in a clear way.
- The setting of incorporating expert feedback/intervention in the classical bandit framework is interesting.
- The paper’s presentation is somewhat unclear, particularly in the problem formulation. It seems to propose a stronger variation of bandit problems where the model can access oracle labels, but this isn’t clearly explained. Section 3.1 could be revised to clarify this new setup.
- With this revised formulation, it would also help to include a theoretical guarantee for the proposed algorithm, addressing how it affects overall regret, the lower bound on the new formulation, and whether th
Taxonomy
Topics: Human-Automation Interaction and Safety · Emotion and Mood Recognition · Mental Health Research Topics
