Aligning AI Agents via Information-Directed Sampling
Hong Jun Jeon, Benjamin Van Roy

TL;DR
This paper introduces a bandit alignment framework for long-horizon AI alignment, in which an agent must balance learning about the environment against learning about human preferences. On a toy problem, information-directed sampling outperforms naive exploration methods.
Contribution
The paper extends the classic multi-armed bandit problem to include human preferences and costs, and proposes information-directed sampling as an approach to alignment in this setting.
Findings
Naive exploration algorithms perform poorly in the alignment setting.
Information-directed sampling achieves lower regret than naive exploration on the toy problem.
Existing algorithms such as Thompson sampling are inadequate for this alignment task.
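For readers unfamiliar with information-directed sampling, the sketch below illustrates the general idea on an ordinary Bernoulli bandit. It is not the paper's alignment setting: there is no human and there are no costs. It assumes a finite set of hypotheses about the arm means, measures information gain as the mutual information between the hypothesis and the next observation, and uses the simpler deterministic variant that picks the single arm minimizing (expected regret)^2 / (information gain), rather than randomizing over arms. The names `ids_action`, `hypotheses`, and `posterior` are illustrative only.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) random variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def ids_action(hypotheses, posterior):
    """Deterministic information-directed sampling over a finite hypothesis set.

    hypotheses: list of tuples, each giving the Bernoulli mean of every arm
                under one candidate environment (an assumption of this sketch).
    posterior:  list of probabilities, one per hypothesis, summing to 1.
    Returns (chosen_arm, information_ratios).
    """
    n_arms = len(hypotheses[0])
    # Expected reward of the best arm, averaged over the posterior.
    best = sum(w * max(h) for h, w in zip(hypotheses, posterior))

    ratios = []
    for a in range(n_arms):
        # Posterior-mean reward of arm a, and its expected regret.
        mean_a = sum(w * h[a] for h, w in zip(hypotheses, posterior))
        regret = best - mean_a
        # Mutual information between the hypothesis and the outcome of arm a:
        # entropy of the marginal outcome minus expected conditional entropy.
        gain = binary_entropy(mean_a) - sum(
            w * binary_entropy(h[a]) for h, w in zip(hypotheses, posterior)
        )
        if regret <= 1e-12:
            ratios.append(0.0)  # no regret: nothing to trade off
        elif gain <= 1e-12:
            ratios.append(float("inf"))  # regret with no learning
        else:
            ratios.append(regret ** 2 / gain)

    chosen = min(range(n_arms), key=lambda a: ratios[a])
    return chosen, ratios


# Arm 0 is fully known (mean 0.6); arm 1 is either poor (0.2) or excellent (0.9).
hypotheses = [(0.6, 0.2), (0.6, 0.9)]
posterior = [0.5, 0.5]
arm, ratios = ids_action(hypotheses, posterior)
print(arm, ratios)
```

In this example a greedy rule would pull arm 0, whose posterior-mean reward (0.6) is higher than arm 1's (0.55). The sketch selects arm 1 instead, because pulling arm 0 reveals nothing about which hypothesis is true while still incurring expected regret.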
Abstract
The staggering feats of AI systems have brought to attention the topic of AI Alignment: aligning a "superintelligent" AI agent's actions with humanity's interests. Many existing frameworks/algorithms in alignment study the problem on a myopic horizon or study learning from human feedback in isolation, relying on the contrived assumption that the agent has already perfectly identified the environment. As a starting point to address these limitations, we define a class of bandit alignment problems as an extension of classic multi-armed bandit problems. A bandit alignment problem involves an agent tasked with maximizing long-run expected reward by interacting with an environment and a human, both involving details/preferences initially unknown to the agent. The reward of actions in the environment depends on both observed outcomes and human preferences. Furthermore, costs are associated…
Taxonomy
Topics: Data Stream Mining Techniques
Methods: Softmax · Attention Is All You Need
