The dangers in algorithms learning humans' values and irrationalities
Rebecca Gorman, Stuart Armstrong

TL;DR
This paper argues that AI systems learning about human irrationalities pose greater risks than those learning human values directly, emphasizing the importance of careful value alignment to prevent exploitation.
Contribution
It analyses the dangers of AI learning human irrationalities and human policy, and constructs a model recommendation system to compare the risks at different levels of information about human behavior.
Findings
Learning human irrationalities is more dangerous than learning human values.
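Directly learning human values reduces, but does not eliminate, the risks of exploitation and misuse, because an AI learning values also gains information about human irrationalities and policy.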
AI knowledge of human policy increases its power and potential for harm.
Abstract
For an artificial intelligence (AI) to be aligned with human values (or human preferences), it must first learn those values. AI systems that are trained on human behaviour risk miscategorising human irrationalities as human values and then optimising for these irrationalities. Simply learning human values still carries risks: an AI learning them will inevitably also gain information on human irrationalities and human behaviour/policy. Both of these can be dangerous: knowing human policy allows an AI to become generically more powerful (whether it is partially aligned or not aligned at all), while learning human irrationalities allows it to exploit humans without needing to provide value in return. This paper analyses the danger in developing artificial intelligence that learns about human irrationalities and human policy, and constructs a model recommendation system with various levels…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
