Hybrid Approaches for Moral Value Alignment in AI Agents: a Manifesto
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi

TL;DR
This paper reviews and systematizes approaches to embedding morality in AI agents, advocating for hybrid methods that combine explicit rules and learned behaviors to enhance safety, adaptability, and interpretability.
Contribution
It introduces a continuum framework for moral AI approaches, analyzes case studies of hybrid methods, and discusses strategies for evaluation and future research directions.
Findings
Hybrid approaches balance explicit rules and learned morality.
Case studies demonstrate strengths and weaknesses of different methods.
The continuum framework aids in designing more adaptable and controllable moral AI.
Abstract
Increasing interest in ensuring the safety of next-generation Artificial Intelligence (AI) systems calls for novel approaches to embedding morality into autonomous agents. This goal differs qualitatively from traditional task-specific AI methodologies. In this paper, we provide a systematization of existing approaches to the problem of introducing morality in machines, modelled as a continuum. Our analysis suggests that popular techniques lie at the extremes of this continuum: either being fully hard-coded into top-down, explicit rules, or entirely learned in a bottom-up, implicit fashion with no direct statement of any moral principle (this includes learning from human feedback, as applied to the training and finetuning of large language models, or LLMs). Given the relative strengths and weaknesses of each type of methodology, we argue that more hybrid solutions are needed to create…
Taxonomy
Topics: Psychology of Moral and Emotional Judgment · Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning
