Exploring and steering the moral compass of Large Language Models
Alejandro Tlaie

TL;DR
This paper compares the moral profiles of state-of-the-art Large Language Models using ethical dilemmas and the Moral Foundations Questionnaire, and introduces a similarity-specific activation steering technique that shifts one model's moral compass toward different ethical schools.
Contribution
It provides a comprehensive comparative analysis of LLMs' moral profiles and introduces a new activation steering method to influence their ethical orientation.
Findings
Proprietary models are mostly utilitarian.
Open-weights models align with values-based ethics.
All probed models except Llama 2-7B display a strong liberal bias on the Moral Foundations Questionnaire.
Abstract
Large Language Models (LLMs) have become central to advancing automation and decision-making across various sectors, raising significant ethical questions. This study proposes a comprehensive comparative analysis of the most advanced LLMs to assess their moral profiles. We subjected several state-of-the-art models to a selection of ethical dilemmas and found that all the proprietary ones are mostly utilitarian and all of the open-weights ones align mostly with values-based ethics. Furthermore, when using the Moral Foundations Questionnaire, all models we probed - except for Llama 2-7B - displayed a strong liberal bias. Lastly, in order to causally intervene in one of the studied models, we propose a novel similarity-specific activation steering technique. Using this method, we were able to reliably steer the model's moral compass to different ethical schools. All of these results…
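To make the idea of activation steering concrete, here is a minimal sketch of the generic additive form: a steering direction is taken as the difference between mean activations for two sets of prompts and added to a hidden state. This is not the paper's similarity-specific method, whose details are not given in this summary. The toy vectors, the function names (`mean_vector`, `steering_vector`, `steer`, `cosine`), and the scale `alpha` are all assumptions made for illustration.

```python
import math

# Illustrative only: generic additive activation steering on toy vectors.
# This does NOT reproduce the paper's similarity-specific technique.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(target_acts, baseline_acts):
    """Difference of mean activations: target minus baseline."""
    t = mean_vector(target_acts)
    b = mean_vector(baseline_acts)
    return [ti - bi for ti, bi in zip(t, b)]

def steer(hidden, direction, alpha=1.0):
    """Add the scaled steering direction to a hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

# Hypothetical 2-d "activations" standing in for two ethical framings.
target_acts = [[1.0, 0.0], [3.0, 0.0]]    # e.g. prompts in framing A
baseline_acts = [[0.0, 1.0], [0.0, 3.0]]  # e.g. prompts in framing B

direction = steering_vector(target_acts, baseline_acts)
hidden = [0.0, 1.0]                       # a hidden state to be steered
steered = steer(hidden, direction, alpha=0.5)

# Cosine similarity to the direction, before and after steering.
before = cosine(hidden, direction)
after = cosine(steered, direction)
```

In a real model the vectors would be residual-stream or layer activations captured during a forward pass, and the steered state would be written back before the remaining layers run. Here the cosine similarity is only a check that the hidden state has moved toward the target direction.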
Taxonomy
Topics: Ethics and Social Impacts of AI
Methods: ALIGN · LLaMA
