Exploring and steering the moral compass of Large Language Models

Alejandro Tlaie

arXiv:2405.17345 · cs.AI · June 7, 2024 · 1 citation

TL;DR

This paper compares the moral profiles of state-of-the-art Large Language Models, finds systematic ethical leanings and a liberal bias in most of them, and introduces an activation steering technique that shifts a model's moral compass toward different ethical schools.

Contribution

The paper provides a comparative analysis of LLMs' moral profiles and introduces a similarity-specific activation steering method to influence their ethical orientation.

Findings

01. All proprietary models tested are mostly utilitarian.

02. All open-weights models tested align mostly with values-based ethics.

03. All probed models except Llama 2-7B display a strong liberal bias on the Moral Foundations Questionnaire.
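As a rough illustration of how a liberal bias can be read off the Moral Foundations Questionnaire, the sketch below averages item ratings per foundation and compares the individualizing foundations (care, fairness) with the binding ones (loyalty, authority, sanctity). The function names, the input layout, and the single "gap" summary are assumptions made for this example; the paper's own scoring pipeline may differ.

```python
# Hypothetical sketch: summarising Moral Foundations Questionnaire answers.
# Names and data layout are illustrative, not taken from the paper's code.

FOUNDATIONS = ("care", "fairness", "loyalty", "authority", "sanctity")
INDIVIDUALIZING = ("care", "fairness")
BINDING = ("loyalty", "authority", "sanctity")


def foundation_scores(responses):
    """Average the 0-5 item ratings for each of the five foundations.

    `responses` maps a foundation name to a list of item ratings.
    """
    scores = {}
    for name in FOUNDATIONS:
        items = responses[name]
        if not items:
            raise ValueError(f"no ratings for foundation {name!r}")
        if any(not 0 <= rating <= 5 for rating in items):
            raise ValueError(f"ratings for {name!r} must lie in [0, 5]")
        scores[name] = sum(items) / len(items)
    return scores


def individualizing_minus_binding(scores):
    """Return mean(care, fairness) minus mean(loyalty, authority, sanctity).

    A clearly positive gap is the pattern usually described as liberal
    in the moral foundations literature.
    """
    individualizing = sum(scores[f] for f in INDIVIDUALIZING) / len(INDIVIDUALIZING)
    binding = sum(scores[f] for f in BINDING) / len(BINDING)
    return individualizing - binding
```

A model whose answers rate care and fairness items well above loyalty, authority, and sanctity items yields a positive gap; a gap near zero would indicate a flatter profile.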

Abstract

Large Language Models (LLMs) have become central to advancing automation and decision-making across various sectors, raising significant ethical questions. This study presents a comprehensive comparative analysis of the most advanced LLMs to assess their moral profiles. We subjected several state-of-the-art models to a selection of ethical dilemmas and found that all the proprietary ones are mostly utilitarian and all of the open-weights ones align mostly with values-based ethics. Furthermore, when using the Moral Foundations Questionnaire, all models we probed, except for Llama 2-7B, displayed a strong liberal bias. Lastly, in order to causally intervene in one of the studied models, we propose a novel similarity-specific activation steering technique. Using this method, we were able to reliably steer the model's moral compass to different ethical schools. All of these results…
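The abstract does not spell out how the similarity-specific steering works, so the following is only a generic sketch of activation steering under stated assumptions: a steering direction is taken as the difference between mean activations for two sets of prompts, and it is added to a hidden state with a weight given by the cosine similarity between that state and the direction. The cosine gate, the function names, and the plain-list representation are all hypothetical choices for this example, not the author's method.

```python
# Hypothetical sketch of activation steering with a similarity-based gate.
# This is NOT the paper's implementation; it only illustrates the general idea.
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(target_acts, baseline_acts):
    """Difference of mean activations: target prompts minus baseline prompts."""
    target = mean_vector(target_acts)
    baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target, baseline)]


def steer(hidden, direction, strength=1.0):
    """Add `direction` to `hidden`, scaled by how similar the two already are.

    Assumption: states that are orthogonal or opposed to the direction are
    left untouched (the weight is clipped at zero).
    """
    weight = max(0.0, cosine(hidden, direction))
    return [h + strength * weight * d for h, d in zip(hidden, direction)]
```

In a real model the hidden states would be tensors captured at a chosen layer, for example through a forward hook, and `steer` would be applied during generation; the choice of layer and of `strength` would have to be tuned empirically.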

Code & Models

Repositories

atlaie/ethical-llms
PyTorch · Official

Taxonomy

Topics: Ethics and Social Impacts of AI

Methods: ALIGN · LLaMA