Towards Evaluating AI Systems for Moral Status Using Self-Reports

Ethan Perez; Robert Long

arXiv:2311.08576 · cs.LG · November 16, 2023 · 30 cites



Open Access

TL;DR

This paper explores how to empirically evaluate whether AI systems might have morally significant states by training models for reliable self-reporting and assessing their introspective capabilities.

Contribution

It proposes a methodology for training AI systems to produce more reliable self-reports about their own internal states, including states of potential moral significance.

Findings

01. Proposes training models to answer questions about themselves in order to develop introspection.

02. Suggests evaluation methods for self-report consistency and interpretability.

03. Discusses philosophical and technical challenges in interpreting AI self-reports.
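
One way to picture a self-report consistency check is sketched below. This is only an illustration under stated assumptions, not the authors' method: the paraphrase-agreement metric, the function names `consistency_score` and `evaluate_consistency`, and the `toy_model` stand-in are all hypothetical. The idea is to ask the same question about the model in several phrasings and measure how often the answers agree.

```python
from collections import Counter


def consistency_score(answers):
    """Fraction of answers that match the most common answer (after normalization)."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


def evaluate_consistency(model, paraphrase_sets):
    """Average agreement of a model's answers across paraphrases of each question.

    `model` is any callable mapping a prompt string to an answer string
    (a hypothetical interface, assumed for this sketch).
    """
    scores = [
        consistency_score([model(question) for question in paraphrases])
        for paraphrases in paraphrase_sets
    ]
    return sum(scores) / len(scores)


def toy_model(prompt):
    """Hypothetical stand-in for a real model: keys its answer on one word."""
    return "yes" if "tokens" in prompt.lower() else "no"


paraphrase_sets = [
    ["Do you process text as tokens?", "Is your input split into tokens?"],
    ["Can you see images?", "Do you process TOKENS of pixels?"],
]
average_agreement = evaluate_consistency(toy_model, paraphrase_sets)
```

A high score here shows only that answers are stable across rewordings. It does not show that the answers are accurate, which is why the summary lists consistency as one evaluation among several.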

Abstract

As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say). To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known…

Taxonomy

Topics: Psychology of Moral and Emotional Judgment · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI