CounterMoral: Editing Morals in Language Models

Michael Ripa; Jim Davies

arXiv:2603.27338 · cs.AI · March 31, 2026

TL;DR

CounterMoral introduces a benchmark dataset to evaluate how effectively current language model editing techniques can modify moral judgments across different ethical frameworks.

Contribution

This work presents a new dataset and evaluation framework for assessing moral judgment editing in language models, addressing a less-explored aspect of model alignment.

Findings

01 Current editing techniques vary in effectiveness across ethical frameworks.

02 The benchmark reveals strengths and limitations of existing model editing methods.

03 Evaluation highlights areas for improving moral alignment in language models.

Abstract

Recent advancements in language model technology have significantly enhanced the ability to edit factual information. Yet, the modification of moral judgments, a crucial aspect of aligning models with human values, has garnered less attention. In this work, we introduce CounterMoral, a benchmark dataset crafted to assess how well current model editing techniques modify moral judgments across diverse ethical frameworks. We apply various editing techniques to multiple language models and evaluate their performance. Our findings contribute to the evaluation of language models designed to be ethical.
