CounterMoral: Editing Morals in Language Models
Michael Ripa, Jim Davies

TL;DR
CounterMoral introduces a benchmark dataset to evaluate how effectively current language model editing techniques can modify moral judgments across different ethical frameworks.
Contribution
This work presents a new dataset and evaluation framework for assessing moral judgment editing in language models, addressing a less-explored aspect of model alignment.
Findings
Current editing techniques vary in effectiveness across ethical frameworks.
The benchmark reveals strengths and limitations of existing model editing methods.
The evaluation highlights areas for improving moral alignment in language models.
Abstract
Recent advancements in language model technology have significantly enhanced the ability to edit factual information. Yet, the modification of moral judgments, a crucial aspect of aligning models with human values, has garnered less attention. In this work, we introduce CounterMoral, a benchmark dataset crafted to assess how well current model editing techniques modify moral judgments across diverse ethical frameworks. We apply various editing techniques to multiple language models and evaluate their performance. Our findings contribute to the evaluation of language models designed to be ethical.
