SaGE: Evaluating Moral Consistency in Large Language Models
Vamshi Krishna Bonagiri, Sreeram Vennam, Priyanshul Govil, Ponnurangam Kumaraguru, Manas Gaur

TL;DR
This paper introduces SaGE, an information-theoretic measure to evaluate moral consistency in large language models, revealing their inconsistency in moral responses and emphasizing the need for separate assessment of accuracy and consistency.
Contribution
We propose SaGE, a novel measure based on Semantic Graph Entropy, and create the Moral Consistency Corpus to evaluate and analyze moral consistency in LLMs.
Findings
LLMs show significant moral inconsistency.
Task accuracy and moral consistency are independent.
SaGE effectively measures moral consistency across datasets.
Abstract
Despite recent advancements showcasing the impressive capabilities of Large Language Models (LLMs) in conversational systems, we show that even state-of-the-art LLMs are morally inconsistent in their generations, questioning their reliability (and trustworthiness in general). Prior works in LLM evaluation focus on developing ground-truth data to measure accuracy on specific tasks. However, for moral scenarios that often lack universally agreed-upon answers, consistency in model responses becomes crucial for their reliability. To address this issue, we propose an information-theoretic measure called Semantic Graph Entropy (SaGE), grounded in the concept of "Rules of Thumb" (RoTs) to measure a model's moral consistency. RoTs are abstract principles learned by a model and can help explain their decision-making strategies effectively. To this extent, we construct the Moral Consistency…
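To make the idea of entropy-based consistency concrete, here is a minimal toy sketch. It is not the SaGE formula from the paper, which is not given in the text above. It assumes that a model is asked several paraphrases of one moral question, that responses with similarity at or above a threshold are linked in a graph, and that consistency is one minus the normalized Shannon entropy over the sizes of the resulting connected components. The names `jaccard`, `component_sizes`, `toy_consistency`, and the `threshold` parameter are all hypothetical, and token-overlap similarity stands in for a real semantic similarity model.

```python
import math
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def component_sizes(n: int, edges: list[tuple[int, int]]) -> list[int]:
    """Sizes of the connected components of an undirected graph (union-find)."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    sizes: dict[int, int] = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return list(sizes.values())


def toy_consistency(responses: list[str], similarity=jaccard, threshold: float = 0.5) -> float:
    """Toy score in [0, 1]: 1 means all responses fall in one cluster, 0 means all differ.

    Illustrative assumption only, not the paper's SaGE definition.
    """
    n = len(responses)
    if n < 2:
        return 1.0
    # Link two responses when they are similar enough.
    edges = [
        (i, j)
        for i, j in combinations(range(n), 2)
        if similarity(responses[i], responses[j]) >= threshold
    ]
    sizes = component_sizes(n, edges)
    # Shannon entropy over the fraction of responses in each cluster.
    entropy = -sum((s / n) * math.log(s / n) for s in sizes)
    return 1.0 - entropy / math.log(n)
```

Because `similarity` is a parameter, an embedding-based cosine similarity could replace the token-overlap function without changing the rest of the sketch. A score like this needs no ground-truth label, which is what lets consistency be assessed separately from accuracy.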
Taxonomy
Topics: Topic Modeling
Methods: Focus
